Tuesday, April 14, 2020

L is for Log Transformation

When visualizing data, outliers and skewed data can have a huge impact, potentially making your visualization difficult to understand. We can use many of the tricks covered so far to deal with those issues, such as using filters to remove extreme values. But what if you want to display all values, even extreme ones? A log transformation is a great option for displaying skewed data.
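
To see why this helps, here is a minimal sketch using made-up values (not the reading dataset): on the raw scale the gaps between values keep growing, while on the log2 scale each doubling becomes one equal step.

```r
# Hypothetical right-skewed values, for illustration only
days <- c(1, 2, 4, 8, 16)

# Raw scale: the gap between neighbouring values keeps growing
diff(days)
## [1] 1 2 4 8

# log2 scale: each doubling is one equal step
log2(days)
## [1] 0 1 2 3 4
diff(log2(days))
## [1] 1 1 1 1
```

That compression of the large values is what keeps a single extreme point from dominating the axis.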

One of the more skewed variables in my reading dataset is read_time. I was able to read many books in a pretty short amount of time (a few days), but others took longer, either because they were long or because I was busy with other things and didn't have as much time to read. Let's take a quick look.
library(tidyverse)
## -- Attaching packages ------------------------------------------- tidyverse 1.3.0 --
## ✓ ggplot2 3.2.1     ✓ purrr   0.3.3
## ✓ tibble  2.1.3     ✓ dplyr   0.8.3
## ✓ tidyr   1.0.0     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.4.0
## -- Conflicts ---------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
reads2019 <- read_csv("~/Downloads/Blogging A to Z/SaraReads2019_allrated.csv", col_names = TRUE)
## Parsed with column specification:
## cols(
##   Title = col_character(),
##   Pages = col_double(),
##   date_started = col_character(),
##   date_read = col_character(),
##   Book.ID = col_double(),
##   Author = col_character(),
##   AdditionalAuthors = col_character(),
##   AverageRating = col_double(),
##   OriginalPublicationYear = col_double(),
##   read_time = col_double(),
##   MyRating = col_double(),
##   Gender = col_double(),
##   Fiction = col_double(),
##   Childrens = col_double(),
##   Fantasy = col_double(),
##   SciFi = col_double(),
##   Mystery = col_double(),
##   SelfHelp = col_double()
## )
library(magrittr)
## 
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
## 
##     set_names
## The following object is masked from 'package:tidyr':
## 
##     extract
reads2019 %$%
  range(read_time)
## [1]  0 25
Read time ranges from 0 days (started and finished on the same day) to 25 days, almost a month. If I created box plots of reading time, I'd likely have some outliers. I'll split the books by my Fantasy genre flag to generate two box plots. To make these data a bit easier to visualize, I'll also change my Fantasy flag into a labeled factor.
reads2019 <- reads2019 %>%
  mutate(Fantasy = factor(Fantasy, labels = c("Non-Fantasy",
                                              "Fantasy"),
                          ordered = TRUE))
reads2019 %>%
  ggplot(aes(Fantasy, read_time)) +
  geom_boxplot()
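
As a quick check on the factor step, here is a small sketch with a hypothetical 0/1 vector standing in for the Fantasy column. It assumes the flag only takes the values 0 and 1, so the labels are assigned in sorted order: 0 becomes "Non-Fantasy" and 1 becomes "Fantasy".

```r
# Hypothetical 0/1 genre flags, for illustration only
flag <- c(0, 1, 1, 0)

# labels are matched to the sorted unique values: 0 first, then 1
genre <- factor(flag, labels = c("Non-Fantasy", "Fantasy"), ordered = TRUE)

levels(genre)
## [1] "Non-Fantasy" "Fantasy"

# The underlying integer codes follow the level order
as.integer(genre)
## [1] 1 2 2 1
```

The level order is also the order in which ggplot places the two boxes along the x-axis.
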
Most of the books were finished within a couple of weeks, but one fantasy book I read took longer. I could drop that value for this figure, since it does appear to be an outlier. But if I'd prefer not to drop an outlier, or if I had multiple long reads mixed in, I could keep all values and use a log transformation to create this display. I can easily make that transformation for my figure with the scales package (run install.packages("scales") first if you don't already have that package installed).
library(scales)
## 
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
## 
##     discard
## The following object is masked from 'package:readr':
## 
##     col_factor
reads2019 %>%
  ggplot(aes(Fantasy, read_time)) +
  geom_boxplot() +
  scale_y_continuous(trans = log2_trans()) +
  ylab("Read Time (in days)") +
  labs(caption = "Because reading time was skewed, data have been log-transformed.")
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 2 rows containing non-finite values (stat_boxplot).
This figure is much easier to understand than the previous one. You can now very easily see the information being presented by the boxes (25th percentile, median, and 75th percentile), without the single point at the top of the y-scale squishing the boxes down. You can also see that the reading time variable is already pretty skewed, even without the outlier. 50% of the books were read in 3 days or less, but the other 50% had a much wider range for reading time. The two halves of the box plot only look equal because of the log transformation. In fact, 25% of the non-fantasy books I read took me between almost 1 week and 2 weeks to read. One caveat: the two warnings appear because the log of 0 is negative infinity, so the 2 books with a read time of 0 days are dropped from this figure.
Just to show exactly what the log-transformation is doing, here's another version of the figure with each day as a break.
reads2019 %>%
  ggplot(aes(Fantasy, read_time)) +
  geom_boxplot() +
  scale_y_continuous(trans = log2_trans(), breaks = seq(1,25,1)) +
  ylab("Read Time (in days)") +
  labs(caption = "Because reading time was skewed, data have been log-transformed.")
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 2 rows containing non-finite values (stat_boxplot).
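
Books finished in 0 days cannot be placed on a log2 axis, which is what the warnings above report. If those rows should stay in the figure, one option is the log1p transformation from the scales package, which plots log(1 + x) instead. This is a minimal sketch on a made-up data frame (the toy object and its values are assumptions, not the real reading data), and it swaps in a different transformation from the one used in this post.

```r
library(ggplot2)
library(scales)

# Simulated stand-in for the reading data; the values are made up
toy <- data.frame(
  Fantasy = factor(rep(c("Non-Fantasy", "Fantasy"), each = 3),
                   levels = c("Non-Fantasy", "Fantasy")),
  read_time = c(0, 2, 9, 0, 3, 25)
)

# log2(0) is -Inf, which is why zero-day reads are dropped on a log2 axis
log2(0)
## [1] -Inf

# log1p(x) = log(1 + x) maps 0 to 0, so every row stays finite
log1p(0)
## [1] 0

p <- ggplot(toy, aes(Fantasy, read_time)) +
  geom_boxplot() +
  scale_y_continuous(trans = log1p_trans(), breaks = c(0, 1, 3, 7, 14, 25)) +
  ylab("Read Time (in days)") +
  labs(caption = "Y-axis uses a log(1 + x) transformation so 0-day reads are kept.")
p
```

The trade-off is that log(1 + x) compresses the smallest values a little differently than a plain log does, so the caption should say which transformation was used.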
For tomorrow's post, we'll finally talk about the mutate function!
