Wednesday, April 15, 2020

M is for mutate

Today, we finally talk about the mutate function! I've used it a lot throughout the series so far, so it's nice to get to discuss what it is and how it works.

The mutate function is used anytime you want to create or modify a variable. It works with pretty much any R function that creates or modifies variables, so you can wrap it around code to create factors, compute descriptive statistics with base R (like median, mean, standard deviation), convert a character to a number (or vice versa), compute a date difference, and so on, as well as use any arithmetic or logical operations. You can include multiple new and/or modified variables in a single mutate call, and even create or change variables using ones you created earlier in that same call. Let's demonstrate, again with the reading dataset (if you're playing along at home, once again, you can download that file here).
library(tidyverse)
## -- Attaching packages -------------------------------------------------------------------------------- tidyverse 1.3.0 --
## ✓ ggplot2 3.2.1     ✓ purrr   0.3.3
## ✓ tibble  2.1.3     ✓ dplyr   0.8.3
## ✓ tidyr   1.0.0     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.4.0
## -- Conflicts ----------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
reads2019 <- read_csv("~/Downloads/Blogging A to Z/SaraReads2019_allrated.csv", col_names = TRUE)
## Parsed with column specification:
## cols(
##   Title = col_character(),
##   Pages = col_double(),
##   date_started = col_character(),
##   date_read = col_character(),
##   Book.ID = col_double(),
##   Author = col_character(),
##   AdditionalAuthors = col_character(),
##   AverageRating = col_double(),
##   OriginalPublicationYear = col_double(),
##   read_time = col_double(),
##   MyRating = col_double(),
##   Gender = col_double(),
##   Fiction = col_double(),
##   Childrens = col_double(),
##   Fantasy = col_double(),
##   SciFi = col_double(),
##   Mystery = col_double(),
##   SelfHelp = col_double()
## )
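Before turning to the real data, here is a minimal sketch of those ideas on a tiny made-up data frame. The books data frame and its columns are hypothetical and not part of the reading dataset. It creates new variables, modifies an existing one, and uses a variable created earlier in the same mutate call.

```r
library(dplyr)

# Hypothetical toy data, invented purely for illustration
books <- data.frame(
  title = c("alpha", "beta", "gamma"),
  pages = c(320, 150, 480),
  days  = c(8, 3, 16),
  stringsAsFactors = FALSE
)

books <- books %>%
  mutate(
    pages_per_day = pages / days,        # arithmetic: create a new variable
    long_book     = pages >= 300,        # logical operation
    fast_read     = pages_per_day > 40,  # uses pages_per_day, created two lines up
    title         = toupper(title)       # modify an existing variable in place
  )

print(books)
```

Arguments are evaluated in order, which is why fast_read can refer to pages_per_day. The rest of this post goes back to the reads2019 data loaded above.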
Using this dataset, I'm going to create and modify multiple variables - I'll turn some of my genre flags into factors, which aids in data visualization; convert my start and finish dates from character to date variables; extract the day of the week I started and finished each book; and label books based on whether they were recently published (in the last five years) or older.
library(lubridate)
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##     date
reads2019 <- reads2019 %>%
  mutate(Fiction = factor(Fiction,
                          levels = c(0,1),
                          labels = c("Non-Fiction", "Fiction")),
  Fantasy = factor(Fantasy, levels = c(0,1),
                   labels = c("Non-Fantasy", "Fantasy")),
  SciFi = factor(SciFi,
                 levels = c(0,1),
                 labels = c("Non-Science Fiction", "Science Fiction")),
  date_started = as.Date(date_started, format = '%m/%d/%Y'),
  date_read = as.Date(date_read, format = '%m/%d/%Y'),
  StartDay = wday(date_started, label = TRUE, abbr = FALSE),
  FinishDay = wday(date_read, label = TRUE, abbr = FALSE),
  Age = ifelse(OriginalPublicationYear >= 2015, 1, 0),
  # 1 = published 2015 or later (new), 0 = published before 2015 (older)
  Age = factor(Age,
               levels = c(0,1),
               labels = c("Older Publication", "New Publication")))
Notice that I both created the Age variable and turned it into a factor, so it's okay to use a newly created variable later in the same mutate call. I also converted my character dates into date format so that I could go on to extract day of the week (wday, a function from lubridate). In fact, this block of mutate code includes both base R and tidyverse functions. I could even pull in functions from other R packages. Basically any code you used to write as data$variable <- f(data$variable), you can embed in mutate as:

data <- data %>%
      mutate(variable = f(variable))
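
To make that equivalence concrete, here is a small sketch with as.numeric standing in for f. The data frame below is a hypothetical example, not the reading dataset.

```r
library(dplyr)

# Hypothetical data frame: a number stored as character
data <- data.frame(variable = c("1", "2", "3"), stringsAsFactors = FALSE)

# Base R style: data$variable <- f(data$variable)
base_version <- data
base_version$variable <- as.numeric(base_version$variable)

# mutate style: the column is referred to by its bare name
mutate_version <- data %>%
  mutate(variable = as.numeric(variable))

# Compare the resulting columns
all.equal(base_version$variable, mutate_version$variable)
```

Inside mutate you refer to columns by their bare names, so there is no need for the data$ prefix.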

Sure, it seems more verbose when you're doing one quick change, but changing a dataset often involves multiple variables, and may require you to combine mutate with filter or group_by. That's where the tidyverse approach becomes cleaner and more powerful.
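
As a rough sketch of what that combination can look like, the example below filters rows, groups them, and then uses mutate to compute a per-group value. The data frame and its column names are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data, invented purely for illustration
books <- data.frame(
  genre  = c("Fantasy", "Fantasy", "Mystery", "Mystery", "Mystery"),
  pages  = c(400, 600, 250, 300, 350),
  rating = c(5, 3, 4, 2, 5),
  stringsAsFactors = FALSE
)

result <- books %>%
  filter(rating >= 3) %>%      # keep only books rated 3 or higher
  group_by(genre) %>%          # later steps are computed within each genre
  mutate(genre_mean_pages = mean(pages),
         pages_vs_genre   = pages - genre_mean_pages) %>%
  ungroup()                    # drop the grouping when done

print(result)
```

Because the data is grouped, mean(pages) is computed separately for each genre, and every row keeps its own value alongside the group average.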

Things are heating up this week at work and I'm dangerously behind on blog posts - I wrote a bunch in advance, but got busy and am now writing this post only the evening before it goes live. Hoping to do some catchup later tonight, but Thursday's post (and/or later posts in the series) may be slightly delayed. Don't worry! We'll make it through A to Z somehow!
