Saturday, April 4, 2020

D is for dummy_cols

For the letter D, I'm going to talk about the dummy_cols function, which isn't actually part of the tidyverse, but hey: my posts, my rules. This function is incredibly useful for creating dummy variables, which are used in a variety of ways, including multiple regression with categorical variables. Linear regression assumes that both the predictor and outcome variables are numeric, so to include a categorical variable, you need to convert it to numbers. If the categories don't fall on a meaningful numeric scale, you instead create dummy variables (0/1 indicators) to represent the different categories. If I had three levels on a categorical variable, I'd need 2 dummy variables: one to delineate category 1 from the other 2, and another to delineate category 2, with the third category being represented by 0s on both dummy variables.
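To make the three-level case concrete, here's a minimal sketch in base R. The category labels are made up for illustration, and I'm using model.matrix rather than dummy_cols just to show the "categories minus 1" coding on its own.

```r
# Hypothetical three-level categorical variable (labels are illustrative)
category <- factor(c("cat1", "cat2", "cat3", "cat1", "cat3"))

# Make the third category the reference level, so it is coded as 0s
category <- relevel(category, ref = "cat3")

# model.matrix builds the 0/1 dummy columns; drop the intercept column
dummies <- model.matrix(~ category)[, -1, drop = FALSE]

# Two dummy columns for three categories; rows for cat3 are 0 on both
data.frame(category, dummies)
```

This is the same coding lm() applies behind the scenes when you give it a factor as a predictor.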

There are, of course, other uses for dummy variables. For instance, at work, I was examining unique users of our testing system by time of day. Our system creates a row for every action by the user, with a time stamp. If I simply generated counts of these rows during spans of time, I would get a count of actions per hour by users (clicks, highlights, etc.), rather than individual users logged in during a given hour. So I created dummy codes by hour of day, then aggregated by unique user identifier. This was how I could generate accurate counts of how many users were online during a given hour.
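Here's a rough sketch of that unique-users idea. The action log, its column names (user_id, hour), and the values are all invented for illustration; this isn't the actual work data or code, just the same flag-then-aggregate pattern.

```r
library(dplyr)
library(fastDummies)

# Hypothetical action log: one row per user action (made-up data)
actions <- tibble(
  user_id = c("u1", "u1", "u1", "u2", "u2", "u3"),
  hour    = c("09", "09", "10", "09", "09", "10")
)

# One 0/1 flag column per hour of day (hour_09, hour_10)
flags <- dummy_cols(actions, select_columns = "hour")

# Collapse to one row per user: a flag is 1 if the user did anything that hour
per_user <- flags %>%
  group_by(user_id) %>%
  summarise(across(starts_with("hour_"), max))

# Summing the flags counts unique users per hour, not actions per hour
users_per_hour <- colSums(select(per_user, starts_with("hour_")))
users_per_hour
```

Counting rows instead would report four actions in the 09 hour, even though only two different users were online then.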

To apply this procedure to the reading dataset, I used the dummy_cols function to create dummy variables (or flags) for genre. I created a long-form dataset of the top genres for each title, which you can download here. For simplicity, this file only contains Book.ID, title, and genre (with a separate entry for each genre, so some books have a single row, for one genre, and others have multiple rows, to reflect multiple genres).

library(tidyverse)

## -- Attaching packages ------------------------------------------- tidyverse 1.3.0 --
## ✓ ggplot2 3.2.1     ✓ purrr   0.3.3
## ✓ tibble  2.1.3     ✓ dplyr   0.8.3
## ✓ tidyr   1.0.0     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.4.0
## -- Conflicts ---------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

longreads2019 <- read_csv("~/Downloads/Blogging A to Z/reads2019_long.csv")

## Parsed with column specification:
## cols(
##   Book.ID = col_double(),
##   Title = col_character(),
##   genre = col_character()
## )

I can use the dummy_cols function to create the genre flags, which I can then aggregate down to one row per book and merge into the reads2019 file (I've created a version without genre flags, available here). For this function, you'll need the fastDummies package (so run install.packages("fastDummies") once before the rest of the code). Because the number of dummy variables is typically equal to the number of categories minus 1, the function can drop the first dummy variable from the final file through its remove_first_dummy argument. Since I'm using these as flags rather than as regression dummy variables, I want a column for every genre, so I set remove_first_dummy = FALSE. (As far as I can tell, FALSE is also the default in current fastDummies releases, so spelling it out mainly makes the intent explicit.)

library(fastDummies)

# Create one 0/1 flag column per genre (genre_Fiction, genre_Fantasy, ...)
genres <- longreads2019 %>%
  dummy_cols(select_columns = "genre", remove_first_dummy = FALSE)

# Collapse to one row per book: a flag is 1 if any of the book's rows has that genre
genres <- genres %>%
  group_by(Book.ID) %>%
  summarise(Fiction = max(genre_Fiction),
            Childrens = max(genre_Childrens),
            Fantasy = max(genre_Fantasy),
            SciFi = max(genre_SciFi),
            Mystery = max(genre_Mystery),
            SelfHelp = max(genre_SelfHelp))
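
If you want to see what remove_first_dummy actually changes, here's a small sketch on a toy data frame. The toy data is made up; only the Book.ID and genre column names mirror the file above.

```r
library(fastDummies)

# Toy long-form data: book 1 has two genres, books 2 and 3 have one each
toy <- data.frame(
  Book.ID = c(1, 1, 2, 3),
  genre   = c("Fantasy", "Fiction", "Fiction", "Mystery"),
  stringsAsFactors = FALSE
)

# Keep a flag for every genre (what I want for flags)
all_flags <- dummy_cols(toy, select_columns = "genre",
                        remove_first_dummy = FALSE)
names(all_flags)

# Drop the first genre's flag (what you'd want for regression dummies)
dropped <- dummy_cols(toy, select_columns = "genre",
                      remove_first_dummy = TRUE)
names(dropped)
```

The second version has one fewer genre_ column, which is fine for a regression but would leave one genre without a flag.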

reads2019 <- read_csv("~/Downloads/Blogging A to Z/ReadsNoGenre.csv",
                      col_names = TRUE)

## Parsed with column specification:
## cols(
##   Title = col_character(),
##   Pages = col_double(),
##   date_started = col_character(),
##   date_read = col_character(),
##   Book.ID = col_double(),
##   Author = col_character(),
##   AdditionalAuthors = col_character(),
##   AverageRating = col_double(),
##   OriginalPublicationYear = col_double(),
##   read_time = col_double(),
##   MyRating = col_double(),
##   Gender = col_double(),
##   NewRating = col_double(),
##   FinalRating = col_double()
## )

reads2019 <- reads2019 %>%
  left_join(genres, by = "Book.ID")
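
One thing to watch for with left_join: any book without a row in the genre file would end up with NA in the flag columns rather than 0. I haven't checked whether that happens in this dataset, so treat the following as a sketch on made-up data showing how you could fill those in if it does.

```r
library(dplyr)

# Made-up books and genre flags; book 3 has no genre information
books <- tibble(Book.ID = c(1, 2, 3), Title = c("A", "B", "C"))
flags <- tibble(Book.ID = c(1, 2), Fiction = c(1, 0), Fantasy = c(0, 1))

# Join the flags on, then turn missing flags into 0s
joined <- books %>%
  left_join(flags, by = "Book.ID") %>%
  mutate(across(c(Fiction, Fantasy), ~ coalesce(.x, 0)))

joined
```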

I know I've sprinkled in other tidyverse functions in these posts, such as group_by and summarise. Don't worry! I'll post more about those functions in this series - stay tuned!

1 comment:

  1. Sara .... I finally understand Dummy Variables! Thank you.
