There are, of course, other uses for dummy variables. For instance, at work, I was examining unique users of our testing system by time of day. Our system creates a row for every action by the user, with a time stamp. If I simply generated counts of these rows during spans of time, I would get a count of actions per hour by users (clicks, highlights, etc.), rather than individual users logged in during a given hour. So I created dummy codes by hour of day, then aggregated by unique user identifier. This was how I could generate accurate counts of how many users were online during a given hour.
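Here is a minimal sketch of that approach on made-up data. The `actions` table and its `user_id` and `hour` columns are hypothetical stand-ins, not the real testing system's log; the sketch only illustrates the flag-then-aggregate idea.

```r
library(dplyr)
library(fastDummies)

# Hypothetical action log: one row per user action, tagged with the hour it occurred
actions <- tibble(
  user_id = c("u1", "u1", "u1", "u2", "u2", "u3"),
  hour    = c("09", "09", "10", "09", "09", "10")
)

# One 0/1 flag column per hour of day (hour_09, hour_10)
flags <- dummy_cols(actions, select_columns = "hour")

# Collapse to one row per user: a flag is 1 if the user had any action in that hour
per_user <- flags %>%
  group_by(user_id) %>%
  summarise(across(starts_with("hour_"), max))

# Summing each flag column counts unique users, not actions, per hour
users_per_hour <- per_user %>%
  summarise(across(starts_with("hour_"), sum))
```

In this toy log, hour 09 has four action rows but only two distinct users, which is the gap the flags close. For this particular question, grouping by `hour` and summarising with `n_distinct(user_id)` should give the same counts without creating flags at all.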
To apply this procedure to the reading dataset, I used the dummy_cols function to create dummy variables (or flags) for genre. I created a long-form dataset of the top genres for each title, which you can download here. For simplicity, this file only contains Book.ID, title, and genre (with a separate entry for each genre, so some books have a single row, for one genre, and others have multiple rows, to reflect multiple genres).
library(tidyverse)
longreads2019 <- read_csv("~/Downloads/Blogging A to Z/reads2019_long.csv")
I can use the dummy_cols function to create the genre flags, which I can then aggregate down and merge into the reads2019 file (I've created a version without genre flags, available here). For this function, you'll need the fastDummies package (so add install.packages("fastDummies") before the rest of the code). Also, since the number of dummy code variables is typically equal to the number of categories minus 1, the function has an option, remove_first_dummy, to drop the first dummy variable from the final file. Since I'm using these as flags rather than dummy variables, I want to keep a column for every genre, so I explicitly set remove_first_dummy = FALSE.
library(fastDummies)
genres <- longreads2019 %>%
  dummy_cols(select_columns = "genre", remove_first_dummy = FALSE)
genres <- genres %>%
  group_by(Book.ID) %>%
  summarise(Fiction = max(genre_Fiction),
            Childrens = max(genre_Childrens),
            Fantasy = max(genre_Fantasy),
            SciFi = max(genre_SciFi),
            Mystery = max(genre_Mystery),
            SelfHelp = max(genre_SelfHelp))
reads2019 <- read_csv("~/Downloads/Blogging A to Z/ReadsNoGenre.csv", col_names = TRUE)
reads2019 <- reads2019 %>% left_join(genres, by = "Book.ID")
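To see what the flag, aggregate, and join steps do without downloading the files, here is a small sketch on invented data. The `toy_long` and `toy_books` tables and their contents are hypothetical, and only three genres are used.

```r
library(dplyr)
library(fastDummies)

# Hypothetical long-form data: one row per book-genre pair
toy_long <- tibble(
  Book.ID = c(1, 1, 2, 3, 3),
  genre   = c("Fiction", "Fantasy", "Mystery", "Fiction", "Mystery")
)

# Create one flag per genre, then take the max within each book so that
# a book's flag is 1 if any of its rows carried that genre
toy_flags <- toy_long %>%
  dummy_cols(select_columns = "genre", remove_first_dummy = FALSE) %>%
  group_by(Book.ID) %>%
  summarise(across(starts_with("genre_"), max))

# Hypothetical book-level file; book 4 has no rows in toy_long
toy_books <- tibble(Book.ID = c(1, 2, 3, 4),
                    Title   = c("A", "B", "C", "D"))

# left_join keeps every book and attaches its flags by Book.ID
toy_joined <- toy_books %>%
  left_join(toy_flags, by = "Book.ID")
```

Because left_join keeps every row of the book-level table, a book with no rows in the long-form genre file (book 4 here) ends up with NA rather than 0 in each flag column, which is worth checking for after the merge.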
I know I've sprinkled in other tidyverse functions in these posts, such as group_by and summarise. Don't worry! I'll post more about those functions in this series - stay tuned!
Sara .... I finally understand Dummy Variables! Thank you.