Deeply Trivial: R statistical package


Monday, August 10, 2020

TV Shows on the "Big 3" Streaming Services

2020 has been a tough year, and I've been doing my best to keep busy (and distracted from all the insanity - both at the personal and worldwide levels). Earlier this year, I took a course in machine learning techniques and have been working on applying those techniques to work datasets, as well as fun sets through Kaggle.com.

Today, I thought I'd share another dataset I discovered through Kaggle: TV shows available on one or more streaming services (Netflix, Hulu, Prime, and Disney+). There are lots of fun things we could do with this dataset. Let's start with some basic visualization and summarization.

setwd("~/Dropbox")

library(tidyverse)

## ── Attaching packages ────────────────────────────────────────────────────────── tidyverse 1.3.0 ──

## ✓ ggplot2 3.3.0     ✓ purrr   0.3.4
## ✓ tibble  3.0.0     ✓ dplyr   0.8.5
## ✓ tidyr   1.0.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0

## ── Conflicts ───────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Shows <- read_csv("tv_shows.csv")

## Warning: Missing column names filled in: 'X1' [1]

## Parsed with column specification:
## cols(
##   X1 = col_double(),
##   Title = col_character(),
##   Year = col_double(),
##   Age = col_character(),
##   IMDb = col_double(),
##   `Rotten Tomatoes` = col_character(),
##   Netflix = col_double(),
##   Hulu = col_double(),
##   `Prime Video` = col_double(),
##   `Disney+` = col_double(),
##   type = col_double()
## )

First, we can do some basic summaries, such as how many shows in the dataset are on each of the streaming services.

Counts <- Shows %>%
  summarise(Netflix = sum(Netflix),
            Hulu = sum(Hulu),
            Prime = sum(`Prime Video`),
            Disney = sum(`Disney+`)) %>%
  pivot_longer(cols = Netflix:Disney,
               names_to = "Service",
               values_to = "Count")

Counts %>%
  ggplot(aes(Service,Count)) +
  geom_col()
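If the summarise-then-pivot step is hard to picture, here is a minimal sketch on a made-up table. The toy_shows data and its values are invented for illustration; only the pattern matches the code above.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for Shows (hypothetical data): one row per show,
# with a 0/1 flag for each service
toy_shows <- tibble(
  Title   = c("A", "B", "C", "D"),
  Netflix = c(1, 0, 1, 1),
  Hulu    = c(0, 1, 1, 0)
)

# Summing the 0/1 flags gives one wide row of counts;
# pivot_longer() turns that row into one row per service
toy_counts <- toy_shows %>%
  summarise(Netflix = sum(Netflix),
            Hulu = sum(Hulu)) %>%
  pivot_longer(cols = everything(),
               names_to = "Service",
               values_to = "Count")

toy_counts
```

The long format is what lets ggplot map Service to the x-axis and Count to the bar height.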

The biggest selling point of Disney+ is to watch their movies, though the few TV shows they offer can't really be viewed elsewhere (e.g., The Mandalorian). For the sake of simplicity, we'll drop Disney+, and focus on the big 3 services for TV shows.

The dataset also contains an indicator of recommended age, which we can plot.

Shows <- Shows %>%
  mutate(Age = factor(Age,
                      # levels (not labels), so each rating keeps its own name
                      levels = c("all",
                                 "7+",
                                 "13+",
                                 "16+",
                                 "18+"),
                      ordered = TRUE))

Shows %>%
  ggplot(aes(Age)) +
  geom_bar()

Many are 'NA' for age, though it isn't clear why. Are these older shows, added before these streaming services were required to add guidance on these issues? Is this issue seen more for a particular streaming site? Let's find out.

Shows %>%
  group_by(Age) %>%
  summarise(Count = n(),
            Year_min = min(Year),
            Year_max = max(Year),
            # divide by each service's total number of shows
            Prime = sum(`Prime Video`)/2144,
            Netflix = sum(Netflix)/1931,
            Hulu = sum(Hulu)/1754)

## Warning: Factor `Age` contains implicit NA, consider using
## `forcats::fct_explicit_na`

## # A tibble: 6 x 7
##   Age   Count Year_min Year_max    Prime Netflix   Hulu
##   <ord> <int>    <dbl>    <dbl>    <dbl>   <dbl>  <dbl>
## 1 all     545     1932     2020 0.0896   0.0886  0.0906
## 2 7+      848     1943     2020 0.104    0.155   0.208 
## 3 13+       4     1995     2003 0.000466 0.00155 0     
## 4 16+    1018     1955     2020 0.0975   0.206   0.293 
## 5 18+     750     1980     2020 0.0849   0.186   0.136 
## 6 <NA>   2446     1901     2020 0.623    0.363   0.272
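Since the grouping above depends on how Age was turned into a factor, it helps to see how factor() treats labels versus levels. This is a minimal sketch on a made-up vector (age is hypothetical, not the real column): with only labels, R pairs them with the sorted unique values, while levels matches each value by name.

```r
# Hypothetical age ratings, in no particular order
age <- c("7+", "all", "16+", "13+", "18+", NA)

# labels only: the labels are paired with the sorted unique values
# ("13+", "16+", "18+", "7+", "all"), so every value gets renamed
by_labels <- factor(age,
                    labels = c("all", "7+", "13+", "16+", "18+"),
                    ordered = TRUE)

# levels: each value keeps its own name and the order is the one stated
by_levels <- factor(age,
                    levels = c("all", "7+", "13+", "16+", "18+"),
                    ordered = TRUE)

# Side-by-side comparison of the original value and both factors
data.frame(age,
           by_labels = as.character(by_labels),
           by_levels = as.character(by_levels))
```

In the comparison, "7+" shows up as "16+" in the by_labels column, which is why levels is the safer choice when the values already carry the names you want.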

It seems the biggest "offender" for missing age information is Prime - about 62% of the shows don't have an age indicator. More surprising, though, is the minimum year for some of these categories. I'm no expert in the history of TV, but I don't think any shows were being broadcast in 1901. What are these outliers?

YearOutliers <- Shows %>%
  filter(Year < 1940)

list(YearOutliers$Title)

## [[1]]
## [1] "Born To Explore"                    "The Three Stooges"                 
## [3] "The Little Rascals Classics"        "Space: The New Frontier"           
## [5] "Gods & Monsters with Tony Robinson" "History of Westinghouse"           
## [7] "Betty Boop"

Four of these entries are clearly in error - these are newer shows. This isn't important at the moment, but it's interesting nonetheless.

In terms of getting the most "bang for your buck," Amazon Prime has the most shows to offer (though if you're looking for data on recommended age for the TV show, Prime has the most missingness). But Hulu and Netflix, in terms of volume, are pretty comparable to Prime. What can be said about the quality of content on each of the 3?

The dataset offers some indicators of quality: IMDb rating and Rotten Tomatoes score. How do the 3 services measure up on these indicators?

Netflix <- Shows %>%
  filter(Netflix == 1) %>%
  select(IMDb, `Rotten Tomatoes`) %>%
  mutate(Service = "Netflix")

Hulu <- Shows %>%
  filter(Hulu == 1) %>%
  select(IMDb, `Rotten Tomatoes`) %>%
  mutate(Service = "Hulu")

Prime <- Shows %>%
  filter(`Prime Video` == 1) %>%
  select(IMDb, `Rotten Tomatoes`) %>%
  mutate(Service = "Prime")

BigThree <- rbind(Netflix, Hulu, Prime)

BigThree <- BigThree %>%
  mutate(RotTom = as.numeric(sub("%","",`Rotten Tomatoes`))/100)

BigThree %>%
  ggplot(aes(Service, IMDb)) +
  geom_boxplot()

## Warning: Removed 1194 rows containing non-finite values (stat_boxplot).

library(scales)

## 
## Attaching package: 'scales'

## The following object is masked from 'package:purrr':
## 
##     discard

## The following object is masked from 'package:readr':
## 
##     col_factor

BigThree %>%
  ggplot(aes(Service, RotTom)) +
  geom_boxplot() +
  scale_y_continuous(labels = percent)

## Warning: Removed 4772 rows containing non-finite values (stat_boxplot).
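Those warnings come from missing scores, so it can be useful to count how many shows are actually rated before comparing services. Here's a rough sketch on invented data (toy_scores and its values are made up); it uses readr's parse_number() as an alternative to the sub() approach above.

```r
library(dplyr)
library(readr)

# Hypothetical scores in the same "NN%" text format, with one missing value
toy_scores <- tibble(
  Service = c("Netflix", "Netflix", "Hulu", "Hulu"),
  `Rotten Tomatoes` = c("90%", NA, "80%", "60%")
)

toy_summary <- toy_scores %>%
  # parse_number() drops the % sign and keeps NA as NA
  mutate(RotTom = parse_number(`Rotten Tomatoes`) / 100) %>%
  group_by(Service) %>%
  summarise(Rated = sum(!is.na(RotTom)),             # shows with a score
            Median = median(RotTom, na.rm = TRUE))   # ignore missing scores

toy_summary
```

Reporting the number of rated shows next to the median makes it easier to judge how much weight each boxplot deserves.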

It doesn't appear the 3 streaming services differ too much in terms of quality. But there's more analysis we can do of this dataset. More later.

Thursday, June 25, 2020

Flying Saucers and Bright Lights: A Data Visualization

UFO Sightings by Shape and Year

Last week, I taught part 2 of a course on using R and tidyverse for my work colleagues. I wanted a fun dataset to use as an example for coding exercises throughout. There was really only one choice.

I found this great dataset through kaggle.com - UFO sightings reported to the National UFO Reporting Center (NUFORC) through 2014. This dataset gave lots of variables we could play around with, and I'd like to use it in a future session with my colleagues to talk about the process of cleaning data.

If you're interested in learning more about R and tidyverse, you can access my slides from the sessions here. (We stopped at filtering and picked up there for part 2, so everything is in one PowerPoint file.)

While working with the dataset to plan my learning sessions, I started playing around and thought it would be fun to show the various shapes of UFOs reported over time, to see if there were any shifts. Spoiler: There were. But I needed to clean the data a bit first.

setwd("~/Downloads/UFO Data")
library(tidyverse)

## -- Attaching packages ------------------------------------------- tidyverse 1.3.0 --

## v ggplot2 3.3.1     v purrr   0.3.4
## v tibble  3.0.1     v dplyr   1.0.0
## v tidyr   1.1.0     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0

## -- Conflicts ---------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

options(scipen = 999)

UFOs <- read_csv("UFOsightings.csv", col_names = TRUE)

## Parsed with column specification:
## cols(
##   datetime = col_character(),
##   city = col_character(),
##   state = col_character(),
##   country = col_character(),
##   shape = col_character(),
##   `duration (seconds)` = col_double(),
##   `duration (hours/min)` = col_character(),
##   comments = col_character(),
##   `date posted` = col_character(),
##   latitude = col_double(),
##   longitude = col_double()
## )

## Warning: 4 parsing failures.
##   row                col               expected   actual               file
## 27823 duration (seconds) no trailing characters `        'UFOsightings.csv'
## 35693 duration (seconds) no trailing characters `        'UFOsightings.csv'
## 43783 latitude           no trailing characters q.200088 'UFOsightings.csv'
## 58592 duration (seconds) no trailing characters `        'UFOsightings.csv'

There are 30 shapes represented in the data. That's a lot to show in a single figure.

UFOs %>%
  summarise(shapes = n_distinct(shape))

## # A tibble: 1 x 1
##   shapes
##    <int>
## 1     30

If we look at the different shapes in the data, we can see some overlap, as well as shapes with low counts.

UFOs %>%
  group_by(shape) %>%
  summarise(count = n())

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 30 x 2
##    shape    count
##    <chr>    <int>
##  1 changed      1
##  2 changing  1962
##  3 chevron    952
##  4 cigar     2057
##  5 circle    7608
##  6 cone       316
##  7 crescent     2
##  8 cross      233
##  9 cylinder  1283
## 10 delta        7
## # ... with 20 more rows

For instance, "changed" only appears in one record. But "changing," which appears in 1,962 records, should be grouped with "changed." After inspecting all the shapes, I identified the following categories that accounted for most of the different shapes:

changing, which includes both changed and changing
circles, like disks, domes, and spheres
triangles, like deltas, pyramids, and triangles
four or more sided: rectangles, diamonds, and chevrons
light, which counts things like flares, fireballs, and lights

I also made an "other" category for shapes with very low counts that didn't seem to fit in the categories above, like crescents, teardrops, and formations with no further specification of shape. Finally, shape was blank for some records, so I made an "unknown" category. Here's the code I used to recategorize shape.

changing <- c("changed", "changing")
circles <- c("circle", "disk", "dome", "egg", "oval","round", "sphere")
triangles <- c("cone","delta","pyramid","triangle")
fourormore <- c("chevron","cross","diamond","hexagon","rectangle")
light <- c("fireball","flare","flash","light")
other <- c("cigar","cylinder","crescent","formation","other","teardrop")
unknown <- c("unknown", 'NA')

UFOs <- UFOs %>%
  mutate(shape2 = case_when(shape %in% changing ~ "changing",
                            shape %in% circles ~ "circular",
                            shape %in% triangles ~ "triangular",
                            shape %in% fourormore ~ "four+-sided",
                            shape %in% light ~ "light",
                            shape %in% other ~ "other",
                            # anything left, including blank (NA) shapes
                            TRUE ~ "unknown"))

My biggest question mark was cigar and cylinder. They're not really circles, nor do they fall in the four or more sided category. I could create another category called "tubes," but ultimately just put them in other. Using the code above as an example, you could see what happens to the chart if you put them in another category or create one of their own.
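If you want to try the "tubes" idea, here is one way it might look. This is only a sketch: the tubes vector, the "tubular" label, and the toy_ufos data are all hypothetical, and I've cut the other categories down to keep it short.

```r
library(dplyr)

# Made-up records covering a few of the shapes discussed above
toy_ufos <- tibble(shape = c("cigar", "cylinder", "disk", "light", NA))

# Hypothetical new grouping for the tube-like shapes
tubes <- c("cigar", "cylinder")

toy_ufos <- toy_ufos %>%
  mutate(shape2 = case_when(shape %in% tubes ~ "tubular",
                            shape %in% c("disk") ~ "circular",
                            shape %in% c("light") ~ "light",
                            # blank (NA) shapes fall through to unknown
                            TRUE ~ "unknown"))

toy_ufos
```

With the real data, you would also remove cigar and cylinder from the other vector so they aren't claimed by two categories.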

For the chart, I dropped the unknowns.

UFOs <- UFOs %>%
  filter(shape2 != "unknown")

Now, to plot shapes over time, I need to extract date information. The "datetime" variable is currently a character, so I have to convert that to a date. I then pulled out year, so that each point on my figure was the count of that shape observed during a given year.

library(lubridate)

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

UFOs <- UFOs %>%
  mutate(Date2 = as.Date(datetime, format = "%m/%d/%Y"),
         Year = year(Date2))
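For anyone who also wants the time of day, lubridate can parse the full timestamp. This sketch assumes the datetime strings look like month/day/year followed by hours and minutes; the two example values are made up.

```r
library(lubridate)

# Hypothetical timestamps in the assumed "month/day/year hour:minute" format
stamps <- c("10/10/1949 20:30", "6/1/2014 23:00")

# mdy_hm() keeps the time of day as well as the date
parsed <- mdy_hm(stamps)

# year() then works the same way as it does on a Date
years <- year(parsed)

# as.Date() with a format string reads only the date part, as in the code above
dates_only <- as.Date(stamps, format = "%m/%d/%Y")

years
```

For a chart by year, both routes give the same result; the full timestamp only matters if you want to look at hour of day.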

Now we have all the information we need to plot shapes over time, to see if there have been changes. We'll create a summary dataframe by Year and shape2, then create a line chart with that information.

Years <- UFOs %>%
  group_by(Year, shape2) %>%
  summarise(count = n())

## `summarise()` regrouping output by 'Year' (override with `.groups` argument)
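That message is just a note about grouping, not an error. Here is a small sketch of two ways to avoid it, using made-up data (toy_sightings is hypothetical): the .groups argument, which needs dplyr 1.0.0 or later, and count(), which does the group-and-tally in one step.

```r
library(dplyr)

# Made-up sightings: one row per report
toy_sightings <- tibble(
  Year   = c(2000, 2000, 2000, 2001),
  shape2 = c("light", "light", "circular", "light")
)

# Option 1: say explicitly what to do with the grouping
by_summarise <- toy_sightings %>%
  group_by(Year, shape2) %>%
  summarise(count = n(), .groups = "drop")

# Option 2: count() groups, tallies, and ungroups in one call
by_count <- toy_sightings %>%
  count(Year, shape2, name = "count")

by_count
```

Both versions return an ungrouped table, which is usually what you want before plotting.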

library(scales)

## 
## Attaching package: 'scales'

## The following object is masked from 'package:purrr':
## 
##     discard

## The following object is masked from 'package:readr':
## 
##     col_factor

library(ggthemes)

Years %>%
  ggplot(aes(Year, count, color = shape2)) +
  geom_point() +
  geom_line() +
  scale_x_continuous(breaks = seq(1910,2020,10)) +
  scale_y_continuous(breaks = seq(0,3000,500), labels = comma) +
  labs(color = "Object Shape", title = "From Flying Saucers to Bright Lights:\nSightings of UFO Shapes Over Time") +
  ylab("Number of Sightings") +
  theme_economist_white() +
  scale_color_tableau() +
  theme(plot.title = element_text(hjust = 0.5))

Until the mid-90s, the most commonly seen UFO was circular. After that, light shapes became much more common. I'm wondering if this could be explained in part by UFOs in pop culture, moving from the flying saucers of earlier sci-fi to the bright lights without discernible shape in the more recent sci-fi. The third most common shape is our "other" category, which suggests we might want to rethink that one. It could be that some of the shapes within that category are common enough to warrant their own category, while reserving other for those that don't have a good category of their own. Cigar and cylinder, for instance, have high counts and could be put in their own category. Feel free to play around with the data and see what you come up with!

Sunday, May 3, 2020

Statistics Sunday: My 2019 Reading

I've spent the month of April blogging my way through the tidyverse, while using my reading dataset from 2019 as the example. Today, I thought I'd bring many of those analyses and data manipulation techniques together to do a post about my reading habits for the year.

library(tidyverse)

## -- Attaching packages ------------------------------------------- tidyverse 1.3.0 --

## ✓ ggplot2 3.2.1     ✓ purrr   0.3.3
## ✓ tibble  2.1.3     ✓ dplyr   0.8.3
## ✓ tidyr   1.0.0     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.4.0

## -- Conflicts ---------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

reads2019 <- read_csv("~/Downloads/Blogging A to Z/SaraReads2019_allchanges.csv",
                      col_names = TRUE)

## Parsed with column specification:
## cols(
##   Title = col_character(),
##   Pages = col_double(),
##   date_started = col_character(),
##   date_read = col_character(),
##   Book.ID = col_double(),
##   Author = col_character(),
##   AdditionalAuthors = col_character(),
##   AverageRating = col_double(),
##   OriginalPublicationYear = col_double(),
##   read_time = col_double(),
##   MyRating = col_double(),
##   Gender = col_double(),
##   Fiction = col_double(),
##   Childrens = col_double(),
##   Fantasy = col_double(),
##   SciFi = col_double(),
##   Mystery = col_double(),
##   SelfHelp = col_double()
## )

As you recall, I read 87 books last year, by 42 different authors.

reads2019 %>%
  summarise(Books = n(),
            Authors = n_distinct(Author))

## # A tibble: 1 x 2
##   Books Authors
##   <int>   <int>
## 1    87      42

Using summarise, we can get some basic information about each author.

authors <- reads2019 %>%
  group_by(Author) %>%
  summarise(Books = n(),
            Pages = sum(Pages),
            AvgRating = mean(MyRating),
            Oldest = min(OriginalPublicationYear),
            Newest = max(OriginalPublicationYear),
            AvgRT = mean(read_time),
            Gender = first(Gender),
            Fiction = sum(Fiction),
            Childrens = sum(Childrens),
            Fantasy = sum(Fantasy),
            Sci = sum(SciFi),
            Mystery = sum(Mystery))

Let's plot number of books by each author, with the bars arranged by number of books.

authors %>%
  ggplot(aes(reorder(Author, desc(Books)), Books)) +
  geom_col() +
  theme(axis.text.x = element_text(angle = 90)) +
  xlab("Author")

I could simplify this chart quite a bit by only showing authors with 2 or more books in the set, and also by flipping the axes so author can be read along the side.

authors %>%
  mutate(Author = fct_reorder(Author, desc(Author))) %>%
  filter(Books > 1) %>%
  ggplot(aes(reorder(Author, Books), Books)) +
  geom_col() +
  coord_flip() +
  xlab("Author")

Based on this data, I read the most books by L. Frank Baum (which makes sense, because I made a goal to reread all 14 Oz series books), followed by Terry Pratchett (which makes sense, because I love him). The code above is slightly more complex, because when I use coord_flip(), the author names are displayed in reverse alphabetical order. Using the factor reorder code plus the reorder in ggplot allowed me to display the chart in order by number of books, then alphabetically by author.
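Another way to get the same ordering, sketched here on invented data (the names and counts in toy_authors are made up), is to sort the rows first and then freeze that order with fct_inorder(). After coord_flip(), the first factor level lands at the bottom of the chart, so the sort runs from fewest books to most, with ties in reverse alphabetical order.

```r
library(dplyr)
library(forcats)

# Hypothetical authors and book counts
toy_authors <- tibble(
  Author = c("Baker", "Adams", "Clark", "Davis"),
  Books  = c(2, 5, 2, 3)
)

ordered_authors <- toy_authors %>%
  # fewest books first; ties in reverse alphabetical order
  arrange(Books, desc(Author)) %>%
  # lock in the current row order as the factor level order
  mutate(Author = fct_inorder(Author))

levels(ordered_authors$Author)
```

Read from the top of a flipped chart, that would show Adams, Davis, Baker, then Clark: most books first, with the tie broken alphabetically.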

We can also plot average rating by author, which can tell me a little more about how much I like particular authors. Let's plot those for any author who contributed at least 2 books to my dataset.

authors %>%
  filter(Books > 1) %>%
  ggplot(aes(Author, AvgRating)) +
  geom_col() +
  scale_x_discrete(labels=function(x){sub("\\s", "\n", x)}) +
  ylab("Average Rating")

I only read 2 books by Ann Patchett, but I rated both of her books as 5, giving her the highest average rating. If I look only at the authors who contributed more than 2 books, John Scalzi (tied for 3rd most read in 2019) has the highest rating, followed by Terry Pratchett (2nd most read). Obviously, though, I really like any of the authors I read at least 2 books from, because they all have fairly high average ratings. Stephen King is the only one with an average below 4, and that's only because I read Cujo, which I hated (more on that later on in this post).

We can also look at how genre affected ratings. Using the genre labels I generated before, let's plot average rating.

genre <- reads2019 %>%
  group_by(Fiction, Childrens, Fantasy, SciFi, Mystery) %>%
  summarise(Books = n(),
            AvgRating = mean(MyRating)) %>%
  bind_cols(Genre = c("Non-Fiction",
           "General Fiction",
           "Mystery",
           "Science Fiction",
           "Fantasy",
           "Fantasy Sci-Fi",
           "Children's Fiction",
           "Children's Fantasy"))

genre %>%
  ggplot(aes(reorder(Genre, desc(AvgRating)), AvgRating)) +
  geom_col() +
  scale_x_discrete(labels=function(x){sub("\\s", "\n", x)}) +
  xlab("Genre") +
  ylab("Average Rating")

Based on this plot, my favorite genres appear to be fantasy, sci-fi, and especially books with elements of both. No surprises here.
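One caution about the genre table: the labels are attached by position, so they only line up if the summarised rows come out in the expected order. A sturdier option might be to build the label from the flags themselves. This is a sketch on made-up rows (toy_flags is hypothetical), and the flag-to-label rules are my reading of the sorted flag combinations, so treat them as an assumption.

```r
library(dplyr)

# Made-up flag combinations, one per row
toy_flags <- tibble(
  Fiction   = c(0, 1, 1, 1),
  Childrens = c(0, 0, 0, 1),
  Fantasy   = c(0, 0, 1, 1),
  SciFi     = c(0, 0, 1, 0),
  Mystery   = c(0, 0, 0, 0)
)

labelled <- toy_flags %>%
  # conditions are checked top to bottom, so the most specific come first
  mutate(Genre = case_when(
    Fiction == 0                  ~ "Non-Fiction",
    Childrens == 1 & Fantasy == 1 ~ "Children's Fantasy",
    Childrens == 1                ~ "Children's Fiction",
    Fantasy == 1 & SciFi == 1     ~ "Fantasy Sci-Fi",
    Fantasy == 1                  ~ "Fantasy",
    SciFi == 1                    ~ "Science Fiction",
    Mystery == 1                  ~ "Mystery",
    TRUE                          ~ "General Fiction"
  ))

labelled$Genre
```

Because the label is computed row by row, it stays correct even if a genre combination is missing or the rows are sorted differently.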

Let's dig into ratings on individual books. In my filter post, I identified the 25 books I liked the most (i.e., gave them a 5-star rating). What about the books I disliked? The lowest rating I gave was a 2, but it's safe to say I hated those books. And I also probably didn't like the books I rated as 3.

lowratings <- reads2019 %>%
  filter(MyRating <= 3) %>%
  mutate(Rating = case_when(MyRating == 2 ~ "Hated",
                   MyRating == 3 ~ "Disliked")) %>%
  arrange(desc(MyRating), Author) %>%
  select(Title, Author, Rating)

library(expss)

## 
## Attaching package: 'expss'

## The following objects are masked from 'package:stringr':
## 
##     fixed, regex

## The following objects are masked from 'package:dplyr':
## 
##     between, compute, contains, first, last, na_if, recode, vars

## The following objects are masked from 'package:purrr':
## 
##     keep, modify, modify_if, transpose

## The following objects are masked from 'package:tidyr':
## 
##     contains, nest

## The following object is masked from 'package:ggplot2':
## 
##     vars

as.etable(lowratings, rownames_as_row_labels = FALSE)

Title	Author	Rating
The Scarecrow of Oz (Oz, #9)	Baum, L. Frank	Disliked
The Tin Woodman of Oz (Oz, #12)	Baum, L. Frank	Disliked
Herself Surprised	Cary, Joyce	Disliked
The 5 Love Languages: The Secret to Love That Lasts	Chapman, Gary	Disliked
Boundaries: When to Say Yes, How to Say No to Take Control of Your Life	Cloud, Henry	Disliked
Summerdale	Collins, David Jay	Disliked
When We Were Orphans	Ishiguro, Kazuo	Disliked
Bird Box (Bird Box, #1)	Malerman, Josh	Disliked
Oz in Perspective: Magic and Myth in the L. Frank Baum Books	Tuerk, Richard	Disliked
Cujo	King, Stephen	Hated
Just Evil (Evil Secrets Trilogy, #1)	McKeehan, Vickie	Hated

I'm a little surprised at some of this, because several books I rated as 3 I liked and only a few I legitimately didn't like. The 2 books I rated as 2 I really did hate, and probably should have rated as 1 instead. So based on my new understanding of how I've been using (misusing) those ratings, I'd probably update 3 ratings.

reads2019 <- reads2019 %>%
  mutate(MyRating = replace(MyRating,
                            MyRating == 2, 1),
         MyRating = replace(MyRating,
                            Title == "Herself Surprised", 2))

lowratings <- reads2019 %>%
  filter(MyRating <= 2) %>%
  mutate(Rating = case_when(MyRating == 1 ~ "Hated",
                   MyRating == 2 ~ "Disliked")) %>%
  arrange(desc(MyRating), Author) %>%
  select(Title, Author, Rating)

library(expss)

as.etable(lowratings, rownames_as_row_labels = FALSE)

Title	Author	Rating
Herself Surprised	Cary, Joyce	Disliked
Cujo	King, Stephen	Hated
Just Evil (Evil Secrets Trilogy, #1)	McKeehan, Vickie	Hated

There! Now I have a much more accurate representation of the books I actually disliked/hated, and know how I should be rating books going forward to better reflect how I think of the categories. Of the two I hated, Just Evil... was an e-book I won in a Goodreads giveaway that I read on my phone when I didn't have a physical book with me: convoluted storyline, problematic romantic relationships, and a main character who talked about how much her dog was her baby, and yet the dog was forgotten half the time (even left alone for long periods of time while she was off having her problematic relationship) except when the dog's reaction or protection became important to the storyline. The other, Cujo, I reviewed here; while I'm glad I read it, I have no desire to ever read it again.

Let's look again at my top books, but this time, classify them by long genre descriptions from above. I can get that information into my full reading dataset with a join, using the genre flags. Then I can plot the results from that dataset without having to summarize first.

topbygenre <- reads2019 %>%
  left_join(genre, by = c("Fiction","Childrens","Fantasy","SciFi","Mystery")) %>%
  select(-Books, -AvgRating) %>%
  filter(MyRating == 5)

topbygenre %>%
  ggplot(aes(fct_infreq(Genre))) +
  geom_bar() +
  scale_x_discrete(labels=function(x){sub("\\s", "\n", x)}) +
  xlab("Genre") +
  ylab("Books")

This chart helps me to better understand my average rating by genre chart above. Only 1 book with elements of both fantasy and sci-fi was rated as a 5, and the average rating is 4.5, meaning that category holds just 2 books: one rated 5 and one rated 4. It might be a good idea to either filter my genre rating table to categories with more than 1 book, or add the counts as labels to that plot. Let's try the latter.

genre %>%
  ggplot(aes(reorder(Genre, desc(AvgRating)), AvgRating, label = Books)) +
  geom_col() +
  scale_x_discrete(labels=function(x){sub("\\s", "\n", x)}) +
  xlab("Genre") +
  ylab("Average Rating") +
  geom_text(aes(x = Genre, y = AvgRating-0.25), size = 5,
                color = "white")

Let's redo this chart, excluding those genres with only 1 or 2 books represented.

genre %>%
  filter(Books > 2) %>%
  ggplot(aes(reorder(Genre, desc(AvgRating)), AvgRating, label = Books)) +
  geom_col() +
  scale_x_discrete(labels=function(x){sub("\\s", "\n", x)}) +
  xlab("Genre") +
  ylab("Average Rating") +
  geom_text(aes(x = Genre, y = AvgRating-0.25), size = 5,
                color = "white")

While I love both science fiction and fantasy - reading equal numbers of books in those genres - I seem to like science fiction a bit more, based on the slightly higher average rating.

Thursday, April 30, 2020

Z is for Additional Axes

Here we are at the last post in Blogging A to Z! Today, I want to talk about adding additional axes to your ggplot, using the options for fill or color. While these aren't true z-axes in the geometric sense, I think of them as a third, z, axis.

Some of you may be surprised to learn that fill and color are different, and that you could use one or both in a given plot.

Color refers to the outline of the object (bar, pie chart wedge, etc.), while fill refers to the inside of the object. For scatterplots, the default shape doesn't have a fill, so you'd just use color to change the appearance of those points.

Let's recreate the pages read over 2019 chart, but this time, I'll just use fiction books and separate them as either fantasy or other fiction; this divides that dataset pretty evenly in half. Here's how I'd generate the pages read over time separately by those two genre categories.

library(tidyverse)

## -- Attaching packages ------------------------------------------- tidyverse 1.3.0 --

## ✓ ggplot2 3.2.1     ✓ purrr   0.3.3
## ✓ tibble  2.1.3     ✓ dplyr   0.8.3
## ✓ tidyr   1.0.0     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.4.0

## -- Conflicts ---------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

reads2019 <- read_csv("~/Downloads/Blogging A to Z/SaraReads2019_allchanges.csv",
                      col_names = TRUE)

## Parsed with column specification:
## cols(
##   Title = col_character(),
##   Pages = col_double(),
##   date_started = col_character(),
##   date_read = col_character(),
##   Book.ID = col_double(),
##   Author = col_character(),
##   AdditionalAuthors = col_character(),
##   AverageRating = col_double(),
##   OriginalPublicationYear = col_double(),
##   read_time = col_double(),
##   MyRating = col_double(),
##   Gender = col_double(),
##   Fiction = col_double(),
##   Childrens = col_double(),
##   Fantasy = col_double(),
##   SciFi = col_double(),
##   Mystery = col_double(),
##   SelfHelp = col_double()
## )

fantasy <- reads2019 %>%
  filter(Fiction == 1) %>%
  mutate(date_read = as.Date(date_read, format = '%m/%d/%Y'),
         Fantasy = factor(Fantasy, levels = c(0,1),
                          labels = c("Other Fiction",
                                     "Fantasy"))) %>%
  group_by(Fantasy) %>%
  mutate(GenreRead = order_by(date_read, cumsum(Pages))) %>%
  ungroup()

Now I'd just plug that information into my ggplot code, but add a third variable in the aesthetics (aes) for ggplot - color = Fantasy.

library(scales)

## 
## Attaching package: 'scales'

## The following object is masked from 'package:purrr':
## 
##     discard

## The following object is masked from 'package:readr':
## 
##     col_factor

myplot <- fantasy %>%
  ggplot(aes(date_read, GenreRead, color = Fantasy)) +
  geom_point() +
  xlab("Date") +
  ylab("Pages") +
  scale_x_date(date_labels = "%b",
               date_breaks = "1 month") +
  scale_y_continuous(labels = comma, breaks = seq(0,30000,5000)) +
  labs(color = "Genre of Fiction")

This plot uses the default ggplot2 color scheme. I could change those colors, using an existing color scheme, or define my own. Let's make a fivethirtyeight style figure, using their theme for the overall plot, and their color scheme for the genre variable.

library(ggthemes)

## Warning: package 'ggthemes' was built under R version 3.6.3

myplot +
  scale_color_fivethirtyeight() +
  theme_fivethirtyeight()

I can also specify my own colors.

myplot +
  scale_color_manual(values = c("#4b0082","#ffd700")) +
  theme_minimal()

geom_point offers many point shapes; shapes 21-25 let you specify both color (the outline) and fill (the inside). For the rest, only color has an effect.

library(ggpubr)

## Warning: package 'ggpubr' was built under R version 3.6.3

## Loading required package: magrittr

## 
## Attaching package: 'magrittr'

## The following object is masked from 'package:purrr':
## 
##     set_names

## The following object is masked from 'package:tidyr':
## 
##     extract

ggpubr::show_point_shapes()

## Scale for 'y' is already present. Adding another scale for 'y', which will
## replace the existing scale.
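To see color and fill working together on points, here is a minimal sketch using shape 21 (a filled circle with an outline). The toy_points data is made up for illustration.

```r
library(ggplot2)

# Made-up points in two groups
toy_points <- data.frame(
  x = 1:4,
  y = c(2, 4, 3, 5),
  group = c("a", "a", "b", "b")
)

# fill is mapped to the group; color is set to a constant black outline
p <- ggplot(toy_points, aes(x, y, fill = group)) +
  geom_point(shape = 21, color = "black", size = 4, stroke = 1)

p
```

Swapping shape = 21 for one of the other shapes in the chart above would make the fill mapping have no visible effect, which is a quick way to check which aesthetic a shape responds to.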

Of course, you may have plots where changing fill is best, such as on a bar plot. In my summarize example, I created a stacked bar chart of fiction versus non-fiction with author gender as the fill.

reads2019 %>%
  mutate(Gender = factor(Gender, levels = c(0,1),
                         labels = c("Male",
                                    "Female")),
         Fiction = factor(Fiction, levels = c(0,1),
                          labels = c("Non-Fiction",
                                     "Fiction"),
                          ordered = TRUE)) %>%
  group_by(Gender, Fiction) %>%
  summarise(Books = n()) %>%
  ggplot(aes(Fiction, Books, fill = reorder(Gender, desc(Gender)))) +
  geom_col() +
  scale_fill_economist() +
  xlab("Genre") +
  labs(fill = "Author Gender")

Stacking is the default, but I could also have the bars next to each other.

reads2019 %>%
  mutate(Gender = factor(Gender, levels = c(0,1),
                         labels = c("Male",
                                    "Female")),
         Fiction = factor(Fiction, levels = c(0,1),
                          labels = c("Non-Fiction",
                                     "Fiction"),
                          ordered = TRUE)) %>%
  group_by(Gender, Fiction) %>%
  summarise(Books = n()) %>%
  ggplot(aes(Fiction, Books, fill = reorder(Gender, desc(Gender)))) +
  geom_col(position = "dodge") +
  scale_fill_economist() +
  xlab("Genre") +
  labs(fill = "Author Gender")

You can also use fill (or color) with the same variable you used for x or y; that is, instead of having it be a third scale, it could add some color and separation to distinguish categories from the x or y variable. This is especially helpful if you have multiple categories being plotted, because it helps break up the wall of bars. If you do this, I'd recommend choosing a color palette with highly complementary colors, rather than highly contrasting ones; you probably also want to drop the legend, though, since the axis will also be labeled.

genres <- reads2019 %>%
  group_by(Fiction, Childrens, Fantasy, SciFi, Mystery) %>%
  summarise(Books = n())

genres <- genres %>%
  bind_cols(Genre = c("Non-Fiction",
           "General Fiction",
           "Mystery",
           "Science Fiction",
           "Fantasy",
           "Fantasy Sci-Fi",
           "Children's Fiction",
           "Children's Fantasy"))

genres %>%
  filter(Genre != "Non-Fiction") %>%
  ggplot(aes(reorder(Genre, -Books), Books, fill = Genre)) +
  geom_col() +
  xlab("Genre") +
  scale_x_discrete(labels=function(x){sub("\\s", "\n", x)}) +
  scale_fill_economist() +
  theme(legend.position = "none")

If you only have a couple categories and want to draw a contrast, that's when you can use contrasting shades: for instance, at work, when I plot performance on an item, I use red for incorrect and blue for correct, to maximize the contrast between the two performance levels for whatever data I'm presenting.

I hope you enjoyed this series! There's so much more you can do with tidyverse than what I covered this month. Hopefully this has given you enough to get started and sparked your interest to learn more. Once again, I highly recommend checking out R for Data Science.