Deeply Trivial

Noise in the Middle: Movie Review

2021-04-01T20:35:00.004-05:00

I've been on a horror movie kick for a while (as I've said before particularly here and here, I love a good horror movie, and I also think that after the last year+ of insanity, nothing really scares me anymore, or at least fiction doesn't scare me more than reality). I've been checking out every horror movie I can find on my various streaming services and, well, I've definitely watched some crappy ones. Maybe I'll blog about them sometime.

This evening, I watched Noise in the Middle, the story of a grieving widower and his daughter with severe autism, who seek out an experimental treatment (what appears to be transcranial magnetic stimulation therapy or something like it). What he doesn't realize is that the house he rents is haunted by an occult-loving sadist and the ghosts of the children from his poor house/orphanage, whom he bound to the house after their deaths. Or something. It's not completely clear but it apparently involved branding the children with an infinity symbol and also the children killing him and themselves with a fire. Or something.

The concept was promising - although I find the "kid with autism has special powers (in this case, is a conduit and can see spirits)" concept to be problematic, just like I found the "woman with dementia is actually possessed" concept to be problematic in The Taking of Deborah Logan - and the movie started off great. We established the background, got some ominous shots and glowing eyes in dark rooms. We also saw some really interesting symbolic imagery after Emmy's (the daughter) treatments with TMS, very Ring-video type images, which could have been used more fully in connection with the haunted house and the concept but sadly were not. We even had the "person randomly finds occult shop/enthusiast who believes the main character and helps them" trope used for more humorous and uplifting effect.

In the middle, things began to drag and become more convoluted, which I thought might be used to tie in the symbolic imagery from Emmy's sessions, but sadly was not. The end was just a big old mess. It felt like the writer had a great idea, spent lots of time on the beginning, lost steam in the middle, and then had to just finish the damn thing by the end. The movie toyed with so many horror concepts (haunted house starts to bring out the darkness in/infect the father, like The Shining; seemingly random images have more concrete meaning for the mystery, like Ringu/The Ring; grief manifested as a spirit or entity, like The Babadook) but never really fully committed to any of it.

Overall, I'd say don't bother with this one. The beginning made me have high expectations that this movie would be good/meaningful/even a little scary, but I ended up with "WTF did I just watch and why?".

Super Bowl Reactions

2021-02-07T22:58:00.004-06:00

I watched the Super Bowl tonight rooting hard for my Chiefs - I was even wearing a Mahomes jersey (full disclosure: it was supposed to be a Kelce jersey, since he's my favorite player, but due to a royal f-up by the post office, that jersey never arrived, so I got a quick backup jersey for Mahomes, my second favorite player). I was disheartened that my Chiefs lost, but am happy for the Buccs, who made an amazing comeback as a franchise (except you, Tom Brady, I still don't like you).

So my reactions:

America the Beautiful by H.E.R., the National Anthem by Eric Church and Jazmine Sullivan, and the poem by Amanda Gorman were wonderful. Honoring our frontline workers and the message from our President and First Lady - beautiful. Plus the first female ref at the Super Bowl - all the feels.

The penalties were a bit ridiculous, and mostly being called on the Chiefs. It's ballsy to call pass interference on an uncatchable pass. I understand that many penalty calls in football are based on what we scientists call the counterfactual - what would have happened if a condition (such as, a defensive player pushing a receiver out of the way) was not present - but when a call equals free yards or another try at a down, they need to be used thoughtfully. The penalty calls felt very one-sided. Yes, the Buccs finally got their own "unsportsmanlike conduct" call as well as a much-needed "roughing the passer" (but it took 3 guys hitting Mahomes in much the same whiplash way that caused a concussion 2 games ago). They say homefield advantage doesn't exist in playoff games. I beg to differ.

I'm surprised at the hate I'm seeing about the Halftime Show by The Weeknd. I went into halftime knowing a couple songs by him, and finished it as a fan. We're used to these blow-out halftime shows with 3 big-name artists plus 10 high schools worth of marching bands and drill teams on the field, but in COVID-land, that's just not possible. Instead, we got an artist who was able to showcase choral and dancing talent while still respecting social distance and safety. The dancers wrapped in face bandages for "I Can't Feel My Face" was super clever - guys, those were face masks! (NOT JOCK STRAPS, as some have joked.) They were able to have dancers in close quarters wearing face masks in a way that made sense with the song. In fact, they looked so little like face masks that... see jock strap comment. I was super impressed!

In the second half, we saw a bit of the old Chiefs, but sadly not enough to score a single point. The Buccs' defensive line was just too good - I mean, they ran a blitz on every f***ing play, and our offensive line couldn't hold them back long enough to give Mahomes as much time in the pocket as he's used to. This is something to work on for next season. Mahomes is an amazing quarterback but he's used to hanging in the pocket long enough to survey the field, pick his receiver, and pass; let's work on decreasing the time he needs in the pocket. And let's work on an offensive line that can predict how the blitz is going to work and knock those guys down. Yeah, a team that blitzes on every defensive play is unusual, but as we saw tonight, IT HAPPENS! Practice defusing a blitz from every angle.

Also, WHY DIDN'T YOU SHOW US AN INSTANT REPLAY ON THE RANDOM FAN ON THE FIELD?! I wanted to see that again/closer up.

Overall, I'm sad the Chiefs lost and annoyed at the one-sided-ness/overeagerness of the penalties. I enjoyed the game, the commercials (ALL the celebs came out for those, including a Wayne's World call-back with Cardi B???!), and the performances. I'm happy for the Buccs and hopeful for the Chiefs' next season (I mean, winning the Super Bowl last year plus being AFC Champs again this year is nothing to sneeze at). And okay, Tom Brady proved that a quarterback can still be good and (pretty much automatically, because we always honor quarterbacks and ignore the other positions - like how about TE Gronkowski?) be Super Bowl MVP at 43. You're on top, dude; how about you retire?

My Dark Vanessa: Book Review

2021-01-04T15:10:00.000-06:00

Content Warning: sexual abuse, child abuse, and rape

Earlier today, I finished My Dark Vanessa, a debut novel by Kate Elizabeth Russell. The book is told by Vanessa Wye, a young woman who was abused by her boarding school English teacher starting when she was 15. The book spans 17 years, jumping between Vanessa's youth and adulthood. Before I get into my (slightly spoiler-y) review, I want to say: I loved this book, and I also have no desire to ever read it again.

As you can imagine from the title and brief synopsis, this is a difficult book, as we hear everything that happened in the mind of a young woman who was gaslighted into believing she had all the power in situations when she had almost none and at the same time, that she had no power in situations where she could do something to stop the abuse. It's a deep and disturbing dive into the way an older man selects and grooms his victim, changing her thinking and behavior for decades, and convincing her that she's a willing participant, even when she describes very clear dissociation (being outside of herself) during the episodes of abuse - a reaction often seen in victims of child abuse.

The book also digs into two really key issues, that I haven't often seen explored or explored this well: 1. The narrative that women have "feminine gifts" that allow them to have power over a man, and make these men do things they wouldn't otherwise do. And 2. That coming forward is the only responsible thing a victim of such abuse can do.

The first issue (the "power" of femininity to take away men's agency) is such a pervasive part of rape culture. But this book also explores how this narrative has been romanticized to even apply to situations of a very young girl and a much older man, in stories like Lolita, American Beauty, and Pretty Baby. It is this romanticization and narrative that makes Vanessa continue her relationship with her abuser, Jacob Strane, even when it actively hurts her. He convinces her that he has so much more to lose than she does if their "love" becomes public, that he cannot help himself, that she has the power in the situation to consent or decline (even when he ignores her requests for him to stop and/or fails to ask for consent for very extreme sexual acts), and that, most of all, she is special because of this power she has. For a lonely girl, away from home for the first time, it's so easy to see how he selects and grooms her. But perhaps one of the most frustrating things is, even as I was reading and feeling what Vanessa feels, the descriptions and behaviors were so clear, I would shout at her as I read that there's some textbook-level gaslighting going on. It's why this is such a good book - that the author can give us those really clear cues while still telling the book in first-person, and avoid the "unreliable narrator" trope - and also one I hope to never read again.

This narrative of feminine wiles is perhaps one of the biggest issues we need to contend with if we want to do away with rape culture for good. It's a narrative that, on its surface, appears to assign all the power to the woman and none to her rapist or abuser, when at its core, it instead makes the woman powerless to stop (and deserving of) whatever harm is done to her. It's also a narrative that can be so easily spun as a positive thing when it is actually toxic and harmful.

The second issue is a bit more ambiguous, at least for me, because before I read this book, I would have agreed with this second statement, that victims must come forward so that the abuser can be brought to justice and that others can be protected. I believed this even as a person who did not bring my own rapist to justice, something I was very ashamed of about myself. But this book made me realize just how tricky this issue is.

At a surface level, it seems like a conflict between the needs of the individual and the needs of many, and from a philosophical standpoint, the needs of the many should outweigh the needs of the individual. But framing it in such a way takes away the individual's autonomy, a major issue considering that the abuse/rape was all about taking away one individual's autonomy. And victims already feel a great deal of guilt and self-blame for the event; they don't need the guilt of believing they failed others, or that they are in some way responsible for the reprehensible actions of another.

Framing it as needs of the individual vs. needs of the many oversimplifies exactly what the needs of the individuals are (privacy, self-care, fear of reprisals, and so on), while also making that individual an accomplice in how another person's actions affect the many. In the case of adults in positions of power abusing the people they should be protecting, no victim should ever be to blame; this is on the system that put (and often helps to keep) that person in power, and on all of us, for the ways (big and small) that we may contribute to these power dynamics and rape culture.

This book was very triggering for me (even though my personal experiences do not resemble Vanessa's), and I'm still working through the emotions it's brought up. I was reminded of a book I read in college, Bastard Out of Carolina, which also details years of sexual abuse of a child. When I finished that book, I threw it against the wall. Fortunately this book didn't elicit that reaction, but I didn't have a super-positive reaction to the ending either.

I'm still glad I read it, though I probably wouldn't recommend it to anyone who might also be triggered, especially if they haven't been able to work through their own trauma through therapy or treatment. And I'll definitely keep an eye out for future books by Kate Elizabeth Russell.

Some Music for Your Holidays

2020-12-19T11:11:00.003-06:00

Hey everyone,

One thing I've been doing during the pandemic is making music on my own. For our holiday season, I dropped my very first album: Winter Delights. You can read about the album and download tracks here or stream me on Soundcloud. I'm working on more arrangements (and upgraded my audio recording equipment) so I'm hoping to drop a full album early in the New Year!

And to give you a little extra something, here's a selection of performances from my choir's annual cabaret benefit, Apollo After Hours:

A Follow-Up on Yesterday's Sexist Nonsense

2020-12-13T13:02:00.004-06:00

Unsurprisingly, I'm not the only one who found Joseph Epstein's op-ed enraging. I give you this delicious takedown from Amanda Kohlhofer.

A privileged white man with no post-grad education telling a woman with a doctorate not to use her credentials. How very original of you, kiddo.

To that end, let’s list Dr. Biden’s accomplishments:

She earned a Bachelor of Arts in English from the University of Delaware in 1975.

She earned a Master of Education, with a specialty in reading, from West Chester State College in 1981.

She earned a Master of Arts in Education from Villanova University in 1987.

She earned a Doctor of Education (Ed.D) in educational leadership from the University of Delaware in 2007.

She accomplished all of this over the span of 32 years, all while becoming a wife, raising children, teaching at many different levels, running a non-profit, and accompanying her husband through multiple political campaigns. (And, who wants to tell him that not only has she earned all of these degrees, but she has also, in fact, delivered a child?)

Just as I did, Kohlhofer suspects this piece would never have been written if Jill Biden were a man. And even though Epstein's blatant sexism is very obviously jealousy over a woman who is more educated, there are definitely people who casually drop the Dr. (or refuse to even recognize that the title could be Dr.) among women more than men.

In 2011, I earned a PhD in Social Psychology. I worked for many years as a health services researcher in the Department of Veterans Affairs, where I regularly worked with PhDs, MDs, and some of those crazy smart people with both. We all called each other by first name. (Except for colleagues who had just earned their doctorate - we called them Dr. at every opportunity until they got sick of it and begged us to go back to first name. Why? Because earning a doctorate is a freaking amazing achievement!) In college and grad school, we all called each other by first name. Academia and medicine are not what Kiddo Joe envisions, a bunch of people calling each other Dr. It was all pretty casual.

BUT there are times when that title should be used, such as when introducing a panel of presenters at a conference. And it was very telling how the moderators would often introduce the men as Dr. So-and-So and the women by their first name. It's telling the number of times people have asked me if my title is Mrs. or Ms. in some of these types of settings. It's telling that when I worked at a hospital, people would immediately say, "Oh, you must be a nurse." Why not a doctor? (And even more interesting is when I was married, people would ask my husband what he did for a living but would often ask me if I work.)

Women, either with or without higher degrees, constantly have to work harder to prove themselves. Gatekeeping is alive and well, not just in gamer and sports fan communities, testing women to see if they're legit, but in pretty much any field. I've interacted with fellow psychometricians and data scientists who I'm sure would prefer to call me "Kiddo" instead of Dr., or who waste valuable meeting time explaining core concepts "for Sara's benefit." I once had a psychometrician describe a concept and then urge me to read the chapter on this topic in the recent edition of the Institute of Credentialing Excellence Handbook. I was second author of that chapter.

And as Epstein demonstrates, gatekeeping doesn't even have to come from someone with the same background or credentials. It can be some dude with a BA writing in the WSJ.

Guys, women are exhausted with this nonsense. When interacting with a woman in a professional or academic environment, be aware of those little microaggressions, or the things you may be doing that make her have to work that much harder to be believed or respected. Introduce people with their titles. Assume women know about something unless they say otherwise. Stop wasting everyone's time and energy. And stop telling us to hang up our titles.

Sexist Nonsense in the Wall Street Journal

2020-12-12T14:42:00.005-06:00

I really wish this were satire, but Joseph Epstein's recent opinion piece in the Wall Street Journal is, sadly, a completely earnest bit of mansplaining and suspicion of the intellectual elite:

Madame First Lady—Mrs. Biden—Jill—kiddo: a bit of advice on what may seem like a small but I think is a not unimportant matter. Any chance you might drop the “Dr.” before your name? “Dr. Jill Biden ” sounds and feels fraudulent, not to say a touch comic. Your degree is, I believe, an Ed.D., a doctor of education, earned at the University of Delaware through a dissertation with the unpromising title “Student Retention at the Community College Level: Meeting Students’ Needs.” A wise man once said that no one should call himself “Dr.” unless he has delivered a child. Think about it, Dr. Jill, and forthwith drop the doc.

Epstein goes on to explain that he holds no higher degrees, other than an honorary doctorate. He talks of the hilarity of people referring to him by the title Dr. Yes, it is hilarious, because honorary doctorates are merely a beefed up way of thanking someone for speaking at a university, not recognition following years of hard work to demonstrate that one has earned a title that allows that person to be considered an expert. You see, that's what doctor means - expert. An M.D. is an expert in medicine, a person with a PhD is an expert in the subject of that PhD, and so on. Epstein's honorary doctorate is really more like the prize in a box of cereal. Yeah, he had to do some work for it, but nowhere near on par with the work Dr. Jill Biden did for hers.

Epstein also laments that doctoral requirements have gotten lax in recent years, which is rich coming from someone who has never attempted to earn a doctorate.

Getting a doctorate was then an arduous proceeding: One had to pass examinations in two foreign languages, one of them Greek or Latin, defend one’s thesis, and take an oral examination on general knowledge in one’s field. At Columbia University of an earlier day, a secretary sat outside the room where these examinations were administered, a pitcher of water and a glass on her desk. The water and glass were there for the candidates who fainted.

Is he correct that the doctoral examination no longer looks like this? Yes. There is no exam in Greek or Latin, nor an oral exam of general knowledge. But that's because the structure of doctoral education has shifted. In the past, doctoral education was very self-directed, with a candidate choosing a course of study and pursuing it mostly on his (or her - but let's be real, back in the day mostly his) own. Candidates might spend years lurking around dark, dusty libraries, looking for some groundbreaking thesis to pursue. At the end, it was necessary to show that time hadn't simply been spent trying to write the most off-the-wall contribution to general knowledge, but that the candidate had also learned enough about the field of study to recognize how their contribution fits.

Today? Anyone interested in pursuing a doctorate must complete a certain amount of coursework, some elective but much of it required to establish the requisite knowledge in the chosen field. After that, they must also complete candidacy exams, which may be oral, as Epstein describes above, or written, or some combination. The point is to ensure the candidate has the foundational knowledge necessary to become an expert in the field. Then - and only then - can the candidate propose a dissertation. Other than Greek and Latin, the requirements are much the same, and in some ways, more stringent.

Honestly, not only do I think Epstein's dismissal of Dr. Biden's doctorate is ridiculous coming from someone with a Cracker Jack Prize of a doctorate, but I also suspect that if Dr. Biden were a man, using the well-earned title of Dr. wouldn't be an issue.

Seriously, WSJ? It's these kinds of articles that make me question whether I should keep subscribing to you. It's 2020. Do better.

COVID

2020-12-09T12:55:00.002-06:00

Hey all,

It's been a long time since I've updated! Though I've commented a bit on the pandemic on this blog, I've mostly stayed pretty quiet. Unfortunately, the COVID pandemic has hit home quite literally.

I'm currently in Kansas City with my family. My parents are older and have a variety of risk factors, so they've been staying in all the time. My brother, who lives with them, works in an elementary school, and though he's always been safe and careful, it appears he caught COVID shortly before Thanksgiving. Other than a bad cough, he reported feeling fine. Late last week, my dad had a COVID test done in advance of a procedure, and though he also felt fine, his test came back positive. Shortly after, my mom got a test that also came back positive. They're both experiencing more symptoms now, like shortness of breath and fatigue. My test done that same day came back negative, but yesterday, I started to feel some COVID symptoms myself, mostly fatigue (which could be as much due to stress as COVID).

We're all very lucky that our cases appear to be mild, and my parents' providers are checking in with them regularly to make sure they're recovering well. After this week, I'll probably take advantage of my excess vacation time and take time off from work to rest and recover. I'm in Kansas City for the rest of the year, and thanks to my parents' huge backyard, don't even have to leave to give Zep his much-needed outdoor time.

Stay safe and healthy, everyone!

A Weekend of Writing

2020-09-05T20:41:00.004-05:00

Just a quick update post. I'm spending my weekend doing something I've wanted to do for years - I decided to join the International 3-Day Novel Contest. Every year, people around the world spend Labor Day weekend hunched over their computer or notebook, trying to write approximately 100 double-spaced pages (or more) of a complete novel. Writers submit their work, and in the Spring, the winner gets their book published by Anvil Press.

I'm stocked up on groceries, my dog is staying with a friend (who has also agreed to sign my witness affidavit, that I followed the rules of writing, most importantly that writing only occurred between Saturday from 12:00 am until Monday at 11:59 pm), and I've got 27 pages written. Let's do this.

Pets and Quarantine

2020-08-15T14:57:00.002-05:00

I'm so thankful to have my sweet boy, Zeppelin, in my life. And when quarantine/shelter-in-place began, I was especially thankful to have him, because otherwise, I would have been completely alone. Unsurprisingly, a recent study found I'm not the only one to feel this way:

Animal shelters across the country are being completely cleared out as people seek out creature comfort. In fact, more than one in four 18-37-year-olds with pets got their new friend during quarantine.¹ Pets are bringing much-needed doses of positivity: two-thirds of Gen Z and Millennials living with pets agree their pet has helped them stay positive during this time.¹

Pets are not only showing up in homes—we are seeing them brighten up our feeds, too. Online conversation around pet adoption spiked in mid-March, up 50% from the weekly average.⁵ Whether they have a furry friend or not, 80% of Gen Z and Millennials say seeing animal content on social media makes them happy, and 74% agree that they find comfort in animal content on social media.¹ Additionally, pet-related hashtags such as #MeetMyPet, #PetRoutine, and #TreatYourPet have been trending on TikTok throughout the pandemic.

In fact, 68% of respondents said their pet helped them feel less alone, 65% said their pet helped them to "stay sane" during the pandemic, 54% believe having a pet has made them be healthier, and 39% said they'd been talking to their pet more during quarantine (guilty).

If you wish you had a four-legged friend during this difficult time, there are tons in need of a good home! I'm so glad this sweet guy is part of mine:

Creating Things

2020-08-12T20:03:00.003-05:00

Normally, this time of year, we'd be getting excited for my choir's new season and rehearsals to begin in early September. Sadly, with the pandemic, it's unlikely we'll be getting together then, and I'm not sure how long it will take before it's safe and people begin feeling comfortable gathering in such a way. So I've been seeking out ways to keep some creativity in my life.

I've started drawing again, something I haven't done in years. I'm a bit rusty but hey - practice practice, right? I started with some pretty flowers from my parents' backyard, in a combination of soft chalk pastels (my favorite medium) and colored pencil:

And my next project is going to be a self-portrait, something I've never done before. Some early work with pencil that I'll fill in soon (thinking again a combo of colored pencil and chalk pastels):

I also had some fun putting together a Lego Architecture set of Paris:

What mainly sparked this round of creativity was writing and recording an arrangement for my choir's virtual benefit. I had so much fun with that, I'm going to keep doing it! I'm planning to share that video soon, and have also started recording some other a cappella arrangements I plan on sharing.

And lastly, because I needed to bring Zep into the fun too, I've finally set up an Instagram for him. If you're on the 'gram, you can follow him here: https://www.instagram.com/zeppelinblackdog/

Coronavirus "Truthers" and Men Without Masks

2020-08-11T11:11:00.003-05:00

Two articles related to coronavirus crossed my newsfeed this morning. First is an inside look at the various Coronavirus "Truth" sites on Facebook, which peddle a variety of misinformation - from the argument that mask-wearing is a prelude to the imposition of Sharia law to masks as a way to increase child sex trafficking:

Just searching “coronavirus” will take you to a host of legitimate resources: pages for the CDC, the World Health Organization and the American Medical Association. But add a word like “truth” and suddenly you’re on a different planet: groups that exist as safe spaces for coronavirus skeptics to share theories of what’s really going on.

For every post or meme that bears a “False Information” label and links to fact-checking sites, there are dozens that elude this moderation, often as they do not present a debunkable statement. How exactly are you supposed to disprove the notion that face-mask enforcement is a prelude to some requirement that women wear the Muslim niqab?

The misinformation is so diversified (yet interconnected and overlapping) that you are bound to find your personal bogeyman at the bottom of the rabbit hole. These memes and talking points are made to frighten while appealing to your “common sense,” to flatter your intellect as it suckers you in with specious “logic” and emotional whataboutery.

Sadly, I've seen a lot of these memes and specious arguments on the pages of friends and acquaintances.

The second article discusses research that attempts to explain why men are being hit harder by the coronavirus, pointing to performative masculinity:

Poll after poll, most recently a Gallup poll from July 13, has found American men are more likely to not wear masks compared to women. Specifically, the survey found that 34 percent of men compared to 54 percent of women responded they “always” wore a mask when outside their home and that 20 percent of men said they “never” wore a mask outside their home (compared to just 8 percent of women).

Tyler Reny, a postdoctoral research fellow at Washington University in St. Louis, found [similar results] by combing through data from the Democracy Fund + UCLA Nationscape project, a public opinion survey that’s been interviewing more than 6,000 Americans about the virus per week since March 19.

“Those who had more sexist attitudes were far less likely to report feeling concerned about the pandemic, less likely to support state and local coronavirus policies, less likely to take precautions like washing their hands or wearing masks, and more likely to get sick than those with less sexist attitudes,” Reny told me. “What I found is that sexist attitudes are very predictive of all four sets of [aforementioned] outcomes, even after accounting for differences in partisanship, ideology, age, education, and population density.”

Stay healthy, stay informed, and please:

TV Shows on the "Big 3" Streaming Services

2020-08-10T18:44:00.011-05:00

2020 has been a tough year, and I've been doing my best to keep busy (and distracted from all the insanity - both at the personal and worldwide levels). Earlier this year, I took a course in machine learning techniques and have been working on applying those techniques to work datasets, as well as fun sets through Kaggle.com.

Today, I thought I'd share another dataset I discovered through Kaggle: TV shows available on one or more streaming services (Netflix, Hulu, Prime, and Disney+). There are lots of fun things we could do with this dataset. Let's start with some basic visualization and summarization.

setwd("~/Dropbox")

library(tidyverse)

## ── Attaching packages ────────────────────────────────────────────────────────── tidyverse 1.3.0 ──

## ✓ ggplot2 3.3.0     ✓ purrr   0.3.4
## ✓ tibble  3.0.0     ✓ dplyr   0.8.5
## ✓ tidyr   1.0.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0

## ── Conflicts ───────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Shows <- read_csv("tv_shows.csv")

## Warning: Missing column names filled in: 'X1' [1]

## Parsed with column specification:
## cols(
##   X1 = col_double(),
##   Title = col_character(),
##   Year = col_double(),
##   Age = col_character(),
##   IMDb = col_double(),
##   `Rotten Tomatoes` = col_character(),
##   Netflix = col_double(),
##   Hulu = col_double(),
##   `Prime Video` = col_double(),
##   `Disney+` = col_double(),
##   type = col_double()
## )

First, we can do some basic summaries, such as how many shows in the dataset are on each of the streaming services.

Counts <- Shows %>%
  summarise(Netflix = sum(Netflix),
            Hulu = sum(Hulu),
            Prime = sum(`Prime Video`),
            Disney = sum(`Disney+`)) %>%
  pivot_longer(cols = Netflix:Disney,
               names_to = "Service",
               values_to = "Count")

Counts %>%
  ggplot(aes(Service,Count)) +
  geom_col()

The biggest selling point of Disney+ is to watch their movies, though the few TV shows they offer can't really be viewed elsewhere (e.g., The Mandalorian). For the sake of simplicity, we'll drop Disney+, and focus on the big 3 services for TV shows.
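The drop itself isn't shown in code here, so here is a minimal sketch of one way it could be done. It uses a small made-up tibble (`toy_shows` and its rows are placeholders, not rows from the Kaggle file) with the same column names as the real data. Whether Disney-only shows should also be filtered out is my assumption, not something stated above.

```r
library(dplyr)
library(tibble)

# Toy stand-in for the Kaggle data; titles and values are made up
toy_shows <- tribble(
  ~Title,   ~Netflix, ~Hulu, ~`Prime Video`, ~`Disney+`,
  "Show A", 1,        0,     0,              0,
  "Show B", 0,        1,     1,              0,
  "Show C", 0,        0,     0,              1
)

# Keep shows on at least one of the big 3, then drop the Disney+ column
big3 <- toy_shows %>%
  filter(Netflix + Hulu + `Prime Video` > 0) %>%
  select(-`Disney+`)

big3
```

The same two steps should carry over to `Shows`, since the indicator columns there are also 0/1 doubles.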

The dataset also contains an indicator of recommended age, which we can plot.

Shows <- Shows %>%
  # Pass the values as levels so each one is matched by name; with labels
  # alone, R sorts the values alphabetically ("13+", "16+", "18+", "7+", "all")
  # before applying the labels, which mislabels every category
  mutate(Age = factor(Age,
                      levels = c("all",
                                 "7+",
                                 "13+",
                                 "16+",
                                 "18+"),
                      ordered = TRUE))

Shows %>%
  ggplot(aes(Age)) +
  geom_bar()

Many are 'NA' for age, though it isn't clear why. Are these older shows, added before these streaming services were required to add guidance on these issues? Is this issue seen more for a particular streaming site? Let's find out.

Shows %>%
  group_by(Age) %>%
  summarise(Count = n(),
            Year_min = min(Year),
            Year_max = max(Year),
            Prime = sum(`Prime Video`)/2144,
            Netflix = sum(Netflix)/1931,
            Hulu = sum(Hulu)/1754)

## Warning: Factor `Age` contains implicit NA, consider using
## `forcats::fct_explicit_na`

## # A tibble: 6 x 7
##   Age   Count Year_min Year_max    Prime Netflix   Hulu
##   <ord> <int>    <dbl>    <dbl>    <dbl>   <dbl>  <dbl>
## 1 all     545     1932     2020 0.0896   0.0886  0.0906
## 2 7+      848     1943     2020 0.104    0.155   0.208 
## 3 13+       4     1995     2003 0.000466 0.00155 0     
## 4 16+    1018     1955     2020 0.0975   0.206   0.293 
## 5 18+     750     1980     2020 0.0849   0.186   0.136 
## 6 <NA>   2446     1901     2020 0.623    0.363   0.272

It seems the biggest "offender" for missing age information is Prime - about 62% of the shows don't have an age indicator. More surprising, though, is the minimum year for some of these categories. I'm no expert in the history of TV, but I don't think any shows were being broadcast in 1901. What are these outliers?
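The denominators in the code above (2144, 1931, and 1754) are the total number of shows on each service, typed in by hand. Here is a minimal sketch of one way to get the same kind of proportion without hard-coding the totals. The `toy` tibble and its values are invented for illustration and are not taken from the Kaggle file.

```r
library(dplyr)

# Made-up stand-in for the Shows data: 1 = available on that service
toy <- tibble(
  Age     = c("7+", NA, "16+", NA, NA, "18+"),
  Netflix = c(1, 1, 0, 0, 1, 0),
  Hulu    = c(0, 1, 1, 0, 0, 1)
)

# Within each age group, count the shows on each service...
missing_share <- toy %>%
  group_by(Age) %>%
  summarise(Count = n(),
            Netflix = sum(Netflix),
            Hulu = sum(Hulu)) %>%
  # ...then divide each count by that service's overall total
  mutate(Netflix = Netflix / sum(Netflix),
         Hulu = Hulu / sum(Hulu))

missing_share
```

Each service column now sums to 1, so the NA row can be read directly as the share of that service's shows with no age rating.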

YearOutliers <- Shows %>%
  filter(Year < 1940)

list(YearOutliers$Title)

## [[1]]
## [1] "Born To Explore"                    "The Three Stooges"                 
## [3] "The Little Rascals Classics"        "Space: The New Frontier"           
## [5] "Gods & Monsters with Tony Robinson" "History of Westinghouse"           
## [7] "Betty Boop"

Four of these entries are clearly in error - these are newer shows. This isn't important at the moment, but it's interesting nonetheless.

In terms of getting the most "bang for your buck," Amazon Prime has the most shows to offer (though if you're looking for data on recommended age for the TV show, Prime has the most missingness). But Hulu and Netflix, in terms of volume, are pretty comparable to Prime. What can be said about the quality of content on each of the 3?

The dataset offers some indicators of quality: IMDb rating and Rotten Tomatoes score. How do the 3 services measure up on these indicators?

Netflix <- Shows %>%
  filter(Netflix == 1) %>%
  select(IMDb, `Rotten Tomatoes`) %>%
  mutate(Service = "Netflix")

Hulu <- Shows %>%
  filter(Hulu == 1) %>%
  select(IMDb, `Rotten Tomatoes`) %>%
  mutate(Service = "Hulu")

Prime <- Shows %>%
  filter(`Prime Video` == 1) %>%
  select(IMDb, `Rotten Tomatoes`) %>%
  mutate(Service = "Prime")

BigThree <- rbind(Netflix, Hulu, Prime)

BigThree <- BigThree %>%
  mutate(RotTom = as.numeric(sub("%","",`Rotten Tomatoes`))/100)

BigThree %>%
  ggplot(aes(Service, IMDb)) +
  geom_boxplot()

## Warning: Removed 1194 rows containing non-finite values (stat_boxplot).

library(scales)

## 
## Attaching package: 'scales'

## The following object is masked from 'package:purrr':
## 
##     discard

## The following object is masked from 'package:readr':
## 
##     col_factor

BigThree %>%
  ggplot(aes(Service, RotTom)) +
  geom_boxplot() +
  scale_y_continuous(labels = percent)

## Warning: Removed 4772 rows containing non-finite values (stat_boxplot).

It doesn't appear the 3 streaming services differ too much in terms of quality. But there's more analysis we can do of this dataset. More later.
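The three filter-and-stack steps above could also be collapsed into a single reshape. This is only a sketch, assuming a made-up tibble with the same column names as the Kaggle file (the titles and values are invented). A show on two services appears once per service, just as it does with the rbind approach.

```r
library(dplyr)
library(tidyr)

# Invented example rows; column names follow the Kaggle file
toy <- tibble(
  Title = c("Show A", "Show B", "Show C"),
  IMDb = c(8.1, 6.5, NA),
  `Rotten Tomatoes` = c("93%", NA, "71%"),
  Netflix = c(1, 0, 1),
  Hulu = c(1, 1, 0),
  `Prime Video` = c(0, 1, 0)
)

stacked <- toy %>%
  # One row per show-service pair
  pivot_longer(cols = c(Netflix, Hulu, `Prime Video`),
               names_to = "Service",
               values_to = "Available") %>%
  # Keep only the services each show is actually on
  filter(Available == 1) %>%
  mutate(Service = recode(Service, `Prime Video` = "Prime"),
         RotTom = as.numeric(sub("%", "", `Rotten Tomatoes`)) / 100) %>%
  select(Service, IMDb, RotTom)

# A numeric companion to the boxplots
stacked %>%
  group_by(Service) %>%
  summarise(Shows = n(),
            MedianIMDb = median(IMDb, na.rm = TRUE))
```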

Free Virtual Concert!

2020-07-07T15:01:00.001-05:00

One of my hobbies is singing, and for the last 15 years, I've been a member of the Apollo Chorus of Chicago. As with many musical arts organizations, we canceled our Spring concerts, including our annual Apollo After Hours benefit, due to COVID-19. It's unclear when in the future music organizations will be able to have in-person concerts again - possibly years.

But that doesn't mean we can't make - and share - beautiful music with you. On Friday, July 17 at 7PM, we'll be broadcasting our annual benefit as a free, virtual performance. Lots of singers in my choir have created videos to be included in the broadcast, including me! Here's a photo preview:

I'll be performing an a cappella arrangement I wrote of a Sara Bareilles song, "Breathe Again." If you want to hear it, you'll have to tune in! Find out more and sign up to get the link once it goes live here.

Flying Saucers and Bright Lights: A Data Visualization

2020-06-25T12:06:00.007-05:00

UFO Sightings by Shape and Year

Last week, I taught part 2 of a course on using R and tidyverse for my work colleagues. I wanted a fun dataset to use as an example for coding exercises throughout. There was really only one choice.

I found this great dataset through kaggle.com - UFO sightings reported to the National UFO Reporting Center (NUFORC) through 2014. This dataset gave lots of variables we could play around with, and I'd like to use it in a future session with my colleagues to talk about the process of cleaning data.

If you're interested in learning more about R and tidyverse, you can access my slides from the sessions here. (We stopped at filtering and picked up there for part 2, so everything is in one PowerPoint file.)

While working with the dataset to plan my learning sessions, I started playing around and thought it would be fun to show the various shapes of UFOs reported over time, to see if there were any shifts. Spoiler: There were. But I needed to clean the data a bit first.

setwd("~/Downloads/UFO Data")
library(tidyverse)

## -- Attaching packages ------------------------------------------- tidyverse 1.3.0 --

## v ggplot2 3.3.1     v purrr   0.3.4
## v tibble  3.0.1     v dplyr   1.0.0
## v tidyr   1.1.0     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0

## -- Conflicts ---------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

options(scipen = 999)

UFOs <- read_csv("UFOsightings.csv", col_names = TRUE)

## Parsed with column specification:
## cols(
##   datetime = col_character(),
##   city = col_character(),
##   state = col_character(),
##   country = col_character(),
##   shape = col_character(),
##   `duration (seconds)` = col_double(),
##   `duration (hours/min)` = col_character(),
##   comments = col_character(),
##   `date posted` = col_character(),
##   latitude = col_double(),
##   longitude = col_double()
## )

## Warning: 4 parsing failures.
##   row                col               expected   actual               file
## 27823 duration (seconds) no trailing characters `        'UFOsightings.csv'
## 35693 duration (seconds) no trailing characters `        'UFOsightings.csv'
## 43783 latitude           no trailing characters q.200088 'UFOsightings.csv'
## 58592 duration (seconds) no trailing characters `        'UFOsightings.csv'

There are 30 shapes represented in the data. That's a lot to show in a single figure.

UFOs %>%
  summarise(shapes = n_distinct(shape))

## # A tibble: 1 x 1
##   shapes
##    <int>
## 1     30

If we look at the different shapes in the data, we can see some overlap, as well as shapes with low counts.

UFOs %>%
  group_by(shape) %>%
  summarise(count = n())

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 30 x 2
##    shape    count
##    <chr>    <int>
##  1 changed      1
##  2 changing  1962
##  3 chevron    952
##  4 cigar     2057
##  5 circle    7608
##  6 cone       316
##  7 crescent     2
##  8 cross      233
##  9 cylinder  1283
## 10 delta        7
## # ... with 20 more rows

For instance, "changed" appears in only one record, and it should be grouped with "changing," which appears in 1,962 records. After inspecting all the shapes, I identified the following categories that accounted for most of the different shapes:

changing, which includes both changed and changing
circles, like disks, domes, and spheres
triangles, like deltas, pyramids, and triangles
four or more sided: rectangles, diamonds, and chevrons
light, which counts things like flares, fireballs, and lights

I also made an "other" category for shapes with very low counts that didn't seem to fit in the categories above, like crescents, teardrops, and formations with no further specification of shape. Finally, shape was blank for some records, so I made an "unknown" category. Here's the code I used to recategorize shape.

changing <- c("changed", "changing")
circles <- c("circle", "disk", "dome", "egg", "oval","round", "sphere")
triangles <- c("cone","delta","pyramid","triangle")
fourormore <- c("chevron","cross","diamond","hexagon","rectangle")
light <- c("fireball","flare","flash","light")
other <- c("cigar","cylinder","crescent","formation","other","teardrop")
unknown <- c("unknown", 'NA')

UFOs <- UFOs %>%
  mutate(shape2 = case_when(shape %in% changing ~ "changing",
                            shape %in% circles ~ "circular",
                            shape %in% triangles ~ "triangular",
                            shape %in% fourormore ~ "four+-sided",
                            shape %in% light ~ "light",
                            shape %in% other ~ "other",
                            # Anything left, including blank shapes, is unknown
                            TRUE ~ "unknown"))

My biggest question mark was cigar and cylinder. They're not really circles, nor do they fall in the four or more sided category. I could create another category called "tubes," but ultimately just put them in other. Using the code above as an example, you could see what happens to the chart if you put them in another category or create one of their own.
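As one example of that experiment, here is a rough sketch that gives cigars and cylinders a category of their own. The category name "tubular," the `shape3` column, and the toy `sightings` tibble are invented for illustration. The other groupings mirror the vectors defined above, with cigar and cylinder taken out of other.

```r
library(dplyr)

# Toy data standing in for the shape column of the UFO dataset
sightings <- tibble(shape = c("cigar", "cylinder", "disk", "light",
                              "teardrop", NA))

tubes <- c("cigar", "cylinder")  # hypothetical new category
circles <- c("circle", "disk", "dome", "egg", "oval", "round", "sphere")
light <- c("fireball", "flare", "flash", "light")
other <- c("crescent", "formation", "other", "teardrop")

sightings <- sightings %>%
  mutate(shape3 = case_when(shape %in% tubes ~ "tubular",
                            shape %in% circles ~ "circular",
                            shape %in% light ~ "light",
                            shape %in% other ~ "other",
                            # Blank or unlisted shapes fall through to unknown
                            TRUE ~ "unknown"))

count(sightings, shape3)
```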

For the chart, I dropped the unknowns.

UFOs <- UFOs %>%
  filter(shape2 != "unknown")

Now, to plot shapes over time, I need to extract date information. The "datetime" variable is currently a character, so I have to convert that to a date. I then pulled out year, so that each point on my figure was the count of that shape observed during a given year.

library(lubridate)

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

UFOs <- UFOs %>%
  mutate(Date2 = as.Date(datetime, format = "%m/%d/%Y"),
         Year = year(Date2))
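
Because the format string only describes the date, as.Date() ignores whatever follows it in each datetime string. Here is a small sketch with made-up strings; the month/day/year layout followed by a time is an assumption based on the format string above. It also shows that a string the format can't parse quietly becomes NA, which is worth checking for before plotting.

```r
library(lubridate)

# Invented examples: two valid datetimes and one bad value
datetimes <- c("10/10/1949 20:30", "6/1/2008 02:15", "not a date")

# The format covers only the date portion; the time is dropped
dates <- as.Date(datetimes, format = "%m/%d/%Y")
dates

# Pull out the year, as in the mutate() call above
year(dates)

# Count the values that failed to parse
sum(is.na(dates))
```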

Now we have all the information we need to plot shapes over time, to see if there have been changes. We'll create a summary dataframe by Year and shape2, then create a line chart with that information.

Years <- UFOs %>%
  group_by(Year, shape2) %>%
  summarise(count = n())

## `summarise()` regrouping output by 'Year' (override with `.groups` argument)

library(scales)

## 
## Attaching package: 'scales'

## The following object is masked from 'package:purrr':
## 
##     discard

## The following object is masked from 'package:readr':
## 
##     col_factor

library(ggthemes)

Years %>%
  ggplot(aes(Year, count, color = shape2)) +
  geom_point() +
  geom_line() +
  scale_x_continuous(breaks = seq(1910,2020,10)) +
  scale_y_continuous(breaks = seq(0,3000,500), labels = comma) +
  labs(color = "Object Shape", title = "From Flying Saucers to Bright Lights:\nSightings of UFO Shapes Over Time") +
  ylab("Number of Sightings") +
  theme_economist_white() +
  scale_color_tableau() +
  theme(plot.title = element_text(hjust = 0.5))

Until the mid-90s, the most commonly seen UFO was circular. After that, light shapes became much more common. I'm wondering if this could be explained in part by UFOs in pop culture, moving from the flying saucers of earlier sci-fi to the bright lights without discernible shape in the more recent sci-fi. The third most common shape is our "other" category, which suggests we might want to rethink that one. It could be that some of the shapes within that category are common enough to warrant their own category, while reserving other for those that don't have a good category of their own. Cigar and cylinder, for instance, have high counts and could be put in their own category. Feel free to play around with the data and see what you come up with!

Space Force: A Review

2020-06-10T08:00:00.000-05:00

I've continued to work from home during our shelter-in-place (something my boss recently told me we'll be doing for a while). During my copious downtime, I've gotten to watch a lot of things I've had on my watch-list, including the Netflix original series, Space Force.

I've made my way through season 1 of the series, and thoroughly enjoyed it. I was surprised to learn - partway through watching - that critics did not enjoy the series nearly as much as I did. I'll get to that shortly.

I loved the political satire element of the show, and that it was inspired by statements by our buffoon of a president. And I loved the periodic texts and tweets they referenced from a character they only referred to as "POTUS" (although, we all know who they mean). But really, I felt the critics were expecting something very different from what the show gave us, and that is the reason for their negative reviews.

While the concept is hilarious, and Mark Naird (Steve Carell's character) is often a buffoon, the show is really a family drama framed by absurdist comedy. General Naird is a single father of a teenage daughter (Diana Silvers, from Booksmart, which I also thoroughly enjoyed), after his wife (played brilliantly by Lisa Kudrow) is imprisoned for an unmentioned crime (which earned her 40-60 years, so clearly really bad). The show deals with a variety of family issues, not just the aforementioned single parenthood, but also teenage rebellion and substance abuse, fear of abandonment, and a parent who often feels married to their job. It dealt with the concept of an open marriage in a way that was authentic, while also being heartbreaking and funny at the same time. The show made me cry just as often as it made me laugh, and I could often relate to Mark's character - his heartbreak when his wife suggested an open marriage was so real, I bawled. It poked fun at the full political spectrum, as well as at Boomers, X-ers, and Millennials alike.

I think a lot of people were expecting Michael Scott as a general, but Mark Naird - though often a goof who really didn't understand science, which was an important part of his job, personified by his chief scientist (played so wonderfully by John Malkovich: better casting does not exist) - showed a surprising depth and understanding of people, in ways that both surprised and confirmed the conclusions of his scientists. Michael Scott seemed oblivious to the people who worked for him and showed zero understanding of people skills, while Mark Naird thought first and foremost about the people, and spoke eloquently on the topic.

I especially loved the character of Captain Ali (played by Tawny Newsome) and look forward (hopefully) to learning more about her character. Of all the characters on the show, she's my favorite.

It was also a joy to see Fred Willard as Mark's elderly father, who since filming his role has passed away. He will very much be missed and I'll be interested in seeing how they deal with the actor's death (since season 2 has not even been greenlit, let alone filmed). My only complaint was with the cheap jokes at his elderly mother's expense, including at one point showing the caretaker giving her CPR while Mark's father obliviously (and jovially) spoke on the phone. Mark's mother obviously has both lung (due to her being on oxygen) and heart (due to the CPR) issues, and as the daughter of a man with similar issues, I would have wished a show with so much heart had been more delicate with these conditions, rather than using them for cheap laughs.

My only disappointment with Space Force (other than my complaint above) is with the critics' reaction to it. I sincerely hope there is a season 2.

Zoomies

2020-05-12T08:58:00.001-05:00

Check out this adorable Zoom meeting:

Still having the company meetings online. pic.twitter.com/aR3LfuSdKl

— Andrew Cotter (@MrAndrewCotter) May 11, 2020

Statistics Sunday: My 2019 Reading

2020-05-03T09:00:00.000-05:00

I've spent the month of April blogging my way through the tidyverse, while using my reading dataset from 2019 as the example. Today, I thought I'd bring many of those analyses and data manipulation techniques together to do a post about my reading habits for the year.

library(tidyverse)

## -- Attaching packages ------------------------------------------- tidyverse 1.3.0 --

## ✓ ggplot2 3.2.1     ✓ purrr   0.3.3
## ✓ tibble  2.1.3     ✓ dplyr   0.8.3
## ✓ tidyr   1.0.0     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.4.0

## -- Conflicts ---------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

reads2019 <- read_csv("~/Downloads/Blogging A to Z/SaraReads2019_allchanges.csv",
                      col_names = TRUE)

## Parsed with column specification:
## cols(
##   Title = col_character(),
##   Pages = col_double(),
##   date_started = col_character(),
##   date_read = col_character(),
##   Book.ID = col_double(),
##   Author = col_character(),
##   AdditionalAuthors = col_character(),
##   AverageRating = col_double(),
##   OriginalPublicationYear = col_double(),
##   read_time = col_double(),
##   MyRating = col_double(),
##   Gender = col_double(),
##   Fiction = col_double(),
##   Childrens = col_double(),
##   Fantasy = col_double(),
##   SciFi = col_double(),
##   Mystery = col_double(),
##   SelfHelp = col_double()
## )

As you recall, I read 87 books last year, by 42 different authors.

reads2019 %>%
  summarise(Books = n(),
            Authors = n_distinct(Author))

## # A tibble: 1 x 2
##   Books Authors
##   <int>   <int>
## 1    87      42

Using summarise, we can get some basic information about each author.

authors <- reads2019 %>%
  group_by(Author) %>%
  summarise(Books = n(),
            Pages = sum(Pages),
            AvgRating = mean(MyRating),
            Oldest = min(OriginalPublicationYear),
            Newest = max(OriginalPublicationYear),
            AvgRT = mean(read_time),
            Gender = first(Gender),
            Fiction = sum(Fiction),
            Childrens = sum(Childrens),
            Fantasy = sum(Fantasy),
            Sci = sum(SciFi),
            Mystery = sum(Mystery))

Let's plot number of books by each author, with the bars arranged by number of books.

authors %>%
  ggplot(aes(reorder(Author, desc(Books)), Books)) +
  geom_col() +
  theme(axis.text.x = element_text(angle = 90)) +
  xlab("Author")

I could simplify this chart quite a bit by only showing authors with 2 or more books in the set, and also by flipping the axes so author can be read along the side.

authors %>%
  mutate(Author = fct_reorder(Author, desc(Author))) %>%
  filter(Books > 1) %>%
  ggplot(aes(reorder(Author, Books), Books)) +
  geom_col() +
  coord_flip() +
  xlab("Author")

Based on this data, I read the most books by L. Frank Baum (which makes sense, because I made a goal to reread all 14 Oz series books), followed by Terry Pratchett (which makes sense, because I love him). The code above is slightly more complex, because when I used coord_flip() on its own, the author names were displayed in reverse alphabetical order. Using the factor reorder code plus the reorder in ggplot allowed me to display the chart in order by number of books, then alphabetically by author.

We can also plot average rating by author, which can tell me a little more about how much I like particular authors. Let's plot those for any author who contributed at least 2 books to my dataset.

authors %>%
  filter(Books > 1) %>%
  ggplot(aes(Author, AvgRating)) +
  geom_col() +
  scale_x_discrete(labels=function(x){sub("\\s", "\n", x)}) +
  ylab("Average Rating")

I only read 2 books by Ann Patchett, but I rated both of her books as 5, giving her the highest average rating. If I look only at the authors who contributed more than 2 books, John Scalzi (tied for 3rd most read in 2019) has the highest rating, followed by Terry Pratchett (2nd most read). Obviously, though, I really like any of the authors I read at least 2 books from, because they all have fairly high average ratings. Stephen King is the only one with an average below 4, and that's only because I read Cujo, which I hated (more on that later on in this post).

We can also look at how genre affected ratings. Using the genre labels I generated before, let's plot average rating.

genre <- reads2019 %>%
  group_by(Fiction, Childrens, Fantasy, SciFi, Mystery) %>%
  summarise(Books = n(),
            AvgRating = mean(MyRating)) %>%
  bind_cols(Genre = c("Non-Fiction",
           "General Fiction",
           "Mystery",
           "Science Fiction",
           "Fantasy",
           "Fantasy Sci-Fi",
           "Children's Fiction",
           "Children's Fantasy"))

genre %>%
  ggplot(aes(reorder(Genre, desc(AvgRating)), AvgRating)) +
  geom_col() +
  scale_x_discrete(labels=function(x){sub("\\s", "\n", x)}) +
  xlab("Genre") +
  ylab("Average Rating")

Based on this plot, my favorite genres appear to be fantasy, sci-fi, and especially books with elements of both. No surprises here.
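One thing to keep in mind about the genre code above: bind_cols() attaches the labels by position, so the vector has to be in exactly the order the summarised flag combinations come out. Here is a sketch of an alternative that derives each label from the flags themselves. The flag-to-label mapping is an assumption inferred from the label order above, and the toy `books` tibble is made up.

```r
library(dplyr)

# Invented rows using the same 0/1 genre flags as the reading dataset
books <- tibble(
  Fiction   = c(0, 1, 1, 1, 1),
  Childrens = c(0, 0, 0, 1, 0),
  Fantasy   = c(0, 0, 1, 1, 1),
  SciFi     = c(0, 0, 0, 0, 1),
  Mystery   = c(0, 1, 0, 0, 0),
  MyRating  = c(4, 3, 5, 4, 5)
)

# Conditions are checked top to bottom, so the most specific ones come first
labelled <- books %>%
  mutate(Genre = case_when(
    Fiction == 0 ~ "Non-Fiction",
    Childrens == 1 & Fantasy == 1 ~ "Children's Fantasy",
    Childrens == 1 ~ "Children's Fiction",
    Fantasy == 1 & SciFi == 1 ~ "Fantasy Sci-Fi",
    Fantasy == 1 ~ "Fantasy",
    SciFi == 1 ~ "Science Fiction",
    Mystery == 1 ~ "Mystery",
    TRUE ~ "General Fiction"
  ))

labelled %>%
  group_by(Genre) %>%
  summarise(Books = n(),
            AvgRating = mean(MyRating))
```

Because the label travels with each book, it keeps working if a new combination of flags shows up or the sort order changes.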

Let's dig into ratings on individual books. In my filter post, I identified the 25 books I liked the most (i.e., gave them a 5-star rating). What about the books I disliked? The lowest rating I gave was a 2, but it's safe to say I hated those books. And I also probably didn't like the books I rated as 3.

lowratings <- reads2019 %>%
  filter(MyRating <= 3) %>%
  mutate(Rating = case_when(MyRating == 2 ~ "Hated",
                   MyRating == 3 ~ "Disliked")) %>%
  arrange(desc(MyRating), Author) %>%
  select(Title, Author, Rating)

library(expss)

## 
## Attaching package: 'expss'

## The following objects are masked from 'package:stringr':
## 
##     fixed, regex

## The following objects are masked from 'package:dplyr':
## 
##     between, compute, contains, first, last, na_if, recode, vars

## The following objects are masked from 'package:purrr':
## 
##     keep, modify, modify_if, transpose

## The following objects are masked from 'package:tidyr':
## 
##     contains, nest

## The following object is masked from 'package:ggplot2':
## 
##     vars

as.etable(lowratings, rownames_as_row_labels = FALSE)

Title	Author	Rating
The Scarecrow of Oz (Oz, #9)	Baum, L. Frank	Disliked
The Tin Woodman of Oz (Oz, #12)	Baum, L. Frank	Disliked
Herself Surprised	Cary, Joyce	Disliked
The 5 Love Languages: The Secret to Love That Lasts	Chapman, Gary	Disliked
Boundaries: When to Say Yes, How to Say No to Take Control of Your Life	Cloud, Henry	Disliked
Summerdale	Collins, David Jay	Disliked
When We Were Orphans	Ishiguro, Kazuo	Disliked
Bird Box (Bird Box, #1)	Malerman, Josh	Disliked
Oz in Perspective: Magic and Myth in the L. Frank Baum Books	Tuerk, Richard	Disliked
Cujo	King, Stephen	Hated
Just Evil (Evil Secrets Trilogy, #1)	McKeehan, Vickie	Hated

I'm a little surprised at some of this, because several books I rated as 3 I liked and only a few I legitimately didn't like. The 2 books I rated as 2 I really did hate, and probably should have rated as 1 instead. So based on my new understanding of how I've been using (misusing) those ratings, I'd probably update 3 ratings.

reads2019 <- reads2019 %>%
  mutate(MyRating = replace(MyRating,
                            MyRating == 2, 1),
         MyRating = replace(MyRating,
                            Title == "Herself Surprised", 2))

lowratings <- reads2019 %>%
  filter(MyRating <= 2) %>%
  mutate(Rating = case_when(MyRating == 1 ~ "Hated",
                   MyRating == 2 ~ "Disliked")) %>%
  arrange(desc(MyRating), Author) %>%
  select(Title, Author, Rating)

library(expss)

as.etable(lowratings, rownames_as_row_labels = FALSE)

Title	Author	Rating
Herself Surprised	Cary, Joyce	Disliked
Cujo	King, Stephen	Hated
Just Evil (Evil Secrets Trilogy, #1)	McKeehan, Vickie	Hated

There! Now I have a much more accurate representation of the books I actually disliked/hated, and know how I should be rating books going forward to better reflect how I think of the categories. Of the two I hated, Just Evil... was an e-book I won in a Goodreads giveaway that I read on my phone when I didn't have a physical book with me: convoluted storyline, problematic romantic relationships, and a main character who talked about how much her dog was her baby, and yet the dog was forgotten half the time (even left alone for long periods of time while she was off having her problematic relationship) except when the dog's reaction or protection became important to the storyline. The other, Cujo, I reviewed here; while I'm glad I read it, I have no desire to ever read it again.

Let's look again at my top books, but this time, classify them by long genre descriptions from above. I can get that information into my full reading dataset with a join, using the genre flags. Then I can plot the results from that dataset without having to summarize first.

topbygenre <- reads2019 %>%
  left_join(genre, by = c("Fiction","Childrens","Fantasy","SciFi","Mystery")) %>%
  select(-Books, -AvgRating) %>%
  filter(MyRating == 5)

topbygenre %>%
  ggplot(aes(fct_infreq(Genre))) +
  geom_bar() +
  scale_x_discrete(labels=function(x){sub("\\s", "\n", x)}) +
  xlab("Genre") +
  ylab("Books")

This chart helps me to better understand my average rating by genre chart above. Only 1 book with elements of both fantasy and sci-fi was rated as a 5, and the average rating is 4.5, meaning there's only 1 other book in that category that had to be rated as a 4. It might be a good idea to either filter my genre rating table to categories with more than 1 book, or add the counts as labels to that plot. Let's try the latter.

genre %>%
  ggplot(aes(reorder(Genre, desc(AvgRating)), AvgRating, label = Books)) +
  geom_col() +
  scale_x_discrete(labels=function(x){sub("\\s", "\n", x)}) +
  xlab("Genre") +
  ylab("Average Rating") +
  geom_text(aes(x = Genre, y = AvgRating-0.25), size = 5,
                color = "white")

Let's redo this chart, excluding those genres with only 1 or 2 books represented.

genre %>%
  filter(Books > 2) %>%
  ggplot(aes(reorder(Genre, desc(AvgRating)), AvgRating, label = Books)) +
  geom_col() +
  scale_x_discrete(labels=function(x){sub("\\s", "\n", x)}) +
  xlab("Genre") +
  ylab("Average Rating") +
  geom_text(aes(x = Genre, y = AvgRating-0.25), size = 5,
                color = "white")

While I love both science fiction and fantasy - reading equal numbers of books in those genres - I seem to like science fiction a bit more, based on the slightly higher average rating.

Z is for Additional Axes

2020-04-30T09:00:00.000-05:00

Here we are at the last post in Blogging A to Z! Today, I want to talk about adding additional axes to your ggplot, using the options for fill or color. While these aren't true z-axes in the geometric sense, I think of them as a third, z, axis.

Some of you may be surprised to learn that fill and color are different, and that you could use one or both in a given plot.

Color refers to the outline of the object (bar, pie chart wedge, etc.), while fill refers to the inside of the object. For scatterplots, the default shape doesn't have a fill, so you'd just use color to change the appearance of those points.

Let's recreate the pages read over 2019 chart, but this time, I'll just use fiction books and separate them as either fantasy or other fiction; this divides that dataset pretty evenly in half. Here's how I'd generate the pages read over time separately by those two genre categories.

library(tidyverse)

## -- Attaching packages ------------------------------------------- tidyverse 1.3.0 --

## ✓ ggplot2 3.2.1     ✓ purrr   0.3.3
## ✓ tibble  2.1.3     ✓ dplyr   0.8.3
## ✓ tidyr   1.0.0     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.4.0

## -- Conflicts ---------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

reads2019 <- read_csv("~/Downloads/Blogging A to Z/SaraReads2019_allchanges.csv",
                      col_names = TRUE)

## Parsed with column specification:
## cols(
##   Title = col_character(),
##   Pages = col_double(),
##   date_started = col_character(),
##   date_read = col_character(),
##   Book.ID = col_double(),
##   Author = col_character(),
##   AdditionalAuthors = col_character(),
##   AverageRating = col_double(),
##   OriginalPublicationYear = col_double(),
##   read_time = col_double(),
##   MyRating = col_double(),
##   Gender = col_double(),
##   Fiction = col_double(),
##   Childrens = col_double(),
##   Fantasy = col_double(),
##   SciFi = col_double(),
##   Mystery = col_double(),
##   SelfHelp = col_double()
## )

fantasy <- reads2019 %>%
  filter(Fiction == 1) %>%
  mutate(date_read = as.Date(date_read, format = '%m/%d/%Y'),
         Fantasy = factor(Fantasy, levels = c(0,1),
                          labels = c("Other Fiction",
                                     "Fantasy"))) %>%
  group_by(Fantasy) %>%
  mutate(GenreRead = order_by(date_read, cumsum(Pages))) %>%
  ungroup()

Now I'd just plug that information into my ggplot code, but add a third variable in the aesthetics (aes) for ggplot - color = Fantasy.

library(scales)

## 
## Attaching package: 'scales'

## The following object is masked from 'package:purrr':
## 
##     discard

## The following object is masked from 'package:readr':
## 
##     col_factor

myplot <- fantasy %>%
  ggplot(aes(date_read, GenreRead, color = Fantasy)) +
  geom_point() +
  xlab("Date") +
  ylab("Pages") +
  scale_x_date(date_labels = "%b",
               date_breaks = "1 month") +
  scale_y_continuous(labels = comma, breaks = seq(0,30000,5000)) +
  labs(color = "Genre of Fiction")

This plot uses the default R colorscheme. I could change those colors, using an existing colorscheme, or define my own. Let's make a fivethirtyeight style figure, using their theme for the overall plot, and their color scheme for the genre variable.

library(ggthemes)

## Warning: package 'ggthemes' was built under R version 3.6.3

myplot +
  scale_color_fivethirtyeight() +
  theme_fivethirtyeight()

I can also specify my own colors.

myplot +
  scale_color_manual(values = c("#4b0082","#ffd700")) +
  theme_minimal()

geom_point offers many point shapes; shapes 21-25 let you specify both color and fill. For the rest, use only color.
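
To see the difference, here is a minimal sketch using shape 21 with a made-up data frame (`pts` and its values are invented for illustration). The fill is mapped to a grouping variable, while the outline color is fixed for every point.

```r
library(ggplot2)

# Invented data: four points in two groups
pts <- data.frame(x = 1:4,
                  y = c(2, 4, 3, 5),
                  grp = c("a", "a", "b", "b"))

# Shape 21 is a circle with a separate outline (color) and inside (fill)
p <- ggplot(pts, aes(x, y, fill = grp)) +
  geom_point(shape = 21, color = "black", size = 4, stroke = 1)

p
```

Swapping in one of the other shapes (say, shape = 16) would make the fill mapping have no visible effect, because those shapes are drawn with color only.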

library(ggpubr)

## Warning: package 'ggpubr' was built under R version 3.6.3

## Loading required package: magrittr

## 
## Attaching package: 'magrittr'

## The following object is masked from 'package:purrr':
## 
##     set_names

## The following object is masked from 'package:tidyr':
## 
##     extract

ggpubr::show_point_shapes()

## Scale for 'y' is already present. Adding another scale for 'y', which will
## replace the existing scale.

Of course, you may have plots where changing fill is best, such as on a bar plot. In my summarize example, I created a stacked bar chart of fiction versus non-fiction with author gender as the fill.

reads2019 %>%
  mutate(Gender = factor(Gender, levels = c(0,1),
                         labels = c("Male",
                                    "Female")),
         Fiction = factor(Fiction, levels = c(0,1),
                          labels = c("Non-Fiction",
                                     "Fiction"),
                          ordered = TRUE)) %>%
  group_by(Gender, Fiction) %>%
  summarise(Books = n()) %>%
  ggplot(aes(Fiction, Books, fill = reorder(Gender, desc(Gender)))) +
  geom_col() +
  scale_fill_economist() +
  xlab("Genre") +
  labs(fill = "Author Gender")

Stacking is the default, but I could also have the bars next to each other.

reads2019 %>%
  mutate(Gender = factor(Gender, levels = c(0,1),
                         labels = c("Male",
                                    "Female")),
         Fiction = factor(Fiction, levels = c(0,1),
                          labels = c("Non-Fiction",
                                     "Fiction"),
                          ordered = TRUE)) %>%
  group_by(Gender, Fiction) %>%
  summarise(Books = n()) %>%
  ggplot(aes(Fiction, Books, fill = reorder(Gender, desc(Gender)))) +
  geom_col(position = "dodge") +
  scale_fill_economist() +
  xlab("Genre") +
  labs(fill = "Author Gender")

You can also use fill (or color) with the same variable you used for x or y; that is, instead of adding a third variable, it adds some color and separation to distinguish the categories of the x or y variable. This is especially helpful if you have many categories being plotted, because it helps break up the wall of bars. If you do this, I'd recommend choosing a palette whose colors go well together, rather than highly contrasting ones. You'll probably also want to drop the legend, since the axis is already labeled.

genres <- reads2019 %>%
  group_by(Fiction, Childrens, Fantasy, SciFi, Mystery) %>%
  summarise(Books = n())

genres <- genres %>%
  bind_cols(Genre = c("Non-Fiction",
           "General Fiction",
           "Mystery",
           "Science Fiction",
           "Fantasy",
           "Fantasy Sci-Fi",
           "Children's Fiction",
           "Children's Fantasy"))

genres %>%
  filter(Genre != "Non-Fiction") %>%
  ggplot(aes(reorder(Genre, -Books), Books, fill = Genre)) +
  geom_col() +
  xlab("Genre") +
  scale_x_discrete(labels=function(x){sub("\\s", "\n", x)}) +
  scale_fill_economist() +
  theme(legend.position = "none")

If you only have a couple categories and want to draw a contrast, that's when you can use contrasting shades: for instance, at work, when I plot performance on an item, I use red for incorrect and blue for correct, to maximize the contrast between the two performance levels for whatever data I'm presenting.
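As a rough sketch of that red-versus-blue idea, here's how it might look with scale_fill_manual. The item_perf data frame and its counts are hypothetical, not real work data.

```r
library(ggplot2)

# Invented item-performance counts, for illustration only
item_perf <- data.frame(Response = c("Incorrect", "Correct"),
                        Examinees = c(35, 65))

contrast_plot <- ggplot(item_perf, aes(Response, Examinees, fill = Response)) +
  geom_col() +
  # A named vector ties each color to its category explicitly
  scale_fill_manual(values = c("Incorrect" = "red", "Correct" = "blue")) +
  # The x-axis already names the categories, so the legend is redundant
  theme(legend.position = "none")
```

Naming the values means the colors stay attached to the right category even if the factor order changes.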

I hope you enjoyed this series! There's so much more you can do with tidyverse than what I covered this month. Hopefully this has given you enough to get started and sparked your interest to learn more. Once again, I highly recommend checking out R for Data Science.

Y is for scale_y

2020-04-29T09:00:00.000-05:00

Yesterday, I talked about scale_x. Today, I'll continue on that topic, focusing on the y-axis.

The key to using any of the scale_ functions is to know what sort of data you're working with (e.g., date, continuous, discrete). Yesterday, I talked about scale_x_date and scale_x_discrete. We often put these types of data on the x-axis, while the y-axis is frequently used for counts. When displaying counts, we want to think about the major breaks that make sense, as well as any additional formatting to make them easier to read.

If I go back to my pages over time plot, you'll notice the major breaks are in the tens of thousands. We're generally used to seeing those values with a comma separating the thousands from the hundreds. I could add those to my plot like this (with a little help from the scales package).

library(tidyverse)

## -- Attaching packages ------------------------------------------- tidyverse 1.3.0 --

## ✓ ggplot2 3.2.1     ✓ purrr   0.3.3
## ✓ tibble  2.1.3     ✓ dplyr   0.8.3
## ✓ tidyr   1.0.0     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.4.0

## -- Conflicts ---------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

reads2019 <- read_csv("~/Downloads/Blogging A to Z/SaraReads2019_allchanges.csv",
                      col_names = TRUE)

## Parsed with column specification:
## cols(
##   Title = col_character(),
##   Pages = col_double(),
##   date_started = col_character(),
##   date_read = col_character(),
##   Book.ID = col_double(),
##   Author = col_character(),
##   AdditionalAuthors = col_character(),
##   AverageRating = col_double(),
##   OriginalPublicationYear = col_double(),
##   read_time = col_double(),
##   MyRating = col_double(),
##   Gender = col_double(),
##   Fiction = col_double(),
##   Childrens = col_double(),
##   Fantasy = col_double(),
##   SciFi = col_double(),
##   Mystery = col_double(),
##   SelfHelp = col_double()
## )

reads2019 <- reads2019 %>%
  mutate(date_started = as.Date(date_started, format = '%m/%d/%Y'),
         date_read = as.Date(date_read, format = '%m/%d/%Y'),
         PagesRead = order_by(date_read, cumsum(Pages)))

library(scales)

## 
## Attaching package: 'scales'

## The following object is masked from 'package:purrr':
## 
##     discard

## The following object is masked from 'package:readr':
## 
##     col_factor

reads2019 %>%
  ggplot(aes(date_read, PagesRead)) +
  geom_point() +
  scale_x_date(date_labels = "%B",
               date_breaks = "1 month") +
  scale_y_continuous(labels = comma) +
  labs(title = "Cumulative Pages Read Over 2019") +
  theme(plot.title = element_text(hjust = 0.5))

I could also add more major breaks.

reads2019 %>%
  ggplot(aes(date_read, PagesRead)) +
  geom_point() +
  scale_x_date(date_labels = "%B",
               date_breaks = "1 month") +
  scale_y_continuous(labels = comma,
                     breaks = seq(0, 30000, 5000)) +
  labs(title = "Cumulative Pages Read Over 2019") +
  theme(plot.title = element_text(hjust = 0.5))

The scales package offers other ways to format data besides the 3 I've shown in this series (log transformation, percent, and now continuous with comma). It also lets you format data with currency, bytes, ranks, and scientific notation.
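To get a feel for what these formatters do, you can call them directly on a vector outside of a plot. This is just a sketch: the pages vector is made up, and I'm only showing a few formatters I know exist in scales (comma, percent, dollar, scientific).

```r
library(scales)

# Hypothetical values, just to show what each formatter returns
pages <- c(5000, 12500, 30000)

comma_labels <- comma(pages)   # thousands separator, as used on the y-axis above
comma_labels

percent(c(0.05, 0.5))          # proportions shown as percentages
dollar(pages)                  # currency
scientific(pages)              # scientific notation

# The breaks argument just takes a numeric vector; seq() builds an evenly spaced one
y_breaks <- seq(0, 30000, 5000)
y_breaks
```

Any of these can be passed to the labels argument of scale_y_continuous the same way comma is above.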

Last post tomorrow!

X is for scale_x

2020-04-28T09:00:00.000-05:00

These next two posts will deal with formatting scales in ggplot2 - x-axis, y-axis - so I'll try to limit the amount of overlap and repetition.

Let's say I wanted to plot my reading over time, specifically as a cumulative sum of pages across the year. My x-axis will be a date. Since my reads2019 file initially formats my dates as character, I'll need to use my mutate code to turn them into dates, plus compute my cumulative sum of pages read.

library(tidyverse)

## -- Attaching packages ------------------------------------------- tidyverse 1.3.0 --

## ✓ ggplot2 3.2.1     ✓ purrr   0.3.3
## ✓ tibble  2.1.3     ✓ dplyr   0.8.3
## ✓ tidyr   1.0.0     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.4.0

## -- Conflicts ---------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

reads2019 <- read_csv("~/Downloads/Blogging A to Z/SaraReads2019_allchanges.csv",
                      col_names = TRUE)

## Parsed with column specification:
## cols(
##   Title = col_character(),
##   Pages = col_double(),
##   date_started = col_character(),
##   date_read = col_character(),
##   Book.ID = col_double(),
##   Author = col_character(),
##   AdditionalAuthors = col_character(),
##   AverageRating = col_double(),
##   OriginalPublicationYear = col_double(),
##   read_time = col_double(),
##   MyRating = col_double(),
##   Gender = col_double(),
##   Fiction = col_double(),
##   Childrens = col_double(),
##   Fantasy = col_double(),
##   SciFi = col_double(),
##   Mystery = col_double(),
##   SelfHelp = col_double()
## )

reads2019 <- reads2019 %>%
  mutate(date_started = as.Date(date_started, format = '%m/%d/%Y'),
         date_read = as.Date(date_read, format = '%m/%d/%Y'),
         PagesRead = order_by(date_read, cumsum(Pages)))

This gives me the variables I need to plot my pages read over time.

reads2019 %>%
  ggplot(aes(date_read, PagesRead)) +
  geom_point()

ggplot2 did a fine job of creating this plot using default settings. Since my date_read variable is a date, the plot automatically ordered date_read, formatted as "Month Year", and used quarters as breaks. But we can still use the scale_x functions to make this plot look even better.

One way could be to format years as 2-digit instead of 4. We could also have month breaks instead of quarters.

reads2019 %>%
  ggplot(aes(date_read, PagesRead)) +
  geom_point() +
  scale_x_date(date_labels = "%b %y",
               date_breaks = "1 month")
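The date_labels strings use the same codes as base R's format() for dates, so you can try them out on a single date before putting them in a plot. A small sketch, using an arbitrary date I picked for illustration (month names depend on your locale):

```r
# A single date from 2019, chosen arbitrarily for illustration
d <- as.Date("2019-04-28")

format(d, "%b %y")  # abbreviated month + 2-digit year (the labels used above)
format(d, "%B")     # full month name
format(d, "%b")     # abbreviated month name only

two_digit_year <- format(d, "%y")
four_digit_year <- format(d, "%Y")

# date_breaks = "1 month" corresponds to a monthly sequence like this one
month_starts <- seq(as.Date("2019-01-01"), as.Date("2019-12-01"), by = "1 month")
month_starts
```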

Of course, we could drop year completely and just show month, since all of this data is for 2019. We could then note that in the title instead.

reads2019 %>%
  ggplot(aes(date_read, PagesRead)) +
  geom_point() +
  scale_x_date(date_labels = "%B",
               date_breaks = "1 month") +
  labs(title = "Cumulative Pages Read Over 2019") +
  theme(plot.title = element_text(hjust = 0.5))

Tomorrow, I'll show some tricks for how we can format the y-axis of this plot. But let's see what else we can do to the x-axis. Let's create a bar graph with my genre data. I'll use the genre names I created for my summarized data last week.

genres <- reads2019 %>%
  group_by(Fiction, Childrens, Fantasy, SciFi, Mystery) %>%
  summarise(Books = n())

genres <- genres %>%
  bind_cols(Genre = c("Non-Fiction",
           "General Fiction",
           "Mystery",
           "Science Fiction",
           "Fantasy",
           "Fantasy Sci-Fi",
           "Children's Fiction",
           "Children's Fantasy"))

genres %>%
  ggplot(aes(Genre, Books)) +
  geom_col()

Unfortunately, my new genre names are a bit long, and overlap each other unless I make my plot really wide. There are a few ways I can deal with that. First, I could ask ggplot2 to abbreviate the names.

genres %>%
  ggplot(aes(Genre, Books)) +
  geom_col() +
  scale_x_discrete(labels = abbreviate)

These abbreviations were generated automatically by R, and I'm not a huge fan. A better way might be to add line breaks to any two-word genres. This Stack Overflow post gave me a function I can add to my scale_x_discrete to do just that.

genres %>%
  ggplot(aes(Genre, Books)) +
  geom_col() +
  scale_x_discrete(labels=function(x){sub("\\s", "\n", x)})

MUCH better!
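If you're curious what that labelling function is doing, here's a sketch you can run on its own. sub() swaps only the first whitespace character for a line break, which is all my two-word genres need; the three-word label below is hypothetical, just to show where gsub() would differ.

```r
genre_names <- c("Science Fiction", "Fantasy Sci-Fi", "Children's Fantasy", "Mystery")

# sub() replaces only the FIRST whitespace character in each string
wrapped <- sub("\\s", "\n", genre_names)
cat(wrapped, sep = "\n---\n")

# gsub() replaces EVERY whitespace character, which matters for 3+ word labels
three_words <- "Young Adult Fantasy"   # hypothetical label, not in my data
first_only <- sub("\\s", "\n", three_words)
all_spaces <- gsub("\\s", "\n", three_words)
```

One-word labels like "Mystery" pass through unchanged, since there's no whitespace to replace.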

As you can see, the scale_x function you use depends on the type of data you're working with. For dates, scale_x_date; for categories, scale_x_discrete. Tomorrow, we'll show some ways to format continuous data, since that's often what you see on the y-axis. See you then!

By the way, this is my 1000th post on my blog!

W is for Write and Read Data - Fast

2020-04-27T09:00:00.000-05:00

Once again, I'm dipping outside of the tidyverse, but this package and its functions have been really useful in getting data quickly in (and out) of R.

For work, I have to pull in data from a few different sources, and manipulate and work with them to give me the final dataset that I use for much of my analysis. So that I don't have to go through all of that joining, recoding, and calculating each time, I created a final merged dataset as a CSV file that I can load when I need to continue my analysis. The problem is that the most recent version of that file, which contains 13 million+ records, was so large that writing it (and subsequently reading it in later) took forever and sometimes timed out.

That's when I discovered the data.table library, and its fread and fwrite functions. Tidyverse is great for working with CSV files, but a lot of the memory and loading time is used for formatting. fread and fwrite are leaner and get the job done a bit faster. For regular-sized CSV files (like my reads2019 set), the time difference is pretty minimal. But for a 5GB datafile, it makes a huge difference.

library(tidyverse)

## -- Attaching packages ------------------------------------------- tidyverse 1.3.0 --

## ✓ ggplot2 3.2.1     ✓ purrr   0.3.3
## ✓ tibble  2.1.3     ✓ dplyr   0.8.3
## ✓ tidyr   1.0.0     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.4.0

## -- Conflicts ---------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

system.time(reads2019 <- read_csv("~/Downloads/Blogging A to Z/SaraReads2019_allchanges.csv",
                      col_names = TRUE))

## Parsed with column specification:
## cols(
##   Title = col_character(),
##   Pages = col_double(),
##   date_started = col_character(),
##   date_read = col_character(),
##   Book.ID = col_double(),
##   Author = col_character(),
##   AdditionalAuthors = col_character(),
##   AverageRating = col_double(),
##   OriginalPublicationYear = col_double(),
##   read_time = col_double(),
##   MyRating = col_double(),
##   Gender = col_double(),
##   Fiction = col_double(),
##   Childrens = col_double(),
##   Fantasy = col_double(),
##   SciFi = col_double(),
##   Mystery = col_double(),
##   SelfHelp = col_double()
## )

##    user  system elapsed 
##    0.00    0.10    0.14

rm(reads2019)

library(data.table)

## 
## Attaching package: 'data.table'

## The following objects are masked from 'package:dplyr':
## 
##     between, first, last

## The following object is masked from 'package:purrr':
## 
##     transpose

system.time(reads2019 <- fread("~/Downloads/Blogging A to Z/SaraReads2019_allchanges.csv"))

##    user  system elapsed 
##       0       0       0

But let's show how long it took to read my work datafile. Here's the elapsed time from the system.time output.

read_csv:
##    user  system elapsed 
##   61.14   11.72   90.56

fread:
##    user  system elapsed 
##   57.97   16.40   57.19

But the real win is in how quickly this package writes CSV data. Using a package called wakefield, I'll randomly generate 10,000,000 records of survey data, then see how long it takes to write the data to file using both write_csv and fwrite.

library(wakefield)

## Warning: package 'wakefield' was built under R version 3.6.3

## 
## Attaching package: 'wakefield'

## The following objects are masked from 'package:data.table':
## 
##     hour, minute, month, second, year

## The following object is masked from 'package:dplyr':
## 
##     id

set.seed(42)

reallybigshew <- r_data_frame(n = 10000000,
                              id,
                              race,
                              age,
                              smokes,
                              marital,
                              Start = hour,
                              End = hour,
                              iq,
                              height,
                              died)


system.time(write_csv(reallybigshew, "~/Downloads/Blogging A to Z/bigdata1.csv"))

##    user  system elapsed 
##  134.22    2.52  137.80

system.time(fwrite(reallybigshew, "~/Downloads/Blogging A to Z/bigdata2.csv"))

##    user  system elapsed 
##    8.65    0.32    2.77
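If you want to try fwrite and fread without generating 10 million rows, here's a smaller sketch that writes to a temporary file and reads it back. The survey table, its columns, and the row count are all made up for illustration; timings will vary by machine, so I'm not showing any output.

```r
library(data.table)

set.seed(42)

# A small made-up dataset; increase n to see a bigger timing gap
n <- 100000
survey <- data.table(id = seq_len(n),
                     age = sample(18:90, n, replace = TRUE),
                     smokes = sample(c(TRUE, FALSE), n, replace = TRUE))

# Write to a temporary file rather than a fixed path
tmp <- tempfile(fileext = ".csv")
write_time <- system.time(fwrite(survey, tmp))

# Read it back in and time that too
read_time <- system.time(survey_back <- fread(tmp))

write_time["elapsed"]
read_time["elapsed"]

# Confirm nothing was lost in the round trip
same_rows <- nrow(survey_back) == nrow(survey)
same_ages <- isTRUE(all.equal(survey$age, survey_back$age))

unlink(tmp)
```

Wrapping the call in system.time, as in the examples above, is the easiest way to compare it against write_csv and read_csv on your own data.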

V is for Verbs

2020-04-25T09:00:00.000-05:00

In this series, I've covered five terms for data manipulation:

arrange
filter
mutate
select
summarise

These are the verbs that make up the grammar of data manipulation. They all work with group_by to perform these functions groupwise.
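Here's a small sketch of what "groupwise" means for three of these verbs. The books tibble is invented for illustration and isn't part of my reading data.

```r
library(dplyr)

# Made-up reading log, for illustration only
books <- tibble(Genre = c("Fantasy", "Fantasy", "Mystery", "Mystery", "Mystery"),
                Pages = c(400, 600, 250, 300, 350))

# mutate: each book's length relative to its own genre's average
relative <- books %>%
  group_by(Genre) %>%
  mutate(DiffFromGenreAvg = Pages - mean(Pages)) %>%
  ungroup()

# filter: keep the longest book within each genre
longest <- books %>%
  group_by(Genre) %>%
  filter(Pages == max(Pages)) %>%
  ungroup()

# summarise: one row per genre
per_genre <- books %>%
  group_by(Genre) %>%
  summarise(Books = n(), AvgLength = mean(Pages))
```

One wrinkle worth knowing: arrange sorts the whole dataset and ignores the groups unless you add .by_group = TRUE.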

There are scoped versions of these verbs, which add _all, _if, or _at, that allow you to perform these verbs on multiple variables simultaneously. For instance, I could get means for all of my numeric variables like this. (Quick note: I created an updated reading dataset that has all publication years filled in. You can download it here.)

library(tidyverse)

## -- Attaching packages ------------------------------------------- tidyverse 1.3.0 --

## ✓ ggplot2 3.2.1     ✓ purrr   0.3.3
## ✓ tibble  2.1.3     ✓ dplyr   0.8.3
## ✓ tidyr   1.0.0     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.4.0

## -- Conflicts ---------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

reads2019 <- read_csv("~/Downloads/Blogging A to Z/SaraReads2019_allchanges.csv",
                      col_names = TRUE)

## Parsed with column specification:
## cols(
##   Title = col_character(),
##   Pages = col_double(),
##   date_started = col_character(),
##   date_read = col_character(),
##   Book.ID = col_double(),
##   Author = col_character(),
##   AdditionalAuthors = col_character(),
##   AverageRating = col_double(),
##   OriginalPublicationYear = col_double(),
##   read_time = col_double(),
##   MyRating = col_double(),
##   Gender = col_double(),
##   Fiction = col_double(),
##   Childrens = col_double(),
##   Fantasy = col_double(),
##   SciFi = col_double(),
##   Mystery = col_double(),
##   SelfHelp = col_double()
## )

reads2019 %>%
  summarise_if(is.numeric, list(mean))

## # A tibble: 1 x 13
##   Pages Book.ID AverageRating OriginalPublica… read_time MyRating Gender Fiction
##   <dbl>   <dbl>         <dbl>            <dbl>     <dbl>    <dbl>  <dbl>   <dbl>
## 1  341.  1.36e7          3.94            1989.      3.92     4.14  0.310   0.931
## # … with 5 more variables: Childrens <dbl>, Fantasy <dbl>, SciFi <dbl>,
## #   Mystery <dbl>, SelfHelp <dbl>

This function generated the mean for every numeric variable in my dataset. But even though they're all numeric, the mean isn't the best statistic for many of them, for instance book ID or publication year. We could just generate means for specific variables with summarise_at.

reads2019 %>%
  summarise_at(vars(Pages, AverageRating, read_time, MyRating), list(mean))

## # A tibble: 1 x 4
##   Pages AverageRating read_time MyRating
##   <dbl>         <dbl>     <dbl>    <dbl>
## 1  341.          3.94      3.92     4.14

You can also request more than one piece of information in your list, and request that R create a new label for each variable.

numeric_summary <- reads2019 %>%
  summarise_at(vars(Pages, AverageRating, read_time, MyRating), list("mean" = mean, "median" = median))
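To see how the new labels end up in the column names, here's a sketch with the built-in mtcars data standing in for my reading dataset (the choice of mpg and hp is arbitrary).

```r
library(dplyr)

# mtcars ships with R; it stands in for the reading data here
car_summary <- mtcars %>%
  summarise_at(vars(mpg, hp), list("mean" = mean, "median" = median))

# Each column is named variable_label, e.g. mpg_mean
names(car_summary)
```

You get one column per variable-statistic pair, all in a single row. In dplyr 1.0.0 and later, the scoped verbs are superseded by across(), though they still work.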

I use the basic verbs anytime I use R. I only learned about scoped verbs recently, and I'm sure I'll add them to my toolkit over time.

Next week is the last week of Blogging A to Z! See you then!

U is for Useful Trick

2020-04-24T09:00:00.000-05:00

This will be a very short post for a line of code I've found unbelievably useful as I analyze data for work. I'm working with datasets containing millions of rows of data. (The most recent one I worked with had about 13 million records.) Because R loads datasets into memory, you can run out of RAM pretty quickly when working with data that large. As I start getting access to more services for databasing and cloud computing, I'm hoping to move some of that data out of my own memory, and onto something with more memory. But for now, I found this quick fix.

I increased my paging file (virtual memory) on my computer as high as it will let me, but R doesn't automatically increase its memory limits. Fortunately, a single line of code will do that for you.

invisible(utils::memory.limit(64000))

Set that value to whatever your virtual memory is set to. (Note that this value is in MB, and that memory.limit only works on Windows.) Huge thanks to this Stack Overflow post that taught me how to do this.

Monday, I'll talk about some functions that allow you to more quickly read (and write) large files.

T is for Themes

2020-04-23T09:00:00.000-05:00

One of the easiest ways to make a beautiful ggplot is by using a theme. ggplot2 comes with a variety of pre-existing themes. I'll use the genre statistics summary table I created in yesterday's post, and create the same chart with different themes.

library(tidyverse)

## -- Attaching packages ------------------------------------------- tidyverse 1.3.0 --

## ✓ ggplot2 3.2.1     ✓ purrr   0.3.3
## ✓ tibble  2.1.3     ✓ dplyr   0.8.3
## ✓ tidyr   1.0.0     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.4.0

## -- Conflicts ---------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

reads2019 <- read_csv("~/Downloads/Blogging A to Z/SaraReads2019_allrated.csv",
                      col_names = TRUE)

## Parsed with column specification:
## cols(
##   Title = col_character(),
##   Pages = col_double(),
##   date_started = col_character(),
##   date_read = col_character(),
##   Book.ID = col_double(),
##   Author = col_character(),
##   AdditionalAuthors = col_character(),
##   AverageRating = col_double(),
##   OriginalPublicationYear = col_double(),
##   read_time = col_double(),
##   MyRating = col_double(),
##   Gender = col_double(),
##   Fiction = col_double(),
##   Childrens = col_double(),
##   Fantasy = col_double(),
##   SciFi = col_double(),
##   Mystery = col_double(),
##   SelfHelp = col_double()
## )

genrestats <- reads2019 %>%
  filter(Fiction == 1) %>%
  arrange(OriginalPublicationYear) %>%
  group_by(Childrens, Fantasy, SciFi, Mystery) %>%
  summarise(Books = n(),
            WomenAuthors = sum(Gender),
            AvgLength = mean(Pages),
            AvgRating = mean(MyRating))

genrestats <- genrestats %>%
  bind_cols(Genre = c("General Fiction",
                   "Mystery",
                   "Science Fiction",
                   "Fantasy",
                   "Fantasy SciFi",
                   "Children's Fiction",
                   "Children's Fantasy")) %>%
  ungroup() %>%
  select(Genre, everything(), -Childrens, -Fantasy, -SciFi, -Mystery)

genre <- genrestats %>%
  ggplot(aes(Genre, Books)) +
  geom_col() +
  scale_y_continuous(breaks = seq(0,20,1))

Since I've created a new object for my figure, I can add a theme by typing genre + [theme]. Here's a handful of the ggplot2 themes.
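As a sketch of the genre + [theme] pattern, here are a few of the themes that ship with ggplot2. I'm using a small made-up data frame (demo_counts) so the code runs on its own; swap in the genre object above to see it with the real counts.

```r
library(ggplot2)

# Stand-in for the genre counts; these numbers are invented
demo_counts <- data.frame(Genre = c("Fantasy", "Mystery", "Sci-Fi"),
                          Books = c(19, 9, 19))

demo_plot <- ggplot(demo_counts, aes(Genre, Books)) +
  geom_col()

# Each of these themes ships with ggplot2; adding one returns a new plot
themed <- list(bw = demo_plot + theme_bw(),
               minimal = demo_plot + theme_minimal(),
               classic = demo_plot + theme_classic(),
               dark = demo_plot + theme_dark())

# Print any one of them to see it
themed$minimal
```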

You can also get more themes with additional packages. My new favorite is ggthemes. I've been loving their Economist themes (particularly economist_white), which I've been using for most of the plots I create at work. Here are some of my favorites.

You can also customize different elements of the plot with theme(). For instance, theme(plot.title = element_text(hjust = 0.5)) centers your plot title. theme(legend.position = "none") removes the legend. You could do both of these at once within the same theme() by separating them with commas. This is a great way to tweak tiny elements of your plot, or if you want to create your own custom theme.

library(ggthemes)

## Warning: package 'ggthemes' was built under R version 3.6.3

genre +
  theme_economist_white() +
  theme(plot.background = element_rect(fill = "lightblue"))
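The two tweaks mentioned above (centering the title and removing the legend) can go in one theme() call. Here's a sketch with a made-up data frame and title, since neither comes from my reading data.

```r
library(ggplot2)

# Invented counts, for illustration only
demo_counts <- data.frame(Genre = c("Fantasy", "Mystery", "Sci-Fi"),
                          Books = c(19, 9, 19))

tweaked <- ggplot(demo_counts, aes(Genre, Books, fill = Genre)) +
  geom_col() +
  labs(title = "Books by Genre") +
  theme_minimal() +
  # Both tweaks in a single theme() call, separated by a comma
  theme(plot.title = element_text(hjust = 0.5),
        legend.position = "none")
```

Put your theme() tweaks after the complete theme (here theme_minimal). If the complete theme comes last, it replaces your tweaks.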

These themes also have color schemes you can add to your plot. We'll talk about that soon!

S is for summarise

2020-04-22T09:00:00.000-05:00

Today, we'll finally talk about summarise! It's very similar to mutate, but instead of adding or altering a variable in a dataset, it aggregates your data, creating a new tibble with the columns containing your requested summary data. The number of rows will be equal to the number of groups from group_by (if you don't specify any groups, your tibble will have one row that summarizes your entire dataset).
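Here's a quick sketch of that row-count difference, using the built-in mtcars data as a stand-in (grouping by cyl is just a convenient example with three groups).

```r
library(dplyr)

# mutate keeps every row and adds a column
with_avg <- mtcars %>%
  group_by(cyl) %>%
  mutate(avg_mpg = mean(mpg))
nrow(with_avg)

# summarise collapses to one row per group
by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg))
nrow(by_cyl)

# With no group_by, the result is a single row for the whole dataset
overall <- mtcars %>%
  summarise(avg_mpg = mean(mpg))
nrow(overall)
```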

These days, when I want descriptive statistics from a dataset, I generally use summarise, because I can specify the exact statistics I want in the exact order I want (for easy pasting of tables into a report or presentation).

Also, if you're not a fan of the UK spelling, summarize works exactly the same. The same is true of other R/tidyverse functions, like color versus colour.

Let's load the reads2019 dataset and start summarizing!

library(tidyverse)

## -- Attaching packages ------------------------------------------- tidyverse 1.3.0 --

## ✓ ggplot2 3.2.1     ✓ purrr   0.3.3
## ✓ tibble  2.1.3     ✓ dplyr   0.8.3
## ✓ tidyr   1.0.0     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.4.0

## -- Conflicts ---------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

reads2019 <- read_csv("~/Downloads/Blogging A to Z/SaraReads2019_allrated.csv",
                      col_names = TRUE)

## Parsed with column specification:
## cols(
##   Title = col_character(),
##   Pages = col_double(),
##   date_started = col_character(),
##   date_read = col_character(),
##   Book.ID = col_double(),
##   Author = col_character(),
##   AdditionalAuthors = col_character(),
##   AverageRating = col_double(),
##   OriginalPublicationYear = col_double(),
##   read_time = col_double(),
##   MyRating = col_double(),
##   Gender = col_double(),
##   Fiction = col_double(),
##   Childrens = col_double(),
##   Fantasy = col_double(),
##   SciFi = col_double(),
##   Mystery = col_double(),
##   SelfHelp = col_double()
## )

First, we could use summarise to give us some basic descriptives of the whole dataset. If we want to save the results to a tibble, we would give it a new name, or we could just have it display those results and not save them. Here's what happens when I request a summary without saving a new tibble.

reads2019 %>%
  summarise(AllPages = sum(Pages),
            AvgLength = mean(Pages),
            AvgRating = mean(MyRating),
            AvgReadTime = mean(read_time),
            ShortRT = min(read_time),
            LongRT = max(read_time),
            TotalAuthors = n_distinct(Author))

## # A tibble: 1 x 7
##   AllPages AvgLength AvgRating AvgReadTime ShortRT LongRT TotalAuthors
##      <dbl>     <dbl>     <dbl>       <dbl>   <dbl>  <dbl>        <int>
## 1    29696      341.      4.14        3.92       0     25           42

Now, let's create a summary where we do save it as a tibble. And let's have it create some groups for us. In the dataset, I coded author gender, with female authors coded as 1, so I can find out how many women writers are represented in a group by summing that variable. I also want to fill in a few missing publication dates, which seems to happen for Kindle versions of books or books by small publishers. This will let me find out my newest and oldest books in each group; I just arrange by publication year, then request last and first, respectively. Two books were published in 2019, so I'll replace the others based on title, then have R give the remaining NAs a year of 2019.

reads2019 %>%
  filter(is.na(OriginalPublicationYear)) %>%
  select(Title)

## # A tibble: 5 x 1
##   Title                                                                         
##   <chr>                                                                         
## 1 Empath: A Complete Guide for Developing Your Gift and Finding Your Sense of S…
## 2 Perilous Pottery (Cozy Corgi Mysteries, #11)                                  
## 3 Precarious Pasta (Cozy Corgi Mysteries, #14)                                  
## 4 Summerdale                                                                    
## 5 Swarm Theory

reads2019 <- reads2019 %>%
  mutate(OriginalPublicationYear = replace(OriginalPublicationYear,
                                           Title == "Empath: A Complete Guide for Developing Your Gift and Finding Your Sense of Self", 2017),
         OriginalPublicationYear = replace(OriginalPublicationYear,
                                           Title == "Summerdale", 2018),
         OriginalPublicationYear = replace(OriginalPublicationYear,
                                           Title == "Swarm Theory", 2016),
         OriginalPublicationYear = replace_na(OriginalPublicationYear, 2019))
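The same two-step pattern is easier to see on a tiny example. The shelf tibble, its titles, and its years are hypothetical; the point is just that replace fixes specific rows and replace_na catches whatever is still missing.

```r
library(dplyr)
library(tidyr)

# Hypothetical titles and years, standing in for the real reading data
shelf <- tibble(Title = c("Book A", "Book B", "Book C"),
                Year = c(1999, NA, NA))

shelf_filled <- shelf %>%
  mutate(Year = replace(Year, Title == "Book B", 2016), # targeted fix by title
         Year = replace_na(Year, 2019))                 # everything still NA

shelf_filled
```

Order matters here: if replace_na ran first, "Book B" would get 2019 and then be overwritten with 2016 anyway, but running the targeted fixes first makes the intent clearer.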

genrestats <- reads2019 %>%
  filter(Fiction == 1) %>%
  arrange(OriginalPublicationYear) %>%
  group_by(Childrens, Fantasy, SciFi, Mystery) %>%
  summarise(Books = n(),
            WomenAuthors = sum(Gender),
            AvgLength = mean(Pages),
            AvgRating = mean(MyRating),
            NewestBook = last(OriginalPublicationYear),
            OldestBook = first(OriginalPublicationYear))
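Here's a sketch of why arranging before grouping makes first and last give the oldest and newest years. The shelf tibble is invented for illustration; the min/max columns are my own addition as a cross-check and aren't in the code above.

```r
library(dplyr)

# Made-up genres and publication years
shelf <- tibble(Genre = c("Fantasy", "Mystery", "Fantasy", "Mystery"),
                Year = c(2008, 1950, 1900, 2019))

span <- shelf %>%
  arrange(Year) %>%
  group_by(Genre) %>%
  summarise(OldestBook = first(Year),
            NewestBook = last(Year),
            # min/max give the same answer without depending on row order
            OldestCheck = min(Year),
            NewestCheck = max(Year))

span
```

Because the rows are sorted by year before grouping, the first row in each group is the oldest book and the last is the newest.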

Now let's turn this summary into a nicer, labeled table.

genrestats <- genrestats %>%
  bind_cols(Genre = c("General Fiction",
                   "Mystery",
                   "Science Fiction",
                   "Fantasy",
                   "Fantasy SciFi",
                   "Children's Fiction",
                   "Children's Fantasy")) %>%
  ungroup() %>%
  select(Genre, everything(), -Childrens, -Fantasy, -SciFi, -Mystery)

library(expss)

## 
## Attaching package: 'expss'

## The following objects are masked from 'package:stringr':
## 
##     fixed, regex

## The following objects are masked from 'package:dplyr':
## 
##     between, compute, contains, first, last, na_if, recode, vars

## The following objects are masked from 'package:purrr':
## 
##     keep, modify, modify_if, transpose

## The following objects are masked from 'package:tidyr':
## 
##     contains, nest

## The following object is masked from 'package:ggplot2':
## 
##     vars

as.etable(genrestats, rownames_as_row_labels = NULL)

Genre	Books	WomenAuthors	AvgLength	AvgRating	NewestBook	OldestBook
General Fiction	15	10	320.1	4.1	2019	1941
Mystery	9	8	316.3	3.8	2019	1950
Science Fiction	19	4	361.4	4.4	2019	1959
Fantasy	19	3	426.3	4.2	2019	1981
Fantasy SciFi	2	0	687.0	4.5	2009	2006
Children's Fiction	1	0	181.0	4.0	2016	2016
Children's Fantasy	16	1	250.6	4.2	2008	1900

I could have used other base R functions in my summary as well - such as sd, median, min, max, and so on. You can also summarize a dataset and create a plot of that summary in the same code.

library(ggthemes)

## Warning: package 'ggthemes' was built under R version 3.6.3

reads2019 %>%
  mutate(Gender = factor(Gender, levels = c(0,1),
                         labels = c("Male",
                                    "Female")),
         Fiction = factor(Fiction, levels = c(0,1),
                          labels = c("Non-Fiction",
                                     "Fiction"),
                          ordered = TRUE)) %>%
  group_by(Gender, Fiction) %>%
  summarise(Books = n()) %>%
  ggplot(aes(Fiction, Books)) +
  geom_col(aes(fill = reorder(Gender, desc(Gender)))) +
  scale_fill_economist() +
  xlab("Genre") +
  labs(fill = "Author Gender")