Sunday, January 21, 2018

Statistics Sunday: Violin Plots

Last week, I described box plots, a staple of introductory statistics courses. But Paul Hanel, a post-doc at the University of Bath, was kind enough to share with me a better way of visualizing data than a box plot: a violin plot.

I'd honestly never heard of a violin plot before, but it combines the best qualities of two forms of data display: the box plot and the probability density plot (the shape of the distribution; see this post on the histogram).

To demonstrate the violin plot, I'll be using R and the ggplot2 package. ggplot2 and other members of the so-called tidyverse in R, like dplyr, are designed to work with tidy objects. A tidy dataframe, for instance, is one in which each column is a variable and each row is a case. ggplot2 gives you a lot of control over the appearance of your graphics and is really the best way to go in creating publication-quality graphics. Sometimes I get lazy and just use base R, like when I simply want to quickly visualize data so I can move on to the next step. (Yes, yes, I know ggplot2 also has some qplot (quickplot) options for that very purpose.)
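
As a minimal illustration of the tidy layout described above, here is a tiny data frame in base R. The object name tidy_books, the titles, and the values are all made up for this sketch; they are not taken from the real reading data.

```r
# Hypothetical mini reading log: each row is one book (a case),
# and each column is one variable measured on that book
tidy_books <- data.frame(
  Title   = c("Book A", "Book B", "Book C"),
  Pages   = c(320, 210, 450),
  Fiction = c(1, 0, 1)
)

# Three cases (rows) by three variables (columns)
dim(tidy_books)
str(tidy_books)
```

Because every variable lives in its own column, a plotting call can refer to Pages or Fiction by name without any reshaping first.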

So let's demonstrate the violin plot using my own reading data - this is a dataset of the 53 books I read in 2017. In this tidy dataset, each case is a book, and each book has multiple variables addressing things like page length, genre, and author gender. I'll read my data in and also make sure some of the key variables are converted to factors - this will become important when I start creating my violin plots.

library(ggplot2)
# Read in the reading log, then convert the genre indicators to labeled factors
books<-read.csv("2017_books.csv", header = TRUE)
books$Fiction<-factor(books$Fiction,labels=c("Nonfiction", "Fiction"))
books$Fantasy<-factor(books$Fantasy, labels=c("Non-Fantasy", "Fantasy"))
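
To see what the factor conversion is doing, here is a small base R sketch. It assumes the genre columns are stored as 0/1 indicators, which the post does not state outright; fiction_raw and fiction_factor are hypothetical names used only for this illustration.

```r
# Hypothetical 0/1 indicator standing in for a column like books$Fiction
# (assumption: 0 = nonfiction, 1 = fiction)
fiction_raw <- c(1, 0, 1, 1, 0)

# factor() sorts the unique values (0, then 1) and applies the labels in that order
fiction_factor <- factor(fiction_raw, labels = c("Nonfiction", "Fiction"))

# Inspect the resulting levels and the count in each category
levels(fiction_factor)
table(fiction_factor)
```

With a factor, ggplot2 treats the variable as categorical, so each level gets its own bar or its own violin rather than being placed on a continuous 0-to-1 axis.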

A couple of bar plots show me that I was almost evenly split on fiction v. nonfiction (I read 30 fiction and 23 nonfiction) and that I read 19 fantasy books (approximately two-thirds of my fiction books).

fiction<-ggplot(books, aes(Fiction))+geom_bar()
fiction+labs(title="Type of Books Read in 2017")+
  ylab("Number of Books")+
  xlab("Type of Book")

fantasy<-ggplot(books, aes(Fantasy))+geom_bar()
fantasy+labs(title="Fantasy Books Read in 2017")+
  ylab("Number of Books")

I could use either of these variables to better understand my sample of books. For instance, I gave fantasy books higher ratings and also gave longer books higher ratings. But how might genre and page length relate to each other? I could potentially visualize number of pages for fantasy and nonfantasy books.

Rather than just typing out the syntax (as I did above), I'll walk through it. First up, always name your objects in R, whatever they may be - data, graphics, etc. In the case of the graphics above, it makes it easier to add on some formatting, and since ggplot2 can be used to do even more interesting things with formatting, your code can get rather long. Mostly, it's a good habit to get into. As Vanessa Ives says in Penny Dreadful (which I've just started watching), "you have to name a thing before it comes to life."

Next, we're using the ggplot syntax to create these graphics. As I mention above, there are also qplots, which have fewer options. qplots are good for quick visuals to check in on the data. ggplot gives you a lot more options and is good for more complex plots and creating publication-quality images. In parentheses, we name the dataset we'll be using, books, and in the nested function, aes, which is short for aesthetics, we'll dictate the variables we're using in the display and how to use variables to color-code results. Unlike the bar plots above, we'll be using two variables in our display, so we need to define them both. Finally, + geom_ tells R which type of visualization we're using. In this case, we'll be using + geom_violin. As you can see, you can embed further aes information within the geom_ as well.

fantlength<-ggplot(books, aes(x = Fantasy, y = Pages)) + geom_violin(trim = FALSE, aes(fill = Fantasy))

I'm not finished formatting my graphic yet, but here's our first look at the violin plot.

The typical fantasy book seems to be slightly longer than nonfantasy books, but not by much, and nonfantasy books have a wider range. Of course, I should note that right now, these violin plots are really just sideways density plots (think smoothed histograms), mirrored around the center line. To complete them, I need to layer a boxplot on top of them. Since I named my object, I can just add that on.
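
The layering step described above can be sketched as follows. This is a self-contained illustration under stated assumptions, not the exact call behind the figures in this post: toy_books is simulated stand-in data (53 rows, 19 of them fantasy), the names toy_violin and toy_violin_box are made up, and width = 0.2 is an arbitrary value chosen so the box sits inside the violin.

```r
library(ggplot2)

# Simulated stand-in for the post's `books` data (NOT the real reading list)
set.seed(2017)
toy_books <- data.frame(
  Fantasy = factor(rep(c("Non-Fantasy", "Fantasy"), times = c(34, 19)),
                   levels = c("Non-Fantasy", "Fantasy")),
  Pages = round(c(rnorm(34, mean = 330, sd = 130),
                  rnorm(19, mean = 380, sd = 90)))
)

# Violin layer first, built the same way as in the post
toy_violin <- ggplot(toy_books, aes(x = Fantasy, y = Pages)) +
  geom_violin(trim = FALSE, aes(fill = Fantasy))

# Add a narrow boxplot as a second layer on top of the named plot object
toy_violin_box <- toy_violin + geom_boxplot(width = 0.2)
toy_violin_box
```

With the real data, the same idea would apply to the named object from earlier: add a geom_boxplot() layer to fantlength with a + sign.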


The median for book length is similar but slightly higher for Fantasy books. They have similar interquartile ranges, but the nonfantasy books have a wider overall range and more outliers. It's not a very surprising conclusion that the nonfantasy books are a much more heterogeneous group, including both fiction and nonfiction books across a wide variety of genres.
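
The numbers a boxplot draws (median, interquartile range, overall range) can also be computed directly. Here is a minimal base R sketch using made-up page counts; toy_books and its values are hypothetical and chosen only so the pattern resembles the one described above.

```r
# Hypothetical page counts, five books per group (not the real data)
toy_books <- data.frame(
  Fantasy = factor(rep(c("Non-Fantasy", "Fantasy"), each = 5),
                   levels = c("Non-Fantasy", "Fantasy")),
  Pages = c(120, 250, 300, 350, 900,   # Non-Fantasy
            280, 320, 360, 400, 450)   # Fantasy
)

# Median page count per group (the line inside each box)
medians <- tapply(toy_books$Pages, toy_books$Fantasy, median)

# Interquartile range per group (the height of each box)
iqrs <- tapply(toy_books$Pages, toy_books$Fantasy, IQR)

# Overall range per group (longest book minus shortest book)
ranges <- tapply(toy_books$Pages, toy_books$Fantasy, function(p) diff(range(p)))

medians
iqrs
ranges
```

In this toy example the two groups have similar interquartile ranges, but one very long nonfantasy book stretches that group's overall range, which is the kind of pattern the violin and boxplot make visible at a glance.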

Really digging into ggplot2 would take more than one post - in fact multiple - but I should at least show at this point how to do a few things you'd need if you wanted to present this violin plot somewhere. That is, you'd want to add better axis titles and an overall title. The overall title, by default, is left-aligned, so I also need to add a command to center it. Plus, the legend is superfluous since you have the x-axis labels. So adding those things in to the original violin plot would look like this:

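fantlength+geom_boxplot(width=0.2)+  # box width is a guess; any narrow value works
  labs(title="Page Length by Genre")+
  ylab("Number of Pages")+
  theme(legend.position="none", plot.title=element_text(hjust=0.5))  # drop legend, center title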

Note you need to add the boxplot command again, since you're referencing the original plot, fantlength. But as you can see, you can add multiple lines with the + sign. Those commands give you this:

Many of these commands are standard and could be used with any kind of ggplot. More on that later!

Still taking requests on stats topics to cover. Let me know in the comments below!

I'm also working on cleaning up my labels and generating multiple labels for different statistics topics. This should make it easier to navigate the various statistics posts on my blog. Stay tuned for that.

1 comment:

  1. Dear Sara,

    thank you for this post on violin plots! A similar kind of plot which I've come to appreciate a lot is the pirateplot (included in the R package "yarrr" by Nathaniel Phillips). It basically adds an indicator for central tendency and a CI or HDI and shows the raw data. Here is a link to the vignette of the pirateplot:

    Kind Regards,