Tuesday, January 2, 2018

Data Analysis of My 2017 Reading

This year, I'd like to do more data analysis of things that interest me. While some of it might be publishable in actual journals, a lot of it will be fluffy stuff best suited for my blog. Today's post is some of that fluffy analysis, using the books I read in 2017.

Thanks to Goodreads, I have a lot of data available on my reading habits - when I started and finished each book, how I rated each book (and how that compares to the average Goodreads rating), and what genres I tend to gravitate toward. I pulled my Goodreads data into an Excel spreadsheet, which included the following variables:
  • Book title, string
  • Author, string
  • Goodreads Rating (Average of every rating for this book), numeric from 1 to 5
  • My Rating, integer from 1 to 5
  • Start Date
  • Finish Date
I then calculated/created the following variables:
  • A rounded version of the Goodreads rating, since I can only give whole-number ratings from 1 to 5
  • A leniency indicator, my rating minus the rounded rating
  • Number of pages (pulled from Amazon)
  • Read time, the number of days it took me to finish
  • Title length, the number of characters in the title
  • Wait, the number of days between finishing one book and starting another; NA if I was reading 2 books at once and was still actively reading another book when I finished one
  • An indicator of author gender
  • A fiction indicator
  • And indicators for the following genres
    • Biography & Memoir
    • Children's Fiction
    • Comedy
    • Economics
    • Experimental Fiction
    • Fantasy
    • History
    • Horror
    • Literary Fiction
    • Math & Statistics
    • Mystery
    • Programming
    • Science
    • Science Fiction
    • Writing
    • YA Fiction
First, I did some simple counts of genres using a Pivot Table and created a bar chart:

As the chart shows, my favorite genre appears to be fantasy (I read 19 books in this genre) followed by math and statistics (11 books). I was surprised at the number of young adult fiction books, because I only picked up one book from that section in the bookstore; the rest I found in general fiction or on a fiction display table. I only found out later (i.e., when I was pulling together this dataset) that they were considered YA fiction. I should note that the counts in the graph above add up to more than 53, because books could belong to more than 1 genre. For instance, last year I read Storm Front, book 1 of the Dresden Files series by Jim Butcher, which qualified as both fantasy (the main character is a wizard) and mystery (he's also a private investigator who solves weird cases often involving the supernatural). On the nonfiction front, The River of Doubt: Theodore Roosevelt's Darkest Journey by Candice Millard qualified as both biography & memoir and history.

I also learned that I read more books by men than women; nearly 70% of the books I read were by men. I was more evenly split on fiction versus nonfiction; 56.6% were fiction. Additionally, my average rating was 4.09, and my average leniency (my rating minus the rounded Goodreads rating) was only 0.09, meaning I rated books just slightly higher than Goodreads users did on average.

All interesting stuff, but let's do some higher-level analysis. For that, I brought in a much better tool than Excel: R. I saved my Excel file as a CSV file and pulled it into R.
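Pulling a CSV into R takes one call to read.csv(). Here is a minimal sketch of that step. The rows, the Title column name, and the temporary file are all invented so the example runs on its own; they are not my actual file or data.

```r
# Toy stand-in for the exported spreadsheet: these rows are invented,
# and the file is a temporary placeholder, not my real CSV.
toy <- data.frame(
  Title     = c("Book A", "Book B", "Book C"),  # hypothetical column name
  My_Rating = c(4, 5, 3),
  Pages     = c(320, 410, 198),
  Read_Time = c(6, 9, 3)
)
csv_path <- tempfile(fileext = ".csv")
write.csv(toy, csv_path, row.names = FALSE)

# Read the CSV into a data frame; keep text columns as character strings
books <- read.csv(csv_path, stringsAsFactors = FALSE)

# Quick checks that the import worked
str(books)
mean(books$My_Rating)
```

With a real export, the only line that matters is the read.csv() call, pointed at wherever the file was saved.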

First, let's see what kinds of variables predict the rating I give a book. I only have 53 observations, so I need to be choosy about the variables I include in a linear model. It looks like I'm pretty much in line with the average Goodreads rating, so I'll leave that variable out; it doesn't tell me much. 

Here are some variables I think would impact my rating: number of pages, reading time, author gender, and whether the book was fiction. I'd like to include genre, but I don't want to include all 16; I'll include the top 3 genres, which would be fantasy, math & statistics, and YA fiction. But before I get started with my linear model, I want to examine correlations between some of these variables. For instance, number of pages and reading time are likely correlated with each other. I want to make sure they're not so highly correlated that one could be a proxy for another. 
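A check like this can be sketched with cor(). The values below are made up for illustration and are not my reading data; only the column names Pages and Read_Time match my dataset.

```r
# Invented toy values, NOT my real reading data
books <- data.frame(
  Pages     = c(150, 220, 300, 340, 410, 480, 520, 610),
  Read_Time = c(3, 8, 4, 10, 5, 12, 7, 9)
)

# Pearson correlation between book length and days to finish
r <- cor(books$Pages, books$Read_Time)

# Squaring r gives the proportion of variance the two variables share
shared_variance <- r^2

r
shared_variance
```

Squaring is what turns a correlation into shared variance: a correlation of 0.4, for example, works out to 0.16, or 16%.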


The correlation between these two variables is moderate, but I could probably get away with including both in the model, since there's only about 16% shared variance between the two (the squared correlation). However, I should note one issue with the reading time variable. Two books were read around November, when I was spending most of my time writing for NaNoWriMo, so these two books have very high reading time values: 17 days (started in October and finished in November) and 21 days (started in November and finished in December). I also have one other book that took 16 days to finish, but this was a course textbook, and I read it following the course schedule. The longest read time not impacted by the course schedule or NaNoWriMo is 15 days. So I could try running the correlation again without those three books with unusual values. I can do this pretty easily with a filter.

library(dplyr)  # filter() here is dplyr's, not base R's stats::filter()
filtered <- filter(books, Read_Time < 16)

This gives me a separate dataset that drops those 3 books. If I run my correlation on this dataset, I get an even lower correlation, 0.31. I expected the correlation to go up, not down, without these 3 cases. Either way, it seems like including both number of pages and reading time in the same linear model is not going to cause any issues.
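For anyone without dplyr installed, the same comparison can be sketched in base R with subset(). The numbers below are invented toy values, not my data, so the correlations they produce won't match mine.

```r
# Invented toy values, NOT my real reading data
books <- data.frame(
  Pages     = c(180, 250, 300, 360, 420, 500, 275, 610),
  Read_Time = c(4, 6, 5, 9, 17, 21, 16, 12)
)

# Base-R equivalent of the dplyr filter: keep books finished in under 16 days
filtered <- subset(books, Read_Time < 16)

# Compare the correlation with and without the long-read books
cor_all      <- cor(books$Pages, books$Read_Time)
cor_filtered <- cor(filtered$Pages, filtered$Read_Time)

nrow(books) - nrow(filtered)  # number of books dropped
cor_all
cor_filtered
```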

Just for fun, I also produced histograms for those two variables. Number of pages was approximately normal, but reading time was positively skewed. (I include code for one of the histograms below, to show how it's done; as you can see, the code is pretty easy.)

hist(books$Pages, main="Book Length", xlab="Number of Pages")
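The reading time histogram uses the same call with a different column. Below is a small sketch on invented values (not my data), plus a rough way to check for positive skew without eyeballing the plot. The title and axis label are just my guesses at sensible wording.

```r
# Invented toy values, NOT my real reading data
books <- data.frame(Read_Time = c(2, 3, 3, 4, 4, 5, 5, 6, 7, 9, 12, 17, 21))

# Same call as for pages, pointed at reading time instead
hist(books$Read_Time, main = "Reading Time", xlab = "Number of Days")

# Rough skew check: a mean above the median suggests a long right tail
mean(books$Read_Time) > median(books$Read_Time)
```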

I also looked at the relationship between two of my genres, fantasy and YA fiction, since I remembered that many books fell in both. Turns out, all of the YA fiction books were also fantasy. (In the table below, rows are the fantasy indicator and columns are the YA fiction indicator.)


     0  1
  0 34  0
  1 13  6
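A cross-tab like the one above can come from table(). This is a sketch on invented indicator values, not my data; the dnn argument just labels the rows and columns so they are easier to read.

```r
# Invented toy indicators (1 = book is in the genre), NOT my real data
books <- data.frame(
  Fantasy = c(0, 0, 0, 1, 1, 1, 1, 0),
  YA_Fic  = c(0, 0, 0, 0, 1, 1, 0, 0)
)

# Rows are Fantasy, columns are YA_Fic; dnn labels the two dimensions
genre_tab <- table(books$Fantasy, books$YA_Fic, dnn = c("Fantasy", "YA_Fic"))
genre_tab

# Count of YA books that are NOT fantasy (row "0", column "1")
genre_tab["0", "1"]
```

A zero in that cell is what tells me every YA book is also a fantasy book.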

I could still include both, and see what additional impact YA fiction as a genre has on ratings. But because the YA books are a subset of the fantasy books, I could also see some strange results. Let's run a linear model, using My Rating as the y-variable, and number of pages, reading time, author gender, and indicators for fiction and my top 3 genres (fantasy, math & statistics, YA fiction) as predictors. (And then I'll run it once more without YA fiction.)

myrating<-lm(My_Rating ~ Pages + Read_Time + Gender + Fiction + Fantasy + Math_Stats + YA_Fic, data=books)

And here's a screenshot of my results:

So number of pages was significant, even controlling for reading time; on average, longer books got higher ratings from me. Gender of the author had no impact on ratings. While fiction books didn't necessarily get higher ratings than nonfiction books, I tended to give fantasy books higher ratings, and young adult fiction lower ratings. When I ran the model again without YA fiction, the fantasy effect disappeared, which makes sense: without the YA indicator, the fantasy coefficient lumps the YA fantasy books I rated lower together with the other fantasy books I rated higher. It appears I judge YA fantasy more harshly than other fantasy. Perhaps this year, I'll try reading some YA fiction that isn't fantasy and see what happens.
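Rerunning the model without one predictor doesn't require retyping the formula; update() can drop a term. Here is a sketch on simulated data. Everything in the data frame is randomly generated (including the 0/1 coding for Gender, which is an assumption for the sketch), so the coefficients it prints say nothing about my actual results.

```r
# Simulated toy data, NOT my real reading data; Gender is coded 0/1 here
# purely as an assumption for the sketch
set.seed(2017)
n <- 53
books <- data.frame(
  Pages      = round(runif(n, 100, 700)),
  Read_Time  = sample(1:21, n, replace = TRUE),
  Gender     = rbinom(n, 1, 0.3),
  Fiction    = rbinom(n, 1, 0.5),
  Fantasy    = rbinom(n, 1, 0.35),
  Math_Stats = rbinom(n, 1, 0.2),
  My_Rating  = sample(2:5, n, replace = TRUE)
)
# Make every YA book a fantasy book as well, mirroring the overlap in my dataset
books$YA_Fic <- books$Fantasy * rbinom(n, 1, 0.3)

# Full model, same formula as above
myrating <- lm(My_Rating ~ Pages + Read_Time + Gender + Fiction +
                 Fantasy + Math_Stats + YA_Fic, data = books)

# Refit without YA fiction by removing that one term
myrating_no_ya <- update(myrating, . ~ . - YA_Fic)

# Coefficient tables: estimates, standard errors, t values, p values
summary(myrating)$coefficients
summary(myrating_no_ya)$coefficients
```

Comparing the Fantasy row across the two tables is the quickest way to see how much that coefficient depends on having YA_Fic in the model.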

Some other fun factoids from this dataset: the book with the longest title, The Theory That Would Not Die: How Bayes' Rule Cracked the Enigma Code, Hunted Down Russian Submarines, and Emerged Triumphant from Two Centuries of Controversy by Sharon Bertsch McGrayne (161 characters), was read right before the book with the shortest title, Theatre by David Mamet (7 characters). I read 6 books by Neil Gaiman (and 1 cowritten by him), the most of any author.

What do you think? Any other relationships I should explore?

