Deeply Trivial: More Reading Data Analysis

Last week, I wrote a blog post in which I analyzed my reading habits from 2017. I had so much fun pulling that data together and playing with it that I took things a step further: I decided to look at my friends' 2017 reading habits as well.

I included all friends on Goodreads who logged at least one book as read in 2017. This gives me data on how many books (and which ones) were read by friends who use Goodreads to log their reading activity. I did not include friends who logged 0 books, because there’s no way of knowing whether they 1) did not read at all in 2017, 2) did not log the books they read, or 3) logged books without adding a read date. This resulted in a dataset of 40 friends and a total of 692 books.

Other things I should note about the data:

The dataset isn’t as complete as the one I analyzed for myself; this one includes book title, author, page length, and indicators of which reader(s) logged that book. I didn’t include start/read dates, genres, or rating data. (I originally thought about including ratings, but there was surprisingly little overlap among my friends in books read, so that limited analysis options. I may still pull in genre, though.)
Goodreads only gave me first author in my data pull. There are definitely books in the dataset that have multiple authors, but for the sake of simplicity, all author analyses were performed on first author only. Once again, I can pull in these data later if they end up being useful.
When I looked at page counts for readers, I noticed a few very long books, so I examined these books to make sure they were not box sets logged as single books. Most were simply very long books, but two instances were in fact multiple books; one case was a 5000+ page book that was actually a 22-book eBook compilation. For these two cases, I updated book read counts and page numbers to reflect the number of actual books, resulting in a different number of books read in the dataset than would be displayed for the person on Goodreads. But this was important when I started doing analysis on page lengths – my histograms and box plots were shrunk on one side to make up for extreme outliers that were not actually reflective of real book length.
To track individual readers, I used reader initials, which I then converted into a numeric code to protect reader identity. Should anyone express an interest in playing with this dataset, I’d be able to share it with no identifying information included.
A few friends logged audiobooks, which have strange page counts. (For instance, a 1.5 hour audiobook came in at 10 pages! 1.5 hours isn't a long book, but it's certainly longer than 10 pages.) If I could find a print copy of the audiobook either on Goodreads or Amazon, I used that page count. But that left 5 books without real page counts. Information I found online suggested audiobooks are approximately 9300 words per hour, and that a printed book has about 300 words per page. So I used the following conversion: (audiobook length in hours * 9300)/300. This is a gross approximation, but since it only affected 5 books out of 692, I’m okay with it.
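The audiobook conversion above can be written as a small function. This is only a sketch of the formula as stated; the function and constant names are my own, and the two rates are the rough figures quoted in the post, not precise values.

```python
# Rough rates quoted in the post; both are approximations.
WORDS_PER_AUDIO_HOUR = 9300
WORDS_PER_PRINTED_PAGE = 300


def estimate_pages(audio_hours: float) -> float:
    """Approximate a print page count from an audiobook's length in hours."""
    if audio_hours < 0:
        raise ValueError("audio_hours must be non-negative")
    return audio_hours * WORDS_PER_AUDIO_HOUR / WORDS_PER_PRINTED_PAGE


# The 1.5 hour audiobook mentioned above, which Goodreads listed at 10 pages.
print(estimate_pages(1.5))  # 46.5
```

The function returns a fractional page count and leaves rounding to the caller, since the estimate is coarse either way.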

Findings

The 9 most popular books in my dataset

Number of books read by a single reader in 2017 ranged from 1 to 190, with an average of 18.5 books. But the mean isn’t a good indicator here. As you can see in the plot below, this is a highly skewed distribution. Almost 28% (n=11) of my friends logged 1 book in 2017 (and this is the mode of the distribution); only 10% (n=4) read more than 50 books, and all but 1 person read fewer than 100 books. The median was 7 books.
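To see why the mean misleads for a skewed distribution like this one, here is a minimal sketch using Python's standard statistics module. The counts in the list are made up for illustration and are not the values from my dataset; they just mimic the same shape (lots of 1s, one very heavy reader).

```python
from statistics import mean, median, mode

# Hypothetical books-per-reader counts, invented for illustration only.
# These are NOT the values from the dataset described in this post.
books_per_reader = [1, 1, 1, 2, 3, 5, 7, 12, 40, 190]

print(mean(books_per_reader))    # 26.2, pulled upward by the one heavy reader
print(median(books_per_reader))  # 4.0, the middle of the distribution
print(mode(books_per_reader))    # 1, the most common value
```

A single extreme value moves the mean far above what a typical reader logged, while the median and mode barely notice it.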

The barplot is easier to read without this outlier:

For the most part, each reader was unique in the books he or she read: 94.5% (654 books) were unique to a single reader, and about 4.2% (29 books) were read by 2 readers in the dataset. That left 9 books (1.3%) read by between 3 and 6 people, which I display in the graphic above. As I mentioned in that previous reading post, the most popular book was The Handmaid’s Tale, read by 6 people. The remaining books were A Man Called Ove, with 4 people, and each of the following with 3 people: Dark Matter by Blake Crouch; Harry Potter and the Half-Blood Prince and Harry Potter and the Deathly Hallows, both by J.K. Rowling; Into the Water by Paula Hawkins (which won Best Mystery & Thriller in the Goodreads Awards); Thirteen Reasons Why by Jay Asher; Turtles All the Way Down by John Green (which ranked #20 in Amazon's Top 100 list); and Wonder by R.J. Palacio.
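As a quick arithmetic check on the overlap breakdown, the book counts reported above can be turned into percentages of the 692 books. This is just a sketch; the dictionary labels and the function name are my own, and the shares are rounded to one decimal place.

```python
# Book counts as reported in the post, grouped by how many readers logged each book.
books_by_reader_count = {
    "1 reader": 654,
    "2 readers": 29,
    "3 to 6 readers": 9,
}


def percent_shares(counts):
    """Return each group's share of the total, as a percentage rounded to 1 decimal."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


print(sum(books_by_reader_count.values()))  # 692
print(percent_shares(books_by_reader_count))
# {'1 reader': 94.5, '2 readers': 4.2, '3 to 6 readers': 1.3}
```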

True, my dataset probably won’t generalize beyond my friend group, but the popular books match up really well with Amazon’s This Year in Books analysis, which showed The Handmaid’s Tale was the most read fiction book.

The second most popular book on Amazon’s list, It, was read by 2 people in my dataset. Oh, and speaking of Stephen King, he was the most popular author in my data, contributing 13 books read by 8 readers.

The second most popular was Neil Gaiman, with 11 books across 5 readers.

And in fact, going back to my previously noted flaw, that I only analyzed first author, both of these popular authors had 1 book with a coauthor. Stephen King wrote Sleeping Beauties (winner of Best Horror in the Goodreads awards) with his son, Owen; it is included in Stephen King's graphic above because he's first author. And Neil Gaiman should have 1 additional book in his graphic: Good Omens: The Nice and Accurate Prophecies of Agnes Nutter, Witch. That book was cowritten with Terry Pratchett, who was first author and thus the only one who got "credit" for the book in my dataset. Adding that book would increase Neil's contribution to 12 books but would have no effect on his number of unique readers or his popularity rank.

But I should note that these two were most popular based on number of books + number of readers. If I only went off number of books in the dataset, they would just break the top 5. Based on sheer number of books, Erin Hunter was most popular with 22 books and Victoria Thompson was second with 19 books. Lee Child came in third with 14 books.
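The difference between ranking authors by number of books and by number of readers can be sketched as follows. The records here are invented placeholders, not rows from my dataset, and the layout (reader code, first author, title) is only an assumption about how such data might be stored.

```python
from collections import defaultdict

# Hypothetical (reader_code, first_author, title) records, invented for illustration.
records = [
    (1, "Author A", "Book 1"),
    (1, "Author A", "Book 2"),
    (1, "Author A", "Book 3"),
    (2, "Author B", "Book 4"),
    (3, "Author B", "Book 4"),
    (3, "Author B", "Book 5"),
    (4, "Author B", "Book 5"),
]


def author_summary(rows):
    """Map each author to (number of distinct books, number of distinct readers)."""
    books = defaultdict(set)
    readers = defaultdict(set)
    for reader, author, title in rows:
        books[author].add(title)
        readers[author].add(reader)
    return {author: (len(books[author]), len(readers[author])) for author in books}


summary = author_summary(records)
by_books = sorted(summary, key=lambda a: summary[a][0], reverse=True)
by_readers = sorted(summary, key=lambda a: summary[a][1], reverse=True)

print(summary)     # {'Author A': (3, 1), 'Author B': (2, 3)}
print(by_books)    # ['Author A', 'Author B']
print(by_readers)  # ['Author B', 'Author A']
```

One devoted reader is enough to put an author at the top of the book count, while the reader count rewards authors who show up on several people's shelves.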

The cool thing about those particular results? They came from individual readers. One person read those 22 Erin Hunter books, a different person read the 19 Victoria Thompson books, and a third person read the 14 Lee Child books. (In total, these 3 friends read 345, or 49.9%, of the books included in the dataset.) In fact, it was cool to see the fandom of my different Goodreads friends.

I'll present some more work from this dataset tomorrow, for Statistics Sunday, when I'll be demonstrating the boxplot. So stay tuned for more results from this dataset!

Saturday, January 13, 2018

More Reading Data Analysis - This Time with Friends
