To demonstrate this technique, I'm using my 2017 reading dataset. A reader requested I make this dataset available, which I have done - you can download it here. This post describes the data in more detail, but the short description is that this dataset contains the 53 books I read last year, with information on book genre, page length, how long it took me to read it, and two ratings: my own rating and the average rating on Goodreads. Look for another dataset soon, containing my 2018 reading data; I made a goal of 60 books and I've already read 50, meaning lots more data than last year. I'm thinking of going for broke and bumping my reading goal up to 100, because apparently 60 books is not enough of a challenge for me now that I spend so much downtime reading.

First, I'll load my dataset, then conduct the basic linear model I demonstrated in the post linked above.

setwd("~/R") library(tidyverse)

## Warning: Duplicated column names deduplicated: 'Author' => 'Author_1' [13]

colnames(books)[13] <- "Author_Gender" myrating<-lm(My_Rating ~ Pages + Read_Time + Author_Gender + Fiction + Fantasy + Math_Stats + YA_Fic, data=books) summary(myrating)

## ## Call: ## lm(formula = My_Rating ~ Pages + Read_Time + Author_Gender + ## Fiction + Fantasy + Math_Stats + YA_Fic, data = books) ## ## Residuals: ## Min 1Q Median 3Q Max ## -0.73120 -0.34382 -0.00461 0.24665 1.49932 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 3.5861211 0.2683464 13.364 <2e-16 *** ## Pages 0.0019578 0.0007435 2.633 0.0116 * ## Read_Time -0.0244168 0.0204186 -1.196 0.2380 ## Author_Gender -0.1285178 0.1666207 -0.771 0.4445 ## Fiction 0.1052319 0.2202581 0.478 0.6351 ## Fantasy 0.5234710 0.2097386 2.496 0.0163 * ## Math_Stats -0.2558926 0.2122238 -1.206 0.2342 ## YA_Fic -0.7330553 0.2684623 -2.731 0.0090 ** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 0.4952 on 45 degrees of freedom ## Multiple R-squared: 0.4624, Adjusted R-squared: 0.3788 ## F-statistic: 5.53 on 7 and 45 DF, p-value: 0.0001233

These analyses show that I give higher ratings for books that are longer and Fantasy genre, and lower ratings to books that are Young Adult Fiction. Now let's see what happens if I run this same linear model through rpart. Note that this is a slightly different technique, looking for cuts that differentiate outcomes, so it will find slightly different results.

library(rpart) tree1 <- rpart(My_Rating ~ Pages + Read_Time + Author_Gender + Fiction + Fantasy + Math_Stats + YA_Fic, method = "anova", data=books) printcp(tree1)

## ## Regression tree: ## rpart(formula = My_Rating ~ Pages + Read_Time + Author_Gender + ## Fiction + Fantasy + Math_Stats + YA_Fic, data = books, method = "anova") ## ## Variables actually used in tree construction: ## [1] Fantasy Math_Stats Pages ## ## Root node error: 20.528/53 = 0.38733 ## ## n= 53 ## ## CP nsplit rel error xerror xstd ## 1 0.305836 0 1.00000 1.03609 0.17531 ## 2 0.092743 1 0.69416 0.76907 0.12258 ## 3 0.022539 2 0.60142 0.71698 0.11053 ## 4 0.010000 3 0.57888 0.74908 0.11644

These results differ somewhat. Pages is still a significant variable, as is Fantasy. But now Math_Stats (indicating books that are about mathematics or statistics, one of my top genres from last year) also is. These are the variables used by the analysis to construct my regression tree. If we look at the full summary -

summary(tree1)

## Call: ## rpart(formula = My_Rating ~ Pages + Read_Time + Author_Gender + ## Fiction + Fantasy + Math_Stats + YA_Fic, data = books, method = "anova") ## n= 53 ## ## CP nsplit rel error xerror xstd ## 1 0.30583640 0 1.0000000 1.0360856 0.1753070 ## 2 0.09274251 1 0.6941636 0.7690729 0.1225752 ## 3 0.02253938 2 0.6014211 0.7169813 0.1105294 ## 4 0.01000000 3 0.5788817 0.7490758 0.1164386 ## ## Variable importance ## Pages Fantasy Fiction YA_Fic Math_Stats Read_Time ## 62 18 8 6 4 3 ## ## Node number 1: 53 observations, complexity param=0.3058364 ## mean=4.09434, MSE=0.3873265 ## left son=2 (9 obs) right son=3 (44 obs) ## Primary splits: ## Pages < 185 to the left, improve=0.30583640, (0 missing) ## Fiction < 0.5 to the left, improve=0.24974560, (0 missing) ## Fantasy < 0.5 to the left, improve=0.20761810, (0 missing) ## Math_Stats < 0.5 to the right, improve=0.20371790, (0 missing) ## Author_Gender < 0.5 to the right, improve=0.02705187, (0 missing) ## ## Node number 2: 9 observations ## mean=3.333333, MSE=0.2222222 ## ## Node number 3: 44 observations, complexity param=0.09274251 ## mean=4.25, MSE=0.2784091 ## left son=6 (26 obs) right son=7 (18 obs) ## Primary splits: ## Fantasy < 0.5 to the left, improve=0.15541600, (0 missing) ## Fiction < 0.5 to the left, improve=0.12827990, (0 missing) ## Math_Stats < 0.5 to the right, improve=0.10487750, (0 missing) ## Pages < 391 to the left, improve=0.05344995, (0 missing) ## Read_Time < 7.5 to the right, improve=0.04512078, (0 missing) ## Surrogate splits: ## Fiction < 0.5 to the left, agree=0.773, adj=0.444, (0 split) ## YA_Fic < 0.5 to the left, agree=0.727, adj=0.333, (0 split) ## Pages < 370 to the left, agree=0.682, adj=0.222, (0 split) ## Read_Time < 3.5 to the right, agree=0.659, adj=0.167, (0 split) ## ## Node number 6: 26 observations, complexity param=0.02253938 ## mean=4.076923, MSE=0.2248521 ## left son=12 (7 obs) right son=13 (19 obs) ## Primary splits: ## Math_Stats < 0.5 to the right, improve=0.079145230, (0 missing) ## Pages < 364 to the left, improve=0.042105260, (0 missing) ## Fiction < 0.5 to the left, improve=0.042105260, (0 missing) ## Read_Time < 5.5 to the left, improve=0.016447370, (0 missing) ## Author_Gender < 0.5 to the right, improve=0.001480263, (0 missing) ## ## Node number 7: 18 observations ## mean=4.5, MSE=0.25 ## ## Node number 12: 7 observations ## mean=3.857143, MSE=0.4081633 ## ## Node number 13: 19 observations ## mean=4.157895, MSE=0.132964

we see that Fiction, YA_Fic, and Read_Time were also significant variables. The problem is that there is multicollinearity between Fiction, Fantasy, Math_Stats, and YA_Fic. All Fantasy and YA_Fic books are Fiction, while all Math_Stats books are not Fiction. And all YA_Fic books I read were Fantasy. This is probably why the tree didn't use Fiction or YA_Fic. I'm not completely clear on why Read_Time didn't end up in the regression tree, but it may be because my read time was pretty constant among the different splits and didn't add any new information to the tree. If I were presenting these results somewhere other than my blog, I'd probably want to do some follow-up analyses to confirm this fact.

Now the fun part: let's plot our regression tree:

While this plot is great for a quick visualization, I can make a nicer looking plot (which doesn't cut off the bottom text) as a PostScript file.

post(tree1, file = "mytree.ps", title = "Regression Tree for Rating")

I converted that to a PDF, which you can view here.

Hope you enjoyed this post! Have any readers used this technique before? Any thoughts or applications you'd like to share? (Self-promotion highly encouraged!)

I'm retired, and I'm not able to rack up the number of "books read" that you do (and I'm including audiobooks I listen to as work out). Good on you!

ReplyDeleteFrancis Gilbert, PhD (ret.)

As long as you're doing something fun! Reading is one of my favorite activities, so I like to make a lot of time for it. It helps that I commute by train to work every day, and read during lunch, so that's about 1.5 hours of guaranteed reading time each day. I didn't read nearly as much when I had to drive to work. I'm a pretty fast reader too. Hope you're enjoying retirement! Thanks for reading!

ReplyDeleteThat's really nice.

ReplyDeleteI was trying to get back to a reading routine so I set a 30 minute reading per day goal. I've already read 5 books since I started this almost two months ago. I am also gathering a few informations such as reading time, pages read, book title, day of the reading, any observation.

Your analysis gave me an idea on how I will use this data I'm gathering.

Good job!