Deeply Trivial: Statistics Sunday

Showing posts with label Statistics Sunday. Show all posts

Sunday, April 7, 2019

F is for Facets

So far this month, I've talked about the different things that affect the outcome of measurement - in that it determines how someone will respond. Those things so far would be item difficulty and person ability. How much ability a person has or how much of the trait they possess will affect how they respond to items of varying ability. Each of things that interacts to affect the outcome of Rasch measurement is called a "facet."

But these don't have to be the only facets in a Rasch analysis. You can have additional facets, making for a more complex model. In our content validation studies, we administer a job analysis survey asking people to rate different job-related tasks. As is becoming standard in this industry, we use two different scales for each item, one rating how frequently this task is performed (so we can place more weight on the more frequently performed items) and one rating how critical it is to perform this task competently to protect the public (so we can place more weight on the highly critical items). In this model, we can differentiate between the two scales and see how the scale used changes how people respond. This means that rating scale also becomes a facet, one with two levels: frequency scale and criticality scale.

When conducting a more complex model like this, we need software that can handle these complexities. The people who brought us Winsteps, the software I use primarily for Rasch analysis, also have a program called Facets, which can handle these more complex models.

In a previous blog post, I talked about a facets model I was working with, one with four facets: person ability, item difficulty, rating scale, and timing (whether they received frequency first or second). But one could use a facets model for other types of data, like judge rating data. The great thing about using facets to examine judge data is that one can also partial out concepts like judge leniency; that is, some judges go "easier" on people than others, and a facets models lets you model that leniency. You would just need to have your judges rate more than one person in the set and have some overlap with other judges, similar to the overlap I introduced in the equating post.

This is the thing I love about Rasch measurement, in that it is a unique approach to measurement that can expand in complexity to whatever measurement situation you're presented with. It's all based on the Rasch measurement model, a mathematical model that represents the characteristics a "good measure" should possess - that's what we'll talk about tomorrow when we examine global fit statistics!

Sunday, March 24, 2019

Statistics Sunday: Blogging A to Z Theme Reveal

I'm excited to announce this year's theme for the Blogging A to Z challenge:

I'll be writing through the alphabet of psychometrics with the Rasch Measurement Model approach. I've written a bit about Rasch previously. You can find those posts here:

A Great Minds in Statistics post about Georg Rasch
Factor Analysis for Psychometrics
Principal Components Analysis (a related approach frequently used in Rasch)
The Log Odds Ratio (a key concept that forms the basis of Rasch measurement)
Running Complex Models in Rasch
How I Became a Psychometrician
Dealing with Missing Data (which Rasch handles beautifully)
A previous A to Z post on Ordinal Variables

Looking forward to sharing these posts! First up is A for Ability on Monday, April 1!

Sunday, March 17, 2019

Statistics Sunday: Standardized Tests in Light of Public Scandal

No doubt, by now, you've heard about the large-scale investigation into college admissions scandals among the wealthy - a scandal that suggests SAT scores, among other things, can in essence be bought. Eliza Shapiro and Dana Goldstein of the NY Times ask if this scandal is "the last straw" for tests like the SAT.

To clarify in advance, I do not nor have I ever worked for the Educational Testing Service or for any organization involved in admissions testing. But as a psychometrician, I have a vested interest in this industry. And I became a psychometrician because of my philosophy: that many things, including ability, achievement, and college preparedness, can be objectively measured if certain procedures and methods are followed. If the methods and procedures are not followed properly in a particular case, the measurement in that case is invalid. That is what happens when a student (or more likely, their parent) pays someone else to take the SAT for them, or bribes a proctor, or finds an "expert" willing to sign off on a disability the student does not have to get extra accommodations.

But because that particular instance of measurement is invalid doesn't damn the entire field to invalidity. It just means we have to work harder. Better vetting of proctors, advances in testing like computerized adaptive testing and new item types... all of this is to help counteract outside variables that threaten the validity of measurement. And expansions in the field of data forensics now include examining anomalous patterns in testing, to identify if some form of dishonesty has taken place - allowing scores to be rescinded or otherwise declared invalid after the fact.

This is a field I feel strongly about, and as I said, really sums up my philosophy in life for the value of measurement. Today, I'm on my way to the Association of Test Publishers Innovations in Testing 2019 meeting in Orlando. I'm certain this recent scandal will be a frequent topic at the conference, and a rallying cry for better protection of exam material and better methods for identifying suspicious testing behavior. Public trust in our field is on the line. It is our job to regain that trust.

Monday, March 11, 2019

Statistics Sunday: Scatterplots and Correlations with ggpairs

As I conduct some analysis for a content validation study, I wanted to quickly blog about a fun plot I discovered today: ggpairs, which displays scatterplots and correlations in a grid for a set of variables.

To demonstrate, I'll return to my Facebook dataset, which I used for some of last year's R analysis demonstrations. You can find the dataset, a minicodebook, and code on importing into R here. Then use the code from this post to compute the following variables: RRS, CESD, Extraversion, Agree, Consc, EmoSt, Openness. These correspond to measures of rumination, depression, and the Big Five personality traits. We could easily request correlations for these 7 variables. But if I wanted scatterplots plus correlations for all 7, I can easily request it with ggpairs then listing out the columns from my dataset I want included on the plot:

library(ggplot2)
ggpairs(Facebook[,c(112,116,122:126)]

(Note: I also computed the 3 RRS subscales, which is why the column numbers above skip from 112 (RRS) to 116 (CESD). You might need to adjust the column numbers when you run the analysis yourself.)

The results look like this:

Since the grid is the number of variables squared, I wouldn't recommend this type of plot for a large number of variables.

Sunday, January 27, 2019

Statistics Sunday: Creating a Stacked Bar Chart for Rank Data

Stacked Bar Chart for Rank Data At work on Friday, I was trying to figure out the best way to display some rank data. What I had were rankings from 1-5 for 10 factors considered most important in a job (such as Salary, Insurance Benefits, and the Opportunity to Learn), meaning each respondent chose and ranked the top 5 from those 10, and the remaining 5 were unranked by that respondent. Without even thinking about the missing data issue, I computed a mean rank and called it a day. (Yes, I know that ranks are ordinal and means are for continuous data, but my goal was simply to differentiate importance of the factors and a mean seemed the best way to do it.) Of course, then we noticed one of the factors had a pretty high average rank, even though few people ranked it in the top 5. Oops.

So how could I present these results? One idea I had was a stacked bar chart, and it took a bit of data wrangling to do it. That is, the rankings were all in separate variables, but I want them all on the same chart. Basically, I needed to create a dataset with:

1 variable to represent the ranking given (1-5, or 6 that I called "Not Ranked")
1 variable to represent the number of people giving that particular rank that particular factor

What I ultimately did was run frequencies for the factor variables, turn those frequency tables into data frames, and merged them together with rbind. I then created chart with ggplot. Here's some code for a simplified example, which only uses 6 factors and asks people to rank the top 3.

First, let's read in our sample dataset - note that these data were generated only for this example and are not real data:

library(tidyverse)

## -- Attaching packages --------------------------------------------------------------------------------------------------------------------- tidyverse 1.2.1 --

## v ggplot2 3.0.0     v purrr   0.2.4
## v tibble  1.4.2     v dplyr   0.7.4
## v tidyr   0.8.0     v stringr 1.3.1
## v readr   1.1.1     v forcats 0.3.0

## Warning: package 'ggplot2' was built under R version 3.5.1

## -- Conflicts ------------------------------------------------------------------------------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

ranks <- read_csv("C:/Users/slocatelli/Desktop/sample_ranks.csv", col_names = TRUE)

## Parsed with column specification:
## cols(
##   RespID = col_integer(),
##   Salary = col_integer(),
##   Recognition = col_integer(),
##   PTO = col_integer(),
##   Insurance = col_integer(),
##   FlexibleHours = col_integer(),
##   OptoLearn = col_integer()
## )

This dataset contains 7 variables - 1 respondent ID and 6 variables with ranks on factors considered important in a job: salary, recognition from employer, paid time off, insurance benefits, flexible scheduling, and opportunity to learn. I want to run frequencies for these variables, and turn those frequency tables into a data frame I can use in ggplot2. I'm sure there are much cleaner ways to do this (and please share in the comments!), but here's one not so pretty way:

salary <- as.data.frame(table(ranks$Salary))
salary$Name <- "Salary"
recognition <- as.data.frame(table(ranks$Recognition))
recognition$Name <- "Recognition by \nEmployer"
PTO <- as.data.frame(table(ranks$PTO))
PTO$Name <- "Paid Time Off"
insurance <- as.data.frame(table(ranks$Insurance))
insurance$Name <- "Insurance"
flexible <- as.data.frame(table(ranks$FlexibleHours))
flexible$Name <- "Flexible Schedule"
learn <- as.data.frame(table(ranks$OptoLearn))
learn$Name <- "Opportunity to \nLearn"

rank_chart <- rbind(salary, recognition, PTO, insurance, flexible, learn)
rank_chart$Var1 <- as.numeric(rank_chart$Var1)

With my not-so-pretty data wrangling, the chart itself is actually pretty easy:

ggplot(rank_chart, aes(fill = Var1, y = Freq, x = Name)) +
  geom_bar(stat = "identity") +
  labs(title = "Ranking of Factors Most Important in a Job") +
  ylab("Frequency") +
  xlab("Job Factors") +
  scale_fill_continuous(name = "Ranking",
                      breaks = c(1:4),
                      labels = c("1","2","3","Not Ranked")) +
  theme_bw() +
  theme(plot.title=element_text(hjust=0.5))

Based on this chart, we can see the top factor is Salary. Insurance is slightly more important than paid time off, but these are definitely the top 2 and 3 factors. Recognition wasn't ranked by most people, but those who did considered it their #2 factor; ditto for flexible scheduling at #3. Opportunity to learn didn't make the top 3 for most respondents.

Sunday, November 25, 2018

Statistics Sunday: Introduction to Regular Expressions

In my last Statistics Sunday post, I briefly mentioned the concept of regular expressions, also known as regex (though note that in some contexts, these refer to different things - see here). A regular expression is a text string, which you ask your program to match. You can use this to look for files with a particular name or extension, or search a corpus of text for a specific word or word(s) that match a certain pattern. This concept is used in many programming languages, and R is no exception. In order to use a regular expression, you need a regular expression engine. Fortunately, R base and many R packages come with a regular expression engine, which you call up with a function. In R base, those functions are things like grep, grepl, regexpr, and so on.

Today, I want to talk about some of the syntax you use to describe different patterns of text characters, which can include letters, digits, white space, and punctuation, starting from the most simple regular expressions. Typing out a specific text string matches only that. In last week's example, I was working with files that all start with the word "exam." So if I had other files in that folder that didn't follow that naming convention, I could look specifically for those by simply typing exam into the pattern attribute of my list.files() function. These files follow exam with a date, formatted as YYYYMMDD, and a letter to delineate each file from that day (a to z). Pretty much all of our files end in a, but in the rare instances where they send two files for a day, that one would end in b. How can I write a regular expression that matches this pattern?

First, the laziest approach, using the * character (which is sometimes called a greedy operator), which tells the regular expression engine to look for any number of characters following a string. For that, I could use this pattern:

list.files(pattern = "exam*")

A less lazy approach that still uses the + greedy operator would be specify the numbers I know will be the same for all files, which would be the year, 2018. In fact, if all of my exam files from every year were in a single folder, I could use that to select just one year:

list.files(pattern = "exam2018*")

Let's say I want all files from October 2018. I could do this:

list.files(pattern = "exam201810*")

That might be as fancy as I need to get for my approach.

We can get more specific by putting information into brackets, such as [a-z] to tell the program to match something with a lowercase letter in that range or [aA] to tell the program to look for something with either lowercase or uppercase a. For example, what if I wanted to find every instance of the word "witch" in The Wizard of Oz? Sometimes, the characters refer to a specific witch (Good Witch, Bad Witch, and so on) or to the concept of being a witch (usually lowercase, such as "I am a good witch"). I could download The Wizard of Oz through Project Gutenberg (see this post for how to use the guternbergr package; the ID for The Wizard of Oz is 55), then run some text analysis to look for any instance of Witch or witch:

witch <- WoO[grep("[Ww]itch", WoO$text),]

Here's what the beginning of the witch dataset looks like:

> witch
# A tibble: 141 x 2
gutenberg_id text
<int> <chr>
1 55 " 12. The Search for the Wicked Witch"
2 55 " 23. Glinda The Good Witch Grants Dorothy's Wish"
3 55 We are so grateful to you for having killed the Wicked Witch of the
4 55 killed the Wicked Witch of the East? Dorothy was an innocent, harmless
5 55 "\"She was the Wicked Witch of the East, as I said,\" answered the little"
6 55 " where the Wicked Witch ruled.\""
7 55 When they saw the Witch of the East was dead the Munchkins sent a swift
8 55 "messenger to me, and I came at once. I am the Witch of the North.\""
9 55 "\"Oh, gracious!\" cried Dorothy. \"Are you a real witch?\""
10 55 "\"Yes, indeed,\" answered the little woman. \"But I am a good witch, and"
# ... with 131 more rows

I'm working on putting together a more detailed post (or posts) about regular expressions, including more complex examples and the components of a regular expression, so check back for that soon!

Sunday, November 18, 2018

Statistics Sunday: Reading and Creating a Data Frame with Multiple Text Files

First Statistics Sunday in far too long! It's going to be a short one, but it describes a great trick I learned recently while completing a time study for our exams at work.

To give a bit of background, this time study involves analzying time examinees spent on their exam and whether they were able to complete all items. We've done time studies in the past to select time allowed for each exam, but we revisit on a cycle to make certain the time allowed is still ample. All of our exams are computer-administered, and we receive daily downloads from our exam provider with data on all exams administered that day.

What that means is, to study a year's worth of exam data, I need to read in and analyze 365(ish - test centers are generally closed for holidays) text files. Fortunately, I found code that would read all files in a particular folder and bind them into a single data frame. First, I'll set the working directory to the location of those files, and create a list of all files in that directory:

setwd("Q:/ExamData/2018")
filelist <- list.files()

For the next part, I'll need the data.table library, which you'll want to install if you don't already have it:

library(data.table)
Exams2018 <- rbindlist(sapply(filelist, fread, simplify = FALSE), use.names = TRUE, idcol = "FileName")

Now I have a data frame with all exam data from 2018, and an additional column that identifies which file a particular case came from.

What if your working directory has more files than you want to read? You can still use this code, with some updates. For instance, if you want only the text files from the working directory, you could add a regular expression to the list.files() code to only look for files with ".txt" extension:

list.files(pattern = "\\.txt$")

If you're only working with a handful of files, you can also manually create the list to be used in the rbindlist function. Like this:

filelist <- c("file1.txt", "file2.txt", "file3.txt")

That's all for now! Hope everyone has a Happy Thanksgiving!

Sunday, October 21, 2018

Statistics Sunday: What Fast Food Can Tell Us About a Community and the World

Two statistical indices crossed my inbox in the last week, both of which use fast food restaurants to measure a concept indirectly.

First up, in the wake of recent hurricanes, is the Waffle House Index. As The Economist explains:

Waffle House, a breakfast chain from the American South, is better known for reliability than quality. All its restaurants stay open every hour of every day. After extreme weather, like floods, tornados and hurricanes, Waffle Houses are quick to reopen, even if they can only serve a limited menu. That makes them a remarkably reliable if informal barometer for weather damage.

The index was invented by Craig Fugate, a former director of the Federal Emergency Management Agency (FEMA) in 2004 after a spate of hurricanes battered America’s east coast. “If a Waffle House is closed because there’s a disaster, it’s bad. We call it red. If they’re open but have a limited menu, that’s yellow,” he explained to NPR, America’s public radio network. Fully functioning restaurants mean that the Waffle House Index is shining green.

Next is the Big Mac Index, created by The Economist:

The Big Mac index was invented by The Economist in 1986 as a lighthearted guide to whether currencies are at their “correct” level. It is based on the theory of purchasing-power parity (PPP), the notion that in the long run exchange rates should move towards the rate that would equalise the prices of an identical basket of goods and services (in this case, a burger) in any two countries.

You might remember a discussion of the "basket of goods" in my post on the Consumer Price Index. And in fact, the Big Mac Index, which started as a way "to make exchange-rate theory more digestible," it's since become a global standard and is used in multiple studies. Now you can use it too, because the data and methodology have been made available on GitHub. R users will be thrilled to know that the code is written in R, but you'll need to use a bit of Python to get at the Jupyter notebook they've put together. Fortunately, they've provided detailed information on installing and setting everything up.

Sunday, October 14, 2018

Statistics Sunday: Some Psychometric Tricks in R

Statistics Sunday: Some Psychometrics Tricks in R It's been a long time since I've posted a Statistics Sunday post! Now that I'm moved out of my apartment and into my house, I have a bit more time on my hands, but work has been quite busy. Today, I'm preparing for 2 upcoming standard-setting studies by drawing a sample of items from 2 of our exams. So I thought I'd share what I'm up to in order to pass on some of these new psychometric tricks I've learned to help me with this project.

Because I can't share data from our item banks, I'll generate a fake dataset to use in my demonstration. For the exams I'm using for my upcoming standard setting, I want to draw a large sample of items, stratified by both item difficulty (so that I have a range of items across the Rasch difficulties) and item domain (the topic from the exam outline that is assessed by that item). Let's pretend I have an exam with 3 domains, and a bank of 600 items. I can generate that data like this:

domain1 <- data.frame(domain = 1, b = sort(rnorm(200)))
domain2 <- data.frame(domain = 2, b = sort(rnorm(200)))
domain3 <- data.frame(domain = 3, b = sort(rnorm(200)))

The variable domain is the domain label, and b is the item difficulty. I decided to sort that variable within each dataset so I can easily see that it goes across a range of difficulties, both positive and negative.

head(domain1)

##   domain         b
## 1      1 -2.599194
## 2      1 -2.130286
## 3      1 -2.041127
## 4      1 -1.990036
## 5      1 -1.811251
## 6      1 -1.745899

tail(domain1)

##     domain        b
## 195      1 1.934733
## 196      1 1.953235
## 197      1 2.108284
## 198      1 2.357364
## 199      1 2.384353
## 200      1 2.699168

If I desire, I can easily combine these 3 datasets into 1:

item_difficulties <- rbind(domain1, domain2, domain3)

I can also easily visualize my item difficulties, by domain, as a group of histograms using ggplot2:

library(tidyverse)

item_difficulties %>%
  ggplot(aes(b)) +
  geom_histogram(show.legend = FALSE) +
  labs(x = "Item Difficulty", y = "Number of Items") +
  facet_wrap(~domain, ncol = 1, scales = "free") +
  theme_classic()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Now, let's say I want to draw 100 items from my item bank, and I want them to be stratified by difficulty and by domain. I'd like my sample to range across the potential item difficulties fairly equally, but I want my sample of items to be weighted by the percentages from the exam outline. That is, let's say I have an outline that says for each exam: 24% of items should come from domain 1, 48% from domain 2, and 28% from domain 3. So I want to draw 24 from domain1, 48 from domain2, and 28 from domain3. Drawing such a random sample is pretty easy, but I also want to make sure I get items that are very easy, very hard, and all the levels in between.

I'll be honest: I had trouble figuring out the best way to do this with a continuous variable. Instead, I decided to classify items by quartile, then drew an equal number of items from each quartile.

To categorize by quartile, I used the following code:

domain1 <- within(domain1, quartile <- as.integer(cut(b, quantile(b, probs = 0:4/4), include.lowest = TRUE)))

The code uses the quantile command, which you may remember from my post on quantile regression. The nice thing about using quantiles is that I can define that however I wish. So I didn't have to divide my items into quartiles (groups of 4); I could have divided them up into more or fewer groups as I saw fit. To aid in drawing samples across domains of varying percentages, I'd probably want to pick a quantile that is a common multiple of the domain percentages. In this case, I purposefully designed the outline so that 4 was a common multiple.

To draw my sample, I'll use the sampling library (which you'll want to install with install.packages("sampling") if you've never done so before), and the strata function.

library(sampling)
domain1_samp <- strata(domain1, "quartile", size = rep(6, 4), method = "srswor")

The resulting data frame has 4 variables - the quartile value (since that was used for stratification), the ID_unit (row number from the original dataset), probability of being selected (in this case equal, since I requested equally-sized strata), and stratum number. So I would want to merge my item difficulties into this dataset, as well as any identifiers I have so that I can pull the correct items. (For the time being, we'll just pretend row number is the identifier, though this is likely not the case for large item banks.)

domain1$ID_unit <- as.numeric(row.names(domain1))
domain1_samp <- domain1_samp %>%
  left_join(domain1, by = "ID_unit")
qplot(domain1_samp$b)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

For my upcoming study, my sampling technique is a bit more nuanced, but this gives a nice starting point and introduction to what I'm doing.

Sunday, September 9, 2018

Statistics Sunday: What is Standard Setting?

In a past post, I talked about content validation studies, a big part of my job. Today, I'm going to give a quick overview of standard setting, another big part of my job, and an important step in many testing applications.

In any kind of ability testing application, items will be written with identified correct and incorrect answers. This means you can generate overall scores for your examinees, whether the raw score is simply the number of correct answers or generated with some kind of item response theory/Rasch model. But what isn't necessarily obvious is how to use those scores to categorize candidates and, in credentialing and similar applications, who should pass and who should fail.

This is the purpose of standard setting: to identify cut scores for different categories, such as pass/fail, basic/proficient/advanced, and so on.

There are many different methods for conducting standard setting. Overall, approaches can be thought of as item-based or holistic/test-based.

For item-based methods, standard setting committee members go through each item and categorize it in some way (the precise way depends on which method is being used). For instance, they may categorize it as basic, proficient, or advanced, or they may generate the likelihood that a minimally qualified candidate (i.e., the person who should pass) would get it right.

For holistic/test-based methods, committee members make decisions about cut scores within the context of the whole test. Holistic/test-based methods still require review of the entire exam, but don't require individual judgments about each item. For instance, committee members may have a booklet containing all items in order of difficulty (based on pretest data), and place a bookmark at the item that reflects the transition from proficient to advanced or from fail to pass.

The importance of standard setting comes down to defensibility. In licensure, for instance, failing a test may mean being unable to work in one's field at all. For this reason, definitions of who should pass and who should fail (in terms of knowledge, skills, and abilities) should be very strong and clearly tied to exam scores. And licensure and credentialing organizations are frequently required to prove, in a court of law, that their standards are fair, rigorously derived, and meaningful.

For my friends and readers in academic settings, this step may seem unnecessary. After all, you can easily categorize students into A, B, C, D, and F with the percentage of items correct. But this is simply a standard (that is, the cut score for pass/fail is 60%), set at some point in the past, and applied through academia.

I'm currently working on a chapter on standard setting with my boss and a coworker. And for anyone wanting to learn more about standard setting, two great books are Cizek and Bunch's Standard Setting and Zieky, Perie, and Livingston's Cut Scores.

Monday, August 27, 2018

Statistics Sunday: Visualizing Regression

Statistics Sunday: Visualizing Regression I had some much needed downtime this weekend, after an exhausting week, along with some self-care - Saturday I had a one-hour deep tissue massage, which left me a little bruised but much more relaxed, and Sunday I spent a few hours in the salon chair having my color touched up, which left me much blonder. Which is why I'm a little late with my Statistics Sunday post, but today, I'm introducing another recently discovered r package: rpart. Short for "recursive partitioning," this package creates decision trees for classification, regression, and survival analyses. Today, I'm going to demonstrate using the rpart package for visualizing regression.

To demonstrate this technique, I'm using my 2017 reading dataset. A reader requested I make this dataset available, which I have done - you can download it here. This post describes the data in more detail, but the short description is that this dataset contains the 53 books I read last year, with information on book genre, page length, how long it took me to read it, and two ratings: my own rating and the average rating on Goodreads. Look for another dataset soon, containing my 2018 reading data; I made a goal of 60 books and I've already read 50, meaning lots more data than last year. I'm thinking of going for broke and bumping my reading goal up to 100, because apparently 60 books is not enough of a challenge for me now that I spend so much downtime reading.

First, I'll load my dataset, then conduct the basic linear model I demonstrated in the post linked above.

setwd("~/R")
library(tidyverse)

## Warning: Duplicated column names deduplicated: 'Author' => 'Author_1' [13]

## Parsed with column specification:
## cols(
##   .default = col_integer(),
##   Title = col_character(),
##   Author = col_character(),
##   G_Rating = col_double(),
##   Started = col_character(),
##   Finished = col_character()
## )

## See spec(...) for full column specifications.

colnames(books)[13] <- "Author_Gender"
myrating<-lm(My_Rating ~ Pages + Read_Time + Author_Gender + Fiction + Fantasy + Math_Stats + YA_Fic, data=books)
summary(myrating)

## 
## Call:
## lm(formula = My_Rating ~ Pages + Read_Time + Author_Gender + 
##     Fiction + Fantasy + Math_Stats + YA_Fic, data = books)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.73120 -0.34382 -0.00461  0.24665  1.49932 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    3.5861211  0.2683464  13.364   <2e-16 ***
## Pages          0.0019578  0.0007435   2.633   0.0116 *  
## Read_Time     -0.0244168  0.0204186  -1.196   0.2380    
## Author_Gender -0.1285178  0.1666207  -0.771   0.4445    
## Fiction        0.1052319  0.2202581   0.478   0.6351    
## Fantasy        0.5234710  0.2097386   2.496   0.0163 *  
## Math_Stats    -0.2558926  0.2122238  -1.206   0.2342    
## YA_Fic        -0.7330553  0.2684623  -2.731   0.0090 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4952 on 45 degrees of freedom
## Multiple R-squared:  0.4624, Adjusted R-squared:  0.3788 
## F-statistic:  5.53 on 7 and 45 DF,  p-value: 0.0001233

These analyses show that I give higher ratings for books that are longer and Fantasy genre, and lower ratings to books that are Young Adult Fiction. Now let's see what happens if I run this same linear model through rpart. Note that this is a slightly different technique, looking for cuts that differentiate outcomes, so it will find slightly different results.

library(rpart)
tree1 <- rpart(My_Rating ~ Pages + Read_Time + Author_Gender + Fiction + Fantasy + Math_Stats + YA_Fic, method = "anova", data=books)
printcp(tree1)

## 
## Regression tree:
## rpart(formula = My_Rating ~ Pages + Read_Time + Author_Gender + 
##     Fiction + Fantasy + Math_Stats + YA_Fic, data = books, method = "anova")
## 
## Variables actually used in tree construction:
## [1] Fantasy    Math_Stats Pages     
## 
## Root node error: 20.528/53 = 0.38733
## 
## n= 53 
## 
##         CP nsplit rel error  xerror    xstd
## 1 0.305836      0   1.00000 1.03609 0.17531
## 2 0.092743      1   0.69416 0.76907 0.12258
## 3 0.022539      2   0.60142 0.71698 0.11053
## 4 0.010000      3   0.57888 0.74908 0.11644

These results differ somewhat. Pages is still a significant variable, as is Fantasy. But now Math_Stats (indicating books that are about mathematics or statistics, one of my top genres from last year) also is. These are the variables used by the analysis to construct my regression tree. If we look at the full summary -

summary(tree1)

## Call:
## rpart(formula = My_Rating ~ Pages + Read_Time + Author_Gender + 
##     Fiction + Fantasy + Math_Stats + YA_Fic, data = books, method = "anova")
##   n= 53 
## 
##           CP nsplit rel error    xerror      xstd
## 1 0.30583640      0 1.0000000 1.0360856 0.1753070
## 2 0.09274251      1 0.6941636 0.7690729 0.1225752
## 3 0.02253938      2 0.6014211 0.7169813 0.1105294
## 4 0.01000000      3 0.5788817 0.7490758 0.1164386
## 
## Variable importance
##      Pages    Fantasy    Fiction     YA_Fic Math_Stats  Read_Time 
##         62         18          8          6          4          3 
## 
## Node number 1: 53 observations,    complexity param=0.3058364
##   mean=4.09434, MSE=0.3873265 
##   left son=2 (9 obs) right son=3 (44 obs)
##   Primary splits:
##       Pages         < 185 to the left,  improve=0.30583640, (0 missing)
##       Fiction       < 0.5 to the left,  improve=0.24974560, (0 missing)
##       Fantasy       < 0.5 to the left,  improve=0.20761810, (0 missing)
##       Math_Stats    < 0.5 to the right, improve=0.20371790, (0 missing)
##       Author_Gender < 0.5 to the right, improve=0.02705187, (0 missing)
## 
## Node number 2: 9 observations
##   mean=3.333333, MSE=0.2222222 
## 
## Node number 3: 44 observations,    complexity param=0.09274251
##   mean=4.25, MSE=0.2784091 
##   left son=6 (26 obs) right son=7 (18 obs)
##   Primary splits:
##       Fantasy    < 0.5 to the left,  improve=0.15541600, (0 missing)
##       Fiction    < 0.5 to the left,  improve=0.12827990, (0 missing)
##       Math_Stats < 0.5 to the right, improve=0.10487750, (0 missing)
##       Pages      < 391 to the left,  improve=0.05344995, (0 missing)
##       Read_Time  < 7.5 to the right, improve=0.04512078, (0 missing)
##   Surrogate splits:
##       Fiction   < 0.5 to the left,  agree=0.773, adj=0.444, (0 split)
##       YA_Fic    < 0.5 to the left,  agree=0.727, adj=0.333, (0 split)
##       Pages     < 370 to the left,  agree=0.682, adj=0.222, (0 split)
##       Read_Time < 3.5 to the right, agree=0.659, adj=0.167, (0 split)
## 
## Node number 6: 26 observations,    complexity param=0.02253938
##   mean=4.076923, MSE=0.2248521 
##   left son=12 (7 obs) right son=13 (19 obs)
##   Primary splits:
##       Math_Stats    < 0.5 to the right, improve=0.079145230, (0 missing)
##       Pages         < 364 to the left,  improve=0.042105260, (0 missing)
##       Fiction       < 0.5 to the left,  improve=0.042105260, (0 missing)
##       Read_Time     < 5.5 to the left,  improve=0.016447370, (0 missing)
##       Author_Gender < 0.5 to the right, improve=0.001480263, (0 missing)
## 
## Node number 7: 18 observations
##   mean=4.5, MSE=0.25 
## 
## Node number 12: 7 observations
##   mean=3.857143, MSE=0.4081633 
## 
## Node number 13: 19 observations
##   mean=4.157895, MSE=0.132964

we see that Fiction, YA_Fic, and Read_Time were also significant variables. The problem is that there is multicollinearity between Fiction, Fantasy, Math_Stats, and YA_Fic. All Fantasy and YA_Fic books are Fiction, while all Math_Stats books are not Fiction. And all YA_Fic books I read were Fantasy. This is probably why the tree didn't use Fiction or YA_Fic. I'm not completely clear on why Read_Time didn't end up in the regression tree, but it may be because my read time was pretty constant among the different splits and didn't add any new information to the tree. If I were presenting these results somewhere other than my blog, I'd probably want to do some follow-up analyses to confirm this fact.

Now the fun part: let's plot our regression tree:

plot(tree1, uniform = TRUE, main = "Regression Tree for My Goodreads Ratings")
text(tree1, use.n = TRUE, all = TRUE, cex = 0.8)

This tree shows that, before taking into account anything, my average book rating was 4.09. If a book is shorter than 185 pages (9 books in my dataset), it's average rating was 3.33. For longer books (44), the average rating was 4.25. But there's more to it than that. For non-Fantasy books (26), the average was 4.08, while the Fantasy (18 books) average was 4.5. If the book was Math_Stats (7), I gave it an average of 3.86, and if it was not Math_Stats (19), the average was 4.16. (Unfortunately, R cuts off the bottom part of the plot.)

While this plot is great for a quick visualization, I can make a nicer looking plot (which doesn't cut off the bottom text) as a PostScript file.

post(tree1, file = "mytree.ps",
        title = "Regression Tree for Rating")

I converted that to a PDF, which you can view here.

Hope you enjoyed this post! Have any readers used this technique before? Any thoughts or applications you'd like to share? (Self-promotion highly encouraged!)

Sunday, August 19, 2018

Statistics Sunday: Using Text Analysis to Become a Better Writer

Using Text Analysis to Become a Better Writer We all have words we love to use, and that we perhaps use too much. As an example: I have a tendency to use the same transitional statements, to the point that, before I submit a manuscript, I do a find all to see how many times I've used some of my favorites, e.g., additionally, though, and so on.

I'm sure we all have our own words we use way too often.

Text analysis can also be used to discover patterns in writing, and for a writer, may be helpful in discovering when we depend too much on certain words and phrases. For today's demonstration, I read in my (still in-progress) novel - a murder mystery called Killing Mr. Johnson - and did the same type of text analysis I've been demonstrating in recent posts.

To make things easier, I copied the document into a text file, and used the read_lines and tibble functions to prepare data for my analysis.

setwd("~/Dropbox/Writing/Killing Mr. Johnson")

library(tidyverse)

KMJ_text <- read_lines('KMJ_full.txt')

KMJ <- tibble(KMJ_text) %>%
  mutate(linenumber = row_number())

I kept my line numbers, which I could use in some future analysis. For now, I'm going to tokenize my data, drop stop words, and examine my most frequently used words.

library(tidytext)
KMJ_words <- KMJ %>%
  unnest_tokens(word, KMJ_text) %>%
  anti_join(stop_words)

## Joining, by = "word"

KMJ_words %>%
  count(word, sort = TRUE) %>%
  filter(n > 75) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() + xlab(NULL) + coord_flip()

Fortunately, my top 5 words are the names of the 5 main characters, with the star character at number 1: Emily is named almost 600 times in the book. It's a murder mystery, so I'm not too surprised that words like "body" and "death" are also common. But I know that, in my fiction writing, I often depend on a word type that draws a lot of disdain from authors I admire: adverbs. Not all adverbs, mind you, but specifically (pun intended) the "-ly adverbs."

ly_words <- KMJ_words %>%
  filter(str_detect(word, ".ly")) %>%
  count(word, sort = TRUE)

head(ly_words)

## # A tibble: 6 x 2
##   word         n
##   <chr>    <int>
## 1 emily      599
## 2 finally     80
## 3 quickly     60
## 4 emily’s     53
## 5 suddenly    39
## 6 quietly     38

Since my main character is named Emily, she was accidentally picked up by my string detect function. A few other top words also pop up in the list that aren't actually -ly adverbs. I'll filter those out then take a look at what I have left.

filter_out <- c("emily", "emily's", "emily’s","family", "reply", "holy")

ly_words <- ly_words %>%
  filter(!word %in% filter_out)

ly_words %>%
  filter(n > 10) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() + xlab(NULL) + coord_flip()

I use "finally", "quickly", and "suddenly" far too often. "Quietly" is also up there. I think the reason so many writers hate on adverbs is because it can encourage lazy writing. You might write that someone said something quietly or softly, but is there a better word? Did they whisper? Mutter? Murmur? Hiss? Did someone "move quickly" or did they do something else - run, sprint, dash?

At the same time, sometimes adverbs are necessary. I mean, can I think of a complete sentence that only includes an adverb? Definitely. Still, it might become tedious if I keep depending on the same words multiple times, and when a fiction book (or really any kind of writing) is tedious, we often give up. These results give me some things to think about as I edit.

Still have some big plans on the horizon, including some new statistics videos, a redesigned blog, and more surprises later! Thanks for reading!

Sunday, August 12, 2018

Statistics Sunday: Getting Started with the Russian Tweet Dataset

IRA Tweet Data You may have heard that two researchers at Clemson University analyzed almost 3 millions tweets from the Internet Research Agency (IRA) - a "Russian troll factory". In partnership with FiveThirtyEight, they made all of their data available on GitHub. So of course, I had to read the files into R, which I was able to do with this code:

files <- c("IRAhandle_tweets_1.csv",
           "IRAhandle_tweets_2.csv",
           "IRAhandle_tweets_3.csv",
           "IRAhandle_tweets_4.csv",
           "IRAhandle_tweets_5.csv",
           "IRAhandle_tweets_6.csv",
           "IRAhandle_tweets_7.csv",
           "IRAhandle_tweets_8.csv",
           "IRAhandle_tweets_9.csv")
my_files <- paste0("~/Downloads/russian-troll-tweets-master/",files)

each_file <- function(file) {
  tweet <- read_csv(file) }

library(tidyverse)

tweet_data <- NULL
for (file in my_files) {
  temp <- each_file(file)
  temp$id <- sub(".csv", "", file)
  tweet_data <- rbind(tweet_data, temp)
}

Note that this is a large file, with 2,973,371 observations of 16 variables. Let's do some cleaning of this dataset first. The researchers, Darren Linvill and Patrick Warren, identified 5 majors types of trolls:

Right Troll: These Trump-supporting trolls voiced right-leaning, populist messages, but “rarely broadcast traditionally important Republican themes, such as taxes, abortion, and regulation, but often sent divisive messages about mainstream and moderate Republicans…They routinely denigrated the Democratic Party, e.g. @LeroyLovesUSA, January 20, 2017, “#ThanksObama We're FINALLY evicting Obama. Now Donald Trump will bring back jobs for the lazy ass Obamacare recipients,” the authors wrote.

Left Troll: These trolls mainly supported Bernie Sanders, derided mainstream Democrats, and focused heavily on racial identity, in addition to sexual and religious identity. The tweets were “clearly trying to divide the Democratic Party and lower voter turnout,” the authors told FiveThirtyEight.

News Feed: A bit more mysterious, news feed trolls mostly posed as local news aggregators who linked to legitimate news sources. Some, however, “tweeted about global issues, often with a pro-Russia perspective.”

Hashtag Gamer: Gamer trolls used hashtag games—a popular call/response form of tweeting—to drum up interaction from other users. Some tweets were benign, but many “were overtly political, e.g. @LoraGreeen, July 11, 2015, “#WasteAMillionIn3Words Donate to #Hillary.”

Fearmonger: These trolls, who were least prevalent in the dataset, spread completely fake news stories, for instance “that salmonella-contaminated turkeys were produced by Koch Foods, a U.S. poultry producer, near the 2015 Thanksgiving holiday.”

But a quick table of the results of the variable, account_category, shows 8 in the dataset.

table(tweet_data$account_category)

## 
##   Commercial   Fearmonger HashtagGamer    LeftTroll     NewsFeed 
##       122582        11140       241827       427811       599294 
##   NonEnglish   RightTroll      Unknown 
##       837725       719087        13905

The additional three are Commercial, Non-English, and Unknown. At the very least, we should drop the Non-English tweets, since those use Russian characters and any analysis I do will assume data are in English. I'm also going to keep only a few key variables. Then I'm going to clean up this dataset to remove links, because I don't need those for my analysis - I certainly wouldn't want to follow them to their destination. If I want to free up some memory, I can then remove the large dataset.

reduced <- tweet_data %>%
  select(author,content,publish_date,account_category) %>%
  filter(account_category != "NonEnglish")

library(qdapRegex)

## 
## Attaching package: 'qdapRegex'

reduced$content <- rm_url(reduced$content)

rm(tweet_data)

Now we have a dataset of 2,135,646 observations of 4 variables. I'm planning on doing some analysis on my own of this dataset - and will of course share what I find - but for now, I thought I'd repeat a technique I've covered on this blog and demonstrate a new one.

library(tidytext)

tweetwords <- reduced %>%
  unnest_tokens(word, content) %>%
  anti_join(stop_words)

## Joining, by = "word"

wordcounts <- tweetwords %>%
  count(account_category, word, sort = TRUE) %>%
  ungroup()

head(wordcounts)

## # A tibble: 6 x 3
##   account_category word          n
##   <chr>            <chr>     <int>
## 1 NewsFeed         news     124586
## 2 RightTroll       trump     95794
## 3 RightTroll       rt        86970
## 4 NewsFeed         sports    47793
## 5 Commercial       workout   42395
## 6 NewsFeed         politics  38204

First, I'll conduct a TF-IDF analysis of the dataset. This code is a repeat from a previous post.

tweet_tfidf <- wordcounts %>%
  bind_tf_idf(word, account_category, n) %>%
  arrange(desc(tf_idf))

tweet_tfidf %>%
  mutate(word = factor(word, levels = rev(unique(word)))) %>%
  group_by(account_category) %>%
  top_n(15) %>%
  ungroup() %>%
  ggplot(aes(word, tf_idf, fill = account_category)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~account_category, ncol = 2, scales = "free") +
  coord_flip()

## Selecting by tf_idf

But another method of examining terms and topics in a set of documents is Latent Dirichlet Allocation (LDA), which can be conducted using the R package, topicmodels. The only issue is that LDA requires a document term matrix. But we can easily convert our wordcounts dataset into a DTM with the cast_dtm function from tidytext. Then we run our LDA with topicmodels. Note that LDA is a random technique, so we set a random number seed, and we specify how many topics we want the LDA to extract (k). Since there are 6 account types (plus 1 unknown), I'm going to try having it extract 6 topics. We can see how well they line up with the account types.

tweets_dtm <- wordcounts %>%
  cast_dtm(account_category, word, n)

library(topicmodels)
tweets_lda <- LDA(tweets_dtm, k = 6, control = list(seed = 42))
tweet_topics <- tidy(tweets_lda, matrix = "beta")

Now we can pull out the top terms from this analysis, and plot them to see how they lined up.

top_terms <- tweet_topics %>%
  group_by(topic) %>%
  top_n(15, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)

top_terms %>%
  mutate(term = reorder(term, beta)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~topic, scales = "free") +
  coord_flip()

Based on these plots, I'd say the topics line up very well with the account categories, showing, in order: news feed, left troll, fear monger, right troll, hash gamer, and commercial. One interesting observation, though, is that Trump is a top term in 5 of the 6 topics.