Monday, July 30, 2018

Feeling Honored

Awesome news first thing on a Monday! My paper, "Effect of the environment on participation in spinal cord injuries/disorders: The mediating impact of resilience, grief, and self-efficacy," published in Rehabilitation Psychology last year, was awarded the Harold Yuker Award for Research Excellence:


This paper was the result of a huge survey, overseen by my post-doc mentor, Sherri LaVela, and worked on by many of my amazing VA colleagues. The paper itself uses latent variable path analysis to examine how resilience, grief, and self-efficacy among individuals with spinal cord injuries/disorders mediate the effect of environmental barriers on ability to participate. The message of the paper was that, while improving environmental barriers is key to increasing participation, we can also impact participation by intervening to increase resilience and self-efficacy and to decrease feelings of grief/loss over the injury/disorder, even when we can't directly intervene to improve all of the environmental barriers.
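For anyone curious what this kind of mediation path model looks like in code, here's a generic sketch in R using the lavaan package. To be clear, the variable names and single-indicator structure below are hypothetical stand-ins for illustration, not the paper's actual latent variable model.

library(lavaan)

# Hypothetical mediation path model (illustrative names, not the paper's model):
# environmental barriers -> resilience, self-efficacy, grief -> participation
model <- '
  # predictor to mediators
  resilience    ~ a1 * barriers
  self_efficacy ~ a2 * barriers
  grief         ~ a3 * barriers

  # mediators (plus the direct path) to the outcome
  participation ~ b1 * resilience + b2 * self_efficacy + b3 * grief + c * barriers

  # indirect and total effects
  indirect := a1*b1 + a2*b2 + a3*b3
  total    := c + indirect
'

# fit <- sem(model, data = survey_data)  # survey_data is a placeholder data frame
# summary(fit, standardized = TRUE, fit.measures = TRUE)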

So cool to get recognition that one of my favorite and most personally meaningful papers was also viewed as meaningful and important to experts in the field.

Sunday, July 29, 2018

Statistics Sunday: More Text Analysis - Term Frequency and Inverse Document Frequency

As a mixed methods researcher, I love working with qualitative data, but I also love the idea of using quantitative methods to add some meaning and context to the words. This is the main reason I've started digging into using R for text mining, and these skills have paid off not only in fun blog posts about Taylor Swift, Lorde, and "Hotel California", but also in analyzing data for my job (blog post about that coming soon). So today, I thought I'd keep moving forward to other tools you can use in text analysis: term frequency and inverse document frequency.

These tools are useful when you have multiple documents you're analyzing, such as interview text from different people or books by the same author. For my demonstration today, I'll be using (what else?) song lyrics, this time from Florence + the Machine (one of my all-time favorites), who just dropped a new album, High as Hope. So let's get started by pulling in those lyrics.

library(geniusR)

# Pull the album's lyrics from Genius
high_as_hope <- genius_album(artist = "Florence the Machine", album = "High as Hope")
## Joining, by = c("track_title", "track_n", "track_url")
library(tidyverse)
library(tidytext)
tidy_hope <- high_as_hope %>%
  unnest_tokens(word, lyric) %>%   # split each lyric line into one word per row
  anti_join(stop_words)            # drop common filler words ("the", "a", etc.)
## Joining, by = "word"
head(tidy_hope)
## # A tibble: 6 x 4
##   track_title track_n  line word   
##   <chr>         <int> <int> <chr>  
## 1 June              1     1 started
## 2 June              1     1 crack  
## 3 June              1     2 woke   
## 4 June              1     2 chicago
## 5 June              1     2 sky    
## 6 June              1     2 black

Now we have a tidy dataset with stop words removed. Before we go any further, let's talk about the tools we're going to apply. Often, when we analyze text, we want to discover what different documents are about - what are their topics or themes? One way to do that is to look at common words used in a document, which can tell us something about the document's theme. An overall measure of how often a term comes up in a particular document is term frequency (TF).

Removing stop words is an important step before looking at TF, because otherwise the highest-frequency words wouldn't be very meaningful - they'd be words that fill every sentence, like "the" or "a." But there still might be many common words that don't get weeded out by our stop word anti-join, and it's often the less frequently used words that tell us something about the meaning of a document. This is where inverse document frequency (IDF) comes in: it takes into account how common a word is across a set of documents, giving higher weight to words that appear in few of the documents and lower weight to words that appear in many of them. This means a word that shows up in only one or two songs will have a higher IDF than a word that appears in every song.
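If it helps to see the bookkeeping spelled out, here's a quick sketch of how TF and IDF are typically calculated, using a tiny pair of made-up "documents." (tidytext's bind_tf_idf, which we'll use below, uses this natural-log version of IDF.)

library(tidyverse)
library(tidytext)

# Two tiny made-up "documents" just to illustrate the definitions
toy <- tribble(
  ~doc, ~word,     ~n,
  "A",  "love",     5,
  "A",  "hunger",   3,
  "B",  "love",     4,
  "B",  "thunder",  2
)

n_docs <- n_distinct(toy$doc)

toy %>%
  group_by(doc) %>%
  mutate(tf = n / sum(n)) %>%                  # term frequency: share of the document's words
  group_by(word) %>%
  mutate(idf = log(n_docs / n_distinct(doc)),  # words found in fewer documents get a higher idf
         tf_idf = tf * idf) %>%                # "love" appears in both documents, so its idf (and tf-idf) is 0
  ungroup()

# tidytext does the same bookkeeping for us:
toy %>% bind_tf_idf(word, doc, n)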

We can use these two values together by multiplying them to form TF-IDF, which tells us the frequency of a term in a document, adjusted for how common it is across the set of documents. And thanks to the tidytext package, these values can be calculated for us automatically with the bind_tf_idf function. First, we need to reformat our data a bit by counting use of each word by song. We do this by referencing the track_title variable in our count function, which tells R to group by this variable, followed by what we want R to count (the variable called word).

song_words <- tidy_hope %>%
  count(track_title, word, sort = TRUE) %>%  # word counts within each track
  ungroup()

The bind_tf_idf function needs 3 arguments: word (or whatever we called the variable containing our words), the document indicator (in this case, track_title), and the word counts by document (n).

song_words <- song_words %>%
  bind_tf_idf(word, track_title, n) %>%  # adds tf, idf, and tf_idf columns
  arrange(desc(tf_idf))                  # most distinctive words first

head(song_words)
## # A tibble: 6 x 6
##   track_title     word          n    tf   idf tf_idf
##   <chr>           <chr>     <int> <dbl> <dbl>  <dbl>
## 1 Hunger          hunger       25 0.236  2.30  0.543
## 2 Grace           grace        16 0.216  2.30  0.498
## 3 The End of Love wash         18 0.209  2.30  0.482
## 4 Hunger          ooh          20 0.189  2.30  0.434
## 5 Patricia        wonderful    10 0.125  2.30  0.288
## 6 100 Years       hundred      12 0.106  2.30  0.245

Some of the results are unsurprising - "hunger" is far more common in the track called "Hunger" than any other track, "grace" is more common in "Grace", and "hundred" is more common in "100 Years". But let's explore the different words by plotting the highest tf-idf for each track. To keep the plot from getting ridiculously large, I'll just ask for the top 5 for each of the 10 tracks.

song_words %>%
  mutate(word = factor(word, levels = rev(unique(word)))) %>%  # keep words ordered by tf-idf in the plot
  group_by(track_title) %>%
  top_n(5) %>%   # top 5 words per track by tf_idf (ties are all kept)
  ungroup() %>%
  ggplot(aes(word, tf_idf, fill = track_title)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~track_title, ncol = 2, scales = "free") +
  coord_flip()
## Selecting by tf_idf


Some tracks have more than 5 words listed because of ties, but this plot helps us look for commonalities and differences across the tracks. There is a strong religious theme, with concepts like "pray", "god", "grace", and "angel" coming up in many of the tracks. The song "Patricia" uses many positively-valenced words like "wonderful" and "believer". "No Choir" references music-themed words. And "Sky Full of Song" references things that fly (like "arrow") and things in the sky (like "thunder").

What does Florence Welch have to say about the meaning behind this album?
There is loneliness in this record, and there's issues, and pain, and things that I struggled with, but the overriding feeling is that I have hope about them, and that's what kinda brought me to this title; I was gonna call it The End of Love, which I actually saw as a positive thing cause it was the end of a needy kind of love, it was the end of a love that comes from a place of lack, it's about a love that's bigger and broader, that takes so much explaining. It could sound a bit negative but I didn't really think of it that way.

She's also mentioned that High as Hope is the first album she made sober, so her past struggles with addiction are certainly also a theme of the album. And many reviews of the album (like this one) talk about the meaning and stories behind the music. This information can provide some context to the TF-IDF results.

Saturday, July 28, 2018

New Twitter to Follow

If you enjoy Flash Fiction as much as I do, you'll definitely want to check out the Micro Flash Fiction Twitter account. Here's a taste:

Someone liked this one so much, they illustrated it:

Friday, July 27, 2018

Landing on Lake Shore Drive

Interesting bit of news for the day - a small plane had to make an emergency landing on Lake Shore Drive:

Insert your favorite joke about being on LSD here.
No details at the moment, but word is that no one was injured.

Happy Friday, everyone!

Wednesday, July 25, 2018

Blogging Break

You may have noticed I haven't been blogging as much recently. Though in some ways I'm busier than I've been in a while, I've still had a lot of downtime, but sadly not as much inspiration to write on my blog. I've got a few stats side projects in the works, but nothing at a point I can blog about yet, and I'm having difficulty with some of the code for projects I've been planning to write about. Hopefully I'll have something soon and will get back to posting weekly Statistics Sunday posts.

Here's what's going on with me currently:

  • I had my first conference call with my company's Research Advisory Committee last night, a committee I imagine I'll inherit as my own now that I'm Director of Research
  • I submitted my first novel query to an agent earlier today and received a confirmation email that she got it
  • I've been reading a ton and apparently am 5 books ahead of schedule on my Goodreads reading challenge: 38 books so far this year
  • The research center I used to work for was not renewed, so they'll be shutting their doors in 14 months; I'm sad for my colleagues
  • Today is my work anniversary: I've been at my current job 1 year! My boss emailed me about it this morning, along with this picture:

Wednesday, July 11, 2018

In Remembrance of Gene

Great Demonstration of Bayes' Theorem

Bayes' theorem is an excellent tool, once you wrap your head around it. In fact, it was discovered sort of by accident - one that actually horrified Bayes (and others) - and inverse probability remained highly controversial even into the 20th century, to the point that many statisticians eschewed it and, when they did use it, did so in secret.

I've blogged about and applied it several times. And I highly recommend a couple of books to learn more about Bayes' theorem: The Theory That Wouldn't Die and Doing Bayesian Data Analysis. But this morning, I read an excellent demonstration of Bayes' theorem - what is the probability the post's author is asleep, given that her bedroom light is on?
I have more than 2 months of data from my Garmin Vivosmart watch showing when I fall asleep and wake up. In a previous post, I figured out the probability I am asleep at a given time using Markov Chain Monte Carlo (MCMC) methods.

This is the probability I am asleep taking into account only the time. What if we know the time and have additional evidence? How would knowing that my bedroom light is on change the probability that I am asleep?

We will walk through applying the equation for a time of 10:30 PM if we know my light is on. First, we calculate the prior probability I am asleep using the time and get an answer of 73.90%. The prior provides a good starting point for our estimate, but we can improve it by incorporating info about my light. Knowing that my light is on, we can fill in Bayes’ Equation with the relevant numbers.

The knowledge that my light is on drastically changes our estimate of the probability I am asleep from over 70% to 3.42%. This shows the power of Bayes’ Rule: we were able to update our initial estimate for the situation by incorporating more information. While we might have intuitively done this anyway, thinking about it in terms of formal equations allows us to update our beliefs in a rigorous manner.
She's shared her code for these calculations in a Jupyter Notebook, which you can check out here.
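Just to make the arithmetic concrete, here's a minimal sketch of that update in R. The prior is the value quoted above; the two likelihoods are illustrative numbers I've plugged in (they reproduce the quoted result, but the post derives its own values from the light data).

prior_asleep <- 0.739          # P(asleep at 10:30 PM), from the time-only model quoted above
p_light_given_asleep <- 0.01   # assumed: P(light on | asleep)
p_light_given_awake  <- 0.80   # assumed: P(light on | awake)

# Bayes' rule: P(asleep | light on)
posterior_asleep <- (p_light_given_asleep * prior_asleep) /
  (p_light_given_asleep * prior_asleep +
     p_light_given_awake * (1 - prior_asleep))

round(posterior_asleep, 4)
## [1] 0.0342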

Sunday, July 8, 2018

Statistics Sunday: Mixed Effects Meta-Analysis

As promised, how to conduct mixed effects meta-analysis in R:


Code used in the video is available here. And I'd recommend my earlier meta-analysis posts to provide background for this video.
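If you'd like something to tinker with before watching, here's a bare-bones sketch of a mixed effects (meta-regression) model using the metafor package; the data below are simulated just so the code runs, and the video walks through the real details.

library(metafor)

# Simulated toy data: study effect sizes (yi), their sampling variances (vi),
# and a study-level moderator (publication year)
set.seed(42)
k <- 20
dat <- data.frame(
  yi   = rnorm(k, mean = 0.3, sd = 0.2),
  vi   = runif(k, min = 0.01, max = 0.05),
  year = sample(2000:2018, k, replace = TRUE)
)

# Mixed effects model: random effects capture between-study heterogeneity,
# and year enters as a fixed-effect moderator
res <- rma(yi, vi, mods = ~ year, data = dat, method = "REML")
summary(res)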

Friday, July 6, 2018

Lots Going On

So much going on right now that I haven't had much time for blogging.
  • I'm almost completely transitioned from my old department, Exam Development, to my new one, Research, of which I am Director and currently the only employee. But my first direct report will be coming on soon! We also have a newly hired Director of Exam Development, and we've already started chatting about some Research-Exam Development collaborative projects.
  • I'm preparing for multiple content validation studies, including one for a brand new certification we'll be offering. So I've been reviewing blueprints for similar exams, curriculum for related programs, and job descriptions from across the US to help build potential topics for the exam, to be reviewed by our expert panel.
  • I've been participating in Camp NaNoWriMo, which happens in April and July, and allows you to set whatever word/page/time/etc. goals you'd like for your manuscript or project. My goal is to finish the novel I wrote for 2016 NaNoWriMo, so I'm spending most of the month editing as well as writing toward a goal of 8,000 additional words.
  • Also related to my book, I got some feedback from an agent that I need to play up the mystery aspect of my book, and try to think of some comparative titles and/or authors - that is, "If you liked X, you'll like my book." So in addition to doing a lot of writing, I'm doing a lot of reading, looking for some good comparative works. I've asked a few friends to read some chapters and see if they come up with something as well.
  • I started recording the promised Mixed-Effects Meta-Analysis video earlier this week, but when I listened to what I recorded, you can clearly hear my neighbors shooting off fireworks in the background. So I need to re-record that part and record the rest. Hopefully this weekend.
Bonus writing-related pic, just for fun: I found a book title generator online, and this is the result I got for Mystery:


That's all for now! Hopefully back to regular blogging soon!