Wednesday, October 18, 2017

Statistical Sins: Know Your Variables (A Confession)

We all have the potential to be a statistical sinner; I definitely have been on more than one occasion. This morning, I was thinking about a sin I committed about a year ago at Houghton Mifflin Harcourt. So this is a confessional post.

We were working on a large language survey, involving 8 tests, one of which was new. This is an individually-administered battery of tests, meaning a trained individual gives the test one-on-one to the examinee. Questions are read aloud and the examinee responds either verbally or in writing. Each test only has one set of questions, and is adaptive: the set of questions the examinee receives depends on their pattern of correct answers. If they get the first few questions right, they go on to harder questions, but if they get the first few wrong, they go back in the book to easier questions. The test ends when the examinee gets a certain number incorrect in a row or reaches the end of the book (whichever comes first).

When giving the test, the administrator won't always start at the beginning of the book. Those are the easiest questions, reserved for the youngest/lowest ability test-takers. Each test has recommended starting places, usually based on age, but the administrator is encouraged to use his or her knowledge of the examinee (these tests are often administered by school psychologists, who may have some idea of the examinee's ability) to determine a starting point.

We had one brand new test and needed to generate starting points, since we couldn't borrow them from a previous revision of the battery. Because this new test was strongly related to another test in the battery, we decided to generate recommended starting points based on the examinee's raw score on that other test. We knew we would need a regression-based technique, but otherwise, I was given complete control over this set of analyses.

After generating some scatterplots, I found the data followed a pretty standard growth curve, specifically a logistic growth curve:


So standard linear regression would not work because of the curve. Ordinarily, we would deal with this by adding polynomial terms (squared, cubed, and so on) to the regression to capture the curve.

But the data violated another assumption of regression, even polynomial regression: the variance was not equal (or even approximately equal) across the curve. There was substantially more variation in some parts of the curve than others. In statistical terms, we call this heteroscedasticity. I did some research and found a solution: quantile regression. It's a really cool technique that is pretty easy to pick up if you already understand regression. Essentially, quantile regression fits a separate intercept (constant) and slope for each percentile of the outcome you specify, and you can set those percentiles at whatever values you would like. And quantile regression makes no assumption of equal variance, so heteroscedasticity isn't a problem. I read some articles, learned how to do the analysis in R (using the quantreg package), and away I went.
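For anyone curious what that looks like in practice, here's a minimal sketch using the quantreg package. The data are simulated just to stand in for the real pilot data (a curved, heteroscedastic relationship between two test scores), and the variable names are made up; the actual analysis involved more quantiles and a lot more checking than this:

library(quantreg)

set.seed(42)
# Simulated pilot data: a logistic-shaped relationship with unequal variance
other_test_raw <- round(runif(300, 0, 60))
new_test_raw   <- 20 / (1 + exp(-0.15 * (other_test_raw - 30))) +
  rnorm(300, sd = 0.5 + other_test_raw / 20)
pilot <- data.frame(other_test_raw, new_test_raw)

# Fit separate intercepts and slopes at the 25th, 50th, and 75th percentiles
fit <- rq(new_test_raw ~ other_test_raw, tau = c(0.25, 0.50, 0.75), data = pilot)
summary(fit)

# For comparison, ordinary least squares fits only the conditional mean
ols <- lm(new_test_raw ~ other_test_raw, data = pilot)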

I was so proud of myself.

We decided to use raw score instead of scale score for the starting points. These tests were developed with the Rasch measurement model, but the test administrator would only get an approximate scale score from the tables in the book. Final scores, which are conversions of Rasch logits, are generated by a scoring program used after administering all tests. Since the administrator is obtaining raw scores as he or she goes (you have to know right away whether a person responded correctly to determine which question to ask next), raw score would be readily available and the most logical metric for administrators. I had my Winsteps output, which gave person ID, raw score, Rasch ability, and some other indicators (like standard error of measurement) for each person in our pilot sample. So I imported those outputs from the two tests, matched on ID, and ran my analysis.
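For the record, the matching step itself was the easy part. Here's roughly what it looks like in R, assuming the Winsteps person output for each test has been exported to CSV first; the file and column names below are hypothetical stand-ins:

test_a <- read.csv("test_a_person_output.csv")   # person output from the new test
test_b <- read.csv("test_b_person_output.csv")   # person output from the related test

matched <- merge(test_a, test_b, by = "ID", suffixes = c("_a", "_b"))

# The step I should have lingered on: eyeball the merged data before modeling
plot(matched$RAW_SCORE_a, matched$RAW_SCORE_b)
cor(matched$RAW_SCORE_a, matched$RAW_SCORE_b, use = "complete.obs")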

I stress once again: I used the Winsteps person output to obtain my raw scores.

My data were a mess. There seemed to be no relationship between scores on the two tests. I went back a step, generating frequencies and correlations. I presented the results to the team and we talked about how this could have happened. Was there something wrong with the test? With the sample? Were we working with the wrong data?

I don't know who figured it out first, but it was not me. Someone asked, "Where did the raw scores come from?" And it hit me.

Winsteps generates raw scores based on the number of items a person answered correctly. Only the questions answered and no others. But for adaptive tests, we don't administer all questions. We only administer the set needed to determine a person's ability. We don't give them easy questions because they don't tell us much about ability. We know the person will get most, if not all, easy questions correct. So when the administrator generates raw scores, he or she adds in points for the easy questions not administered. Winsteps doesn't do that. It simply counts and adds.
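To make the difference concrete, here's a toy example in R - made-up item numbers and responses, and none of the operational scoring rules - showing how far apart the two "raw scores" can be:

# An examinee starts at item 21 of a hypothetical 60-item adaptive test
items_administered <- 21:35
responses <- c(1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0)   # 1 = correct, 0 = incorrect

winsteps_raw <- sum(responses)                                  # counts only administered items: 8
admin_raw    <- sum(responses) + (min(items_administered) - 1)  # credits the 20 skipped easy items: 28

winsteps_raw
admin_raw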

There was no relationship between the two variables because I wasn't using the correct raw score variable. I had a column called raw score and just went on autopilot.

So I had a couple days of feeling super proud of myself for figuring out quantile regression... and at least that long feeling like an idiot for running the analysis without really digging into my data. The lack of relationship between the two tests should have been a dead giveaway that there was something wrong with my data. And problems with data are often caused by human error.


Monday, October 16, 2017

Preparing for NaNoWriMo

I'm preparing once again to write 50,000 words in November, as part of National Novel Writing Month (or NaNoWriMo). October is affectionately known as "Preptober" - it's the month when NaNoWriMo participants go through whatever planning they need to do to win (i.e., hit 50,000 words).

As I've blogged before, I'm a plantser: I like having some freedom to see where the story takes me, but I need a roadmap or I get overwhelmed and don't finish. This makes me a sort of hybrid of the plot-driven writers, like Orson Scott Card, and the character-driven writers, like Stephen King.

Speaking of Stephen King, as part of Preptober, I've been reading advice on writing, and yesterday I finished Stephen King's book on the topic:


It was nice to learn more about his approach, because last year, I really didn't think his approach was for me. I'd tried just sitting down and writing, seeing where I ended up, and that has worked reasonably well for me on short stories, but for something as long as a novel, I get blocked.

What Stephen King does is he comes up with a situation, which may also include where he thinks things will end up when he's finished writing. Then he comes up with the characters and develops them as they encounter the situation. And that's when he lets things just... unfold. The characters tell him where things go.

This may sound crazy to a non-writer: the characters tell him... They're creations of the author, so how could they tell the author what to write? It's a very strange thing when you're writing, and you create characters that take on a life of their own. I've had this experience several times when I was writing Killing Mr. Johnson, the book I worked on for last year's NaNoWriMo (a book I still need to finish). In fact, I was thinking about the story the other day, and trying to understand a character's motivation for doing something in the story. She reacted in a strange way to the chain of events, and as I was thinking about her, I realized why - or rather, she told me why. And it all made sense. I also have a couple of characters who are demanding more attention, so I plan on writing a few more scenes for them.

For this year's NaNoWriMo, I'll be working on a new idea. And I'm going to try taking the Stephen King approach, at least in part. I already know approximately how things are going to end up, and I've been working on developing the characters. In November, I'm going to try just sitting down and writing. We'll see how it goes.

Sunday, October 15, 2017

Statistics Sunday: These Are a Few of My Favorite Posts

You know how sit-coms would occasionally have a flashback episode? There would be some sort of situation where the main characters are stuck somewhere (like an elevator), and as they wait to get out, they reminisce about different things that happened over the last season. You got to review the highlights, and the writers (and actors) got a break.

That's what today is: here are some of my favorite posts from the course of my statistics writing - posts people seemed to enjoy or that I had a lot of fun writing.

Statistical Sins: Handling and Understanding Criticism - I'm really enjoying blogging about the Fisher, Pearson, and Neyman feuds. In fact, in line with the new Edison vs. Westinghouse movie, The Current War, I'm thinking of writing my own dramatic account of these feuds. I mean, if they can make alternating current versus direct current seem exciting, just wait until you watch the scene where Neyman escapes Fisher's criticism because Fisher can't speak a word of French. I just need to figure out who Benedict Cumberbatch should play.

Statistics Sunday: Why Is It Called Data Science? - This post generated so much discussion. It's exactly what I've wanted to see from my blog posts: thoughtful discussion in response. Even vehement disagreement is awesome. My professor's heart grew three sizes that day.

Statistical Sins: Three Things We Love - And the greatest of these is still bacon.

Statistics Sunday: What Are Degrees of Freedom? (Part 2) - My favorite type of post to write; one where I learn something by going through the act of explaining it to others.

Statistical Sins: Women in Tech (Here It Goes Again) - In which I rip apart the Google memo, written by a guy who clearly doesn't remember (know anything about) the long history of women in programming and mathematics. Seriously, didn't he at least watch Hidden Figures?

Statistics Sunday: Everyone Loves a Log (Odds Ratio) - Which helped set the stage for a post about Rasch.

Statistics Sunday: No Really, What's Bayes' Got to Do With It? - When I first encountered Bayes' Theorem, I had some trouble wrapping my head around it. So I did the same thing as I did for degrees of freedom: I made myself sit down and write about it. And I finally understand it. Tversky and Kahneman would be so proud.

Statistics Sunday: Null and Alternative Hypotheses - Philosophy of science is one of my favorite topics to pontificate about. It's even more fun for me than debating semantics... and I love debating semantics.

Great Minds in Statistics: F.N. David versus the Patriarchy - Ooo, another movie idea. I very nearly called this post F.N. David versus the Mother F***ing Patriarchy, but decided against it.

That's all for today! This afternoon, I'll be performing Carmina Burana with the Chicago Philharmonic and my choir, the Apollo Chorus of Chicago. And heading out of town tomorrow.

Also, I'm horribly unoriginal: I did this once before. And of course, you can dig through my April 2017 Blogging A to Z Challenge, in which I wrote about the A to Z of statistics.

I'm working on some new statistics Sunday posts. What topics would you like to see here?

Friday, October 13, 2017

Statistical Sins: Hidden Trends and the Downfall of Sears

Without going into too much detail, it's been a rough week, so I haven't really been blogging much. But today, I made myself sit down and start thinking about what statistical sin to talk about this week. There were many potential topics. In fact, the recent events in Las Vegas have resulted in a great deal of writing about various analyses, trying to determine whether or not shootings are preventable - specifically by assessing what impact various gun laws have had on the occurrence of these events. Obviously this is a difficult thing to study statistically, because these types of shootings are still, in the relative sense, rare, and a classical statistics approach is unlikely to uncover many potential predictors with so little variance to partition. (A Bayesian approach would probably be better.) I may write more on this in the future, but I'll admit I don't have the bandwidth at the moment to deal with such a heavy subject.

So instead, I'll write about a topic my friend over at The Daily Parker also covers pretty frequently: the slow death of Sears. You see, Sears made some really great moves by examining the statistics, but also made some really bad moves by failing to look at the statistics and use the data-driven approaches that allowed its competitors to thrive.

The recession following World War I, combined with an increase in chain stores, threatened Sears's mail order business. It was General Robert Wood, who was incredibly knowledgeable about the U.S. Census and Statistical Abstract, who put Sears back on track by urging the company to open brick-and-mortar stores. By the mid-20th century, Sears's revenue accounted for 1 percent of U.S. GDP.

But then the market shifted again in the 1970s and 80s, and the decisions Sears made at this time paved the way for its downfall, at least according to Daniel Raff and Peter Temin. As Derek Thompson of CityLab summarizes their insightful essay:
Eager to become America’s largest brokerage, and perhaps even America’s largest community bank, Sears bought the real-estate company Coldwell Banker and the brokerage firm Dean Witter. It was a weird marriage. As the financial companies thrived nationally, their Sears locations suffered from the start. Buying car parts and then insuring them against future damage makes sense. But buying a four-speed washer-dryer and then celebrating with an in-store purchase of some junk bonds? No, that did not make sense.

But the problem with the Coldwell Banker and Dean Witter acquisitions wasn’t that they flopped. It was that their off-site locations didn’t flop—instead, their moderate success disguised the deterioration of Sears’s core business at a time when several competitors were starting to gain footholds in the market.
The largest competitor was Walmart, which not only offered cheap goods but also used data-driven approaches to ensure shelves were stocked with the products most likely to sell and that inventory was determined entirely by those figures. Sears, meanwhile, was asking local managers to report trends back to headquarters.

As I commented almost exactly 6 years ago (what are the odds of that?), using "big data" to understand the customer is becoming the norm. Businesses unwilling to do this are not going to last.

Sears, sadly, is not going to survive. But Derek Thompson has some advice for Amazon, which he considers today's counterpart for yesterday's Sears.
First, retail is in a state of perpetual metamorphosis. People are constantly seeking more convenient ways of buying stuff, and they are surprisingly willing to embrace new modes of shopping. As a result, companies can’t rely on a strong Lindy Effect in retail, where past success predicts future earnings. They have to be in a state of constant learning.

Second, even large technological advantages for retailers are fleeting. Sears was the logistics king of the middle of the 20th century. But by the 1980s and 1990s, it was woefully behind the IT systems that made Walmart cheaper and more efficient. Today, Amazon now finds itself in a race with Walmart and smaller online-first retailers. Amazon shows few signs of technological complacency, but the company is still only in its early 20s; Sears was overtaken after it had been around for about a century.

Third, there is no strategic replacement for being obsessed with people and their behavior. Walmart didn’t overtake Sears merely because its technology was more sophisticated; it beat Sears because its technology allowed the company to respond more quickly to shifting consumer demands, even at a store-by-store level. When General Robert Wood made the determination to add brick-and-mortar stores to Sears’s mail-order business, his decision wasn’t driven by the pursuit of grandeur, but rather by an obsession with statistics that showed Americans migrating into cities and suburbs inside the next big shopping technology—cars.

Finally, adding more businesses is not the same as building a better business. When Sears added general merchandise to watches, it thrived. When it added cars and even mobile homes to its famous catalogue, it thrived. When it sold auto insurance along with its car parts, it thrived. But then it chased after the 1980s Wall Street boom by absorbing real-estate and brokerage firms. These acquisitions weren’t flops. Far worse, they were ostensibly successful mergers that took management’s eye off the bigger issue: Walmart was crushing Sears in its core business. Amazon should be wary of letting its expansive ambitions distract from its core mission—to serve busy shoppers with unrivaled choice, price, and delivery speed.
 

Monday, October 9, 2017

Complex Models and Control Files: From the Desk of a Psychometrician

We're getting ready to send out a job analysis survey, part of our content validation study. In the meantime, I'm working on preparing control files to analyze the data when we get it back. I won't be running the analysis for a couple weeks, but the model I'll be using is complex enough (in part because I added in some nerdy research questions to help determine best practices for these types of surveys) that I decided to start thinking about it now.

I realize there's a lot of information to unpack in that first paragraph. Without going into too much detail, here's a bit of background. We analyze survey data using the Rasch model. This model assumes that an individual's response to an item is a function of his/her ability level and the difficulty level of the item itself. For this kind of measure, where we're asking people to rate items on a scale, we're not measuring ability; rather, we're measuring a trait - an individual's proclivity toward a job task. In this arrangement, items are not difficult/easy but more common/less common, or more important/less important, and so on. The analysis gives us probabilities that people at different ability (trait) levels will respond to an item in a certain way:


It's common for job analysis surveys to use multiple rating scales on the same set of items, such as having respondents go through and rate items on how frequently they perform them, and then go through again and rate how important it is to complete a task correctly. For this kind of model, we use a Rasch Facets model. A facet is something that affects responses to an item. Technically, any Rasch model is a facets model; in a basic Rasch model, there are two facets: respondents (and their ability/trait level) and items. When you're using multiple rating scales, scale is a facet.

And because I'm a nerd, I decided to add another facet: rating scale order. The reason we have people rate with one scale, then go through and rate with the second (instead of seeing both at once), is so that people are less likely to anchor responses on one scale to responses on another scale. That is, if I rate an item as very frequent, I might also view it as more important when viewing both scales than I would have had I used the scales in isolation. But I wonder if there still might be some anchoring effects. So I decided to counterbalance. Half of respondents will get one scale first, and the other half will get the other scale first. I can analyze this facet to see if it affected responses.

This means we have 4 facets, right? Items, respondents, rating scale, and order. Well, here's the more complex part. We have two different versions of the frequency scale: one for tasks that happen a lot (which assesses daily frequency) and one for less common tasks (which assesses weekly/monthly frequency). All items use the same importance scale. The two frequency scales have the same number of categories, but because we may need to collapse categories during the analysis phase, it's possible that we'll end up with two very different scales. So I need to specify that, for the frequency scale, half of the items share one common response structure and the other half share the other, while for the importance scale, all items share a single common response structure.
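Just to make the logic of the model concrete, here's my own simplified sketch in R - not Facets syntax and not the actual control file. Each facet contributes a term on the logit scale, and the rating scale structure adds a set of thresholds; all the numbers are made up:

# Category probabilities under a simplified many-facet rating scale model.
# person, item, and order are measures on the logit scale; taus are the
# thresholds between adjacent rating scale categories.
category_probs <- function(person, item, order, taus) {
  logit_terms <- person - item - order - taus
  numerators  <- c(1, exp(cumsum(logit_terms)))   # category 0 gets exp(0) = 1
  numerators / sum(numerators)
}

# An average respondent, a fairly "uncommon" task, a small effect of seeing
# this rating scale second, and three thresholds (so a four-category scale)
category_probs(person = 0.5, item = 1.0, order = 0.2, taus = c(-1.5, 0.0, 1.5))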

I'm working on figuring out how to express that in the control file, which is a text file used by Rasch software to describe all the aspects of the model and analysis. It's similar to any syntax file for statistics software: there's a specific format needed for the program to read the file and run the specified analysis. I've spent the morning digging through help files and articles, and I think I'm getting closer to having a complete control file that should run the analysis I need.

The Voice of the CTA

Whenever I get into a speaking elevator, or follow the sound of "Track 8" to find my way to my Metra train home, I wonder about the person behind the voice. Today, a friend shared a great video of Lee Crooks, the voice behind the Chicago Transit Authority:


Now I'll have a face to picture as I'm crowding onto a train to the tune of "Doors closing."

Sunday, October 8, 2017

Statistics Sunday: Why Is It Called 'Data Science'?

In one of the Facebook groups where I share my statistics posts, a user had an excellent question: "Why is it called data science? Isn't any science that uses empirical data 'data science'?"

I thought this was a really good point. Calling this one field data science implies that other scientific fields using data are not doing so scientifically or rigorously. And even data scientists recognize that there's a fair amount of 'art' involved in data science, because there isn't always a right way to do something - there are simply ways that are more justified than others. In fact, I just started working through this book on that very subject:

What I've learned digging into this field of data science, in the hopes of one day calling myself a data scientist, is that statistics is an integral part of the field. Further, data science is a team sport - it isn't necessary (and it may even be impossible) to be an expert in all the areas of data science: statistics, programming, and domain knowledge. As someone with expertise in statistics, I'm likely better off building additional knowledge in statistical analysis used in data science, like machine learning, and building up enough coding knowledge to be able to understand my data science collaborators with expertise in programming.

But that still doesn't answer our main question: why is it called data science? I think what it comes down to is that data science involves teaching (programming) computers to do things that once had to be done by a person. Statistics as a field has been around much longer than computers (and I mean the objects called computers, not the people who were once known as computers). In fact, statistics has been around even prior to mechanical calculators. Many statistical approaches didn't really need calculators or computers. It took a while, but you could still do it by hand. All that was needed was to know the math behind it. And that is how we teach computers - as long as we know the math behind it, we can teach a computer to do just about anything.

First, we were able to teach computers to do simple statistical analyses: descriptives and basic inferential statistics. A person can do this, of course; a computer can just do it faster. We kept building up new statistical approaches and teaching computers to do those analyses for us - complex linear models, structural equation models, psychometric approaches, and so on.

Then, we were able to teach computers to learn from relationships between words and phrases. Whereas before we needed a person to learn the "art" of naming things, we developed the math behind it and taught it to computers. Now we have approaches like machine learning, where you can feed in information to the computer (like names of paint shades or motivational slogans) and have the computer learn how to generate that material itself. Sure, the results of these undertakings are still hilarious and a long way away from replacing people, but as we continue to develop the math behind this approach, computers will get better.

Related to this concept (and big thanks to a reader for pointing this out) is the movement from working with structured data to unstructured data. Once again, we needed a person to enter/reformat data so we could work with it; that's not necessary anymore.

So we've moved from teaching computers to work with numbers to words (really any unstructured data). And now, we've also taught computers to work with images. Once again, you previously needed a person to go through pictures and tag them descriptively; today, a computer can do that. And as with machine learning, computers are only going to get better and more nuanced in their ability to work with images.

Once we know the math behind it, we can teach a computer to work with basically any kind of data. In fact, during the conference I attended, I learned about some places that are working with auditory data, to get computers to recognize (and even translate in real-time) human languages. These were all tasks that needed a human, because we didn't know how to teach the computers to do it for us. That's what data science is about. It still might not be a great name for the field, but I can understand where that name is coming from.

What are your thoughts on data science? Is there a better name we could use to describe it? And what do you think will be the next big achievement in data science?

Friday, October 6, 2017

Reading Challenges and Nobel Prizes

This year, I decided to double last year's reading challenge goal on Goodreads. I've challenged myself to read 48 books this year. I'm doing really well!


This morning, I started Never Let Me Go by Kazuo Ishiguro, which was highly recommended by a friend. Yesterday, that same friend let me know that Kazuo Ishiguro is being awarded the Nobel Prize in Literature:
Mr. Ishiguro, 62, is best known for his novels “The Remains of the Day,” about a butler serving an English lord in the years leading up to World War II, and “Never Let Me Go,” a melancholy dystopian love story set in a British boarding school. He has obsessively returned to the same themes in his work, including the fallibility of memory, mortality and the porous nature of time. His body of work stands out for his inventive subversion of literary genres, his acute sense of place and his masterly parsing of the British class system.

“If you mix Jane Austen and Franz Kafka then you have Kazuo Ishiguro in a nutshell, but you have to add a little bit of Marcel Proust into the mix,” said Sara Danius, the permanent secretary of the Swedish Academy.

At a news conference at his London publisher’s office on Thursday, Mr. Ishiguro was characteristically self-effacing, saying that the award was a genuine shock. “If I had even a suspicion, I would have washed my hair this morning,” he said.

He added that when he thinks of “all the great writers living at this time who haven’t won this prize, I feel slightly like an impostor.”
BTW, I just added a Goodreads widget to my blog to show what I'm currently reading.

Thursday, October 5, 2017

This is Pretty Grool

In the movie Mean Girls, Aaron (the love interest) asks Cady (the heroine) what day it is, and she responds October 3rd. Hence, October 3rd was dubbed "Mean Girls Day," and people celebrate by posting Mean Girls memes, watching the movie, and probably wearing pink.

This year, 4 members of the cast released this video, asking fans to help victims of the Las Vegas shooting. Here it is:

Wednesday, October 4, 2017

Statistical Sins: Stepwise Regression

This evening, I started wondering: what do other statisticians think are statistical sins? So I'm perusing message boards on a sleepless Tuesday night/Wednesday morning, and I've found one thing that pops up again and again: stepwise regression.

No stairway. Denied.
Why? Stepwise regression is an analysis process in which predictors are added to or removed from a regression equation based on whether or not they are significant. There are, then, two basic types of stepwise regression: forward and backward.

In either analysis, you would generally choose your predictors ahead of time. But then, there's nothing that says you can't include far more predictors than you should (that is, more than the data can support), or predictors that have no business being in a particular regression equation.

In forward stepwise regression, the program selects the variable among your identified predictors that is most highly related to the outcome variable. Then it adds the next most highly correlated predictor. It keeps doing this until additional predictors result in no significant improvement of the model (significant improvement being determined by the change in R²).

In backward stepwise regression, the program includes all of your predictor variables, then begins removing variables with the smallest effect on the outcome variable. It stops when removing a variable results in a significant decrease in explained variance.
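For the curious, base R will happily do this for you with step(), which uses AIC rather than a significance test on the change in R², but the spirit is the same. Here's a quick simulated demonstration, where only x1 and x2 actually relate to the outcome:

set.seed(123)
n <- 200
dat <- data.frame(matrix(rnorm(n * 10), n, 10))
names(dat) <- paste0("x", 1:10)
dat$y <- 0.5 * dat$x1 - 0.4 * dat$x2 + rnorm(n)

full  <- lm(y ~ ., data = dat)
empty <- lm(y ~ 1, data = dat)

forward  <- step(empty, scope = formula(full), direction = "forward")   # adds predictors
backward <- step(full, direction = "backward")                          # removes predictors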

As you can probably guess, this analysis approach is rife with the potential for false positives and chance relationships. Many of the message boards said, rightly, that there is basically no situation where this approach is justified. It isn't even good exploratory data analysis; it's just lazy.

But is there a way this analysis technique could be salvaged? Possibly, if one took a page from the exploratory data analysis playbook and first plotted data, examined potential confounds and alternative explanations for relationships between variables, then made an informed choice about the variables to include in the analysis.

And, most importantly, the analyst should have a way of testing a stepwise regression procedure in another sample, to verify the findings. Let's be honest; to use a technique like this one, where you can add in any number of predictors, you should have a reasonably large sample size or else you should find a better statistic. Therefore, you could randomly split your sample into a development sample, where you determine best models, and a testing sample, where you confirm the models created through the development sample. This approach is often used in data science.
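Here's a minimal sketch of that split, continuing the simulated dat from the block above (or any data frame with an outcome named y):

set.seed(456)
dev_rows <- sample(nrow(dat), size = 0.7 * nrow(dat))
dev_set  <- dat[dev_rows, ]
test_set <- dat[-dev_rows, ]

# Select a model using the development sample only...
dev_model <- step(lm(y ~ ., data = dev_set), direction = "backward", trace = 0)

# ...then see whether it holds up in the held-out sample
preds <- predict(dev_model, newdata = test_set)
cor(preds, test_set$y)^2   # out-of-sample R-squared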

BTW, I've had some online conversations with people about the term data science and I've had the chance to really think about what it is and what it means. Look for more on that in my next Statistics Sunday post!

What do you think are the biggest statistical sins?

Tuesday, October 3, 2017

Free Tools for Meta-Analysis

My boss is attending a two-day course on meta-analysis, and shared these tools with me, available through the Brown University School of Public Health:
  • The Systematic Review Data Repository - as the name suggests, this is a repository of systematic review data, so you can pull out data relevant to your own systematic review as well as contribute your own data for others to use. Systematic reviews are a lot of work, so a tool that lets you build off of the work of others can help systematic reviews be performed (and their findings disseminated and used to make data-driven decisions) much more quickly.
  • Abstrackr - a free, open-source tool for the citation screening process. Conducting a systematic review or meta-analysis involves an exhaustive literature review, and those citations then have to be inspected to see if they qualify to be included in the study. It isn't unusual to review 100s of studies only to include a couple dozen (or fewer). This tool lets you upload abstracts, and invite reviewers to examine abstracts for inclusion. This tool is still in beta, but they're incorporating machine learning to automate some of the screening process in the future. Plus, they use "automagically" in the description, which is one of my favorite portmanteaus.
  • Open Meta-Analyst - another free, open-source tool for conducting meta-analysis. You can work with different types of data (binary, continuous, diagnostic), conduct fixed- or random-effects models, and even use different estimation methods, like maximum likelihood or Bayesian. 
  • Open MEE - a free, open-source tool based on Open Meta-Analyst, with extra tools for ecological and evolutionary meta-analysis. This might be the tool to use in general, because it has the ability to conduct meta-regression with multiple covariates. 
Of all of these, I think I'm most looking forward to trying out Abstrackr.


And of course, there are many great meta-analysis packages for R. I'm currently working on a methods article describing how to conduct a mini meta-analysis to inform a power analysis using R tools - something I did for my dissertation, but not something everyone knows how to do. (By working, I mean I have an outline and a few paragraphs written. But I'm hoping to have more time to dedicate to it in the near future. I'm toying with the idea of spending NaNoWriMo this year on scholarly pursuits, rather than a novel.)
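As a teaser for that article, here's the general shape of the approach in R, using the metafor and pwr packages (two of the many options out there); the effect sizes and variances below are invented purely for illustration:

library(metafor)
library(pwr)

d <- c(0.42, 0.31, 0.55, 0.28)       # hypothetical standardized mean differences
v <- c(0.020, 0.015, 0.030, 0.025)   # their sampling variances

# Pool the effects with a random-effects meta-analysis
pooled <- rma(yi = d, vi = v, method = "REML")
summary(pooled)

# Feed the pooled estimate into a power analysis for a new two-group study
pwr.t.test(d = as.numeric(pooled$b), power = 0.80, sig.level = 0.05)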

BTW, if you like free stuff, check out these free data science and statistics resources (and let me know if you know of any not on the list).

Monday, October 2, 2017

Mantis Follow-Up

I did a bit of research on the praying mantis, trying to find out all about the one I saw behind my apartment building yesterday:


First, I found out it's a Chinese mantis.

Second, I'm pretty sure this one is male. He's long and skinny, like many of the males I saw in pictures. Female mantises (mantids?) are generally bigger and wider. As a male mantis, he'll be doing a great job keeping the back porch free of moths and whatnot, and won't be doing some of the terrifying things female mantises do, like eating hummingbirds or, you know, their mate.

I didn't see him when I came home from dinner last night, but he was back in his same spot this morning. If I see him again, I'll have to name him and consider him the unofficial mascot of my building.

In the interest of being prepared, what should his name be?

Sunday, October 1, 2017

Cool Sighting Behind My Apartment Building

I've never seen a praying mantis in person - maybe behind glass once, but never out in the wild. This afternoon, I found one hanging around (literally) behind my apartment building. After snapping a quick picture with my phone, I ventured back out with a nicer camera; thankfully, it was still there:




Statistics Sunday: Free Data Science and Statistics Resources

I'm working on building up a list of some free resources for data science and/or statistics. Through the data science conference and a book I recently finished, I've learned about some awesome resources already - I know there's more out there, but I wanted to share what I've found so far. This will remain a living document that I'll continue to update as I discover more resources.


Data Science E-Books 

Many of these books are statistically-oriented, but then a big part of data science involves drawing conclusions from the data. Hence, the line between the list below and the next list on statistics resources may be a bit blurry.
  • Analyze Survey Data for Free edited by Anthony Joseph Damico - this edited online resource, which assumes knowledge of R, offers step-by-step instructions for exploring online survey data; entries are contributed by different users and some entries are still awaiting a contributor if you're so inclined!
  • Think Python by Allen B. Downey - an introduction to one of the most popular programming languages for data science, Python
  • Think Stats: Exploratory Data Analysis in Python by Allen B. Downey - an intro to stats and probability using Python, written by the same author as Think Python above; while this book is meant to introduce statistics to programmers, it could also be a good way for statisticians to get their feet wet in Python
  • Deep Learning by Ian Goodfellow, Yoshua Bengio, & Aaron Courville - a free e-book on machine learning, specifically deep learning
  • R for Data Science by Garrett Grolemund & Hadley Wickham - this book teaches you how to pull data into R, and clean, model, and visualize; this book was definitely talked up at the data science conference (thanks to a reader for sharing the link to the free e-book version!)
  • Ten Signs of Data Science Maturity by Peter Guerra & Kirk Borne - Borne's was one of my favorite presentations from the data science conference I attended; this e-book highlights what indicates an organization is ready to venture into data science 
  • The Elements of Statistical Learning: Data Mining, Inference, and Prediction by Trevor Hastie, Robert Tibshirani, & Jerome Friedman - predictive modeling and machine learning approaches
  • An Introduction to Statistical Learning with Applications in R by Gareth James, Daniela Witten, Trevor Hastie, & Robert Tibshirani - covers many of the same topics as Elements above, but geared more toward beginners in statistical learning; if these are new concepts for you, read this book before Elements of Statistical Learning
  • Python Programming WikiBook - another introduction to Python, which also includes extensions into other programming languages and additional resources/links
  • R Programming WikiBook - an introduction to programming in R, another popular programming language for data science
  • School of Data Handbook - this handbook, which goes along with the courses available through School of Data, offers recipes for scraping, cleaning, and filtering data to get you started on your data science journey

Statistics E-Books

  • Correlation and Causation: The Trouble with Story Telling by Lee Baker - a sort of follow-up to my previous discussion of spurious correlations, this book discusses the notion of probability and alternative explanations for correlations
  • The Probability Cheatsheet by William Chen - technically not an e-book; it's a short PDF document that summarizes key probability concepts, like Simpson's paradox, the Law of Large Numbers, and conditional probability
  • OpenIntro Statistics by David M. Diez, Christopher D. Barr, & Mine Çetinkaya-Rundel - a free introductory statistics textbook and additional statistical resources
  • Think Bayes: Bayesian Statistics Made Simple by Allen B. Downey - yet another free e-book from Downey (see Think Python and Think Stats above), introducing Bayes in mathematical notation (if you prefer mathematical notation when learning stats; not everyone does); it also uses Python for computer-aided analysis, so this book also straddles the statistics-data science line
  • Research and Statistical Support Services Short Courses by Richard Herrington & Jonathan Starkweather - also not exactly an e-book: this site, part of the R&SS at University of North Texas, contains multiple short documents teaching the basics of statistical software, and a few other computer tools that could aid in research
  • How to Share Data with a Statistician by Jeff Leek - this GitHub document describes how to format data to be shared with a statistician, in order to facilitate efficient and timely analysis
  • Introduction to Applied Bayesian Statistics and Estimation for Social Scientists by Scott M. Lynch - an introduction to Bayesian analysis and the use of what are called MCMC (Markov chain Monte Carlo) methods; this book starts with a refresher of classical statistics before introducing the Bayesian notion of probability
  • Learning Statistics with R by Daniel Navarro - what started off as lecture notes for an introductory statistics class taught with R became an e-book; there's even an R package (lsr) to go along with the book
Do you have any free resources you would recommend?

Like free stuff? Here are some free meta-analysis tools.

Saturday, September 30, 2017

Saturday Video Roundup: Bad Books, Space Ghosts, and Dog Islands

It's been a long week, so today I've been enjoying some funny videos. First of all, the hilarious Jenny Nicholson has been on a mission to find the worst book on Amazon - and I think she succeeded:


Honest Trailers did Star Trek: The Next Generation, and I spent the entire time watching the video laughing and/or saying, "OMG, I remember that episode!":


Finally, I discovered this movie, that I DEFINITELY have to see:


BTW, if the trailer itself doesn't convince you, check out that cast.

Friday, September 29, 2017

Just You Wait: Hamilton, Madison, and the Federalist Papers

Completely by accident, the last book I read and the one I'm about to finish have a common story - a study in the 1950s and 1960s that attempted to answer the question: who wrote the 12 Federalist Papers with disputed authorship, Alexander Hamilton or James Madison? First, some background, if you're not familiar with any of this.

In 1787 and 1788, a series of 85 essays, collectively known as "The Federalist" or "The Federalist Papers," were published under the pseudonym Publius. These essays were intended to support ratifying the Constitution and influence that voting process. It was generally known that the essays were written by three Founding Fathers: Alexander Hamilton, the 1st US Secretary of the Treasury; John Jay, the 1st Chief Justice of the US Supreme Court; and James Madison, 5th US Secretary of State and 4th US President. The question is, who wrote which ones?

The authorship was not in question for 73 of the essays; each of these essays had a unique member of the trio claiming authorship in the form of a list shared with the public later (in some cases, following the individual's death). The problem is that for 12 essays, both Hamilton and Madison claimed authorship.

Historians have debated this issue for a very long time. In the 1950s, two statisticians, Frederick Mosteller and David L. Wallace, decided to tackle the problem with data: the words themselves. I learned about the study, which produced an article (available here) and a book, first in Nabokov's Favorite Word is Mauve. In fact, that study was Ben Blatt's inspiration for the book, which involved analysis of the word usage patterns (as well as a few other interesting analyses) of literary and mainstream fiction.

But it was through the book I'm reading now that I learned their approach was Bayesian. I've written about Bayes' theorem (and twice more). Its focus is on conditional probability - the probability one thing will happen given another thing has happened. Bayesian statistics, or what's sometimes called Bayesian inference, uses these conditional probabilities, and allows analysts to draw upon other previously collected probabilities (called priors) that may be subjective (e.g., expert opinion, equal odds) or empirically based. Those prior probabilities are then used with the observed data to derive a posterior probability. Bayes' theorem was frequently used by cryptanalysts, including the code breakers at Bletchley Park (such as Alan Turing) who broke the Enigma code.

Mosteller and Wallace started off with subjective priors - they went in with the prior that each of the 12 disputed essays was equally likely to have been written by Hamilton or Madison. Then, they set out analyzing the known essays for word usage patterns, which gave them the likelihoods of seeing those patterns under each author. They found that Madison used 'whilst' and Hamilton used 'while.' Hamilton used 'enough' but Madison never did. They then examined the disputed essays, using these word usage patterns to test alternative scenarios: this essay was written by Madison versus this essay was written by Hamilton. They found that, based on word usage patterns, all 12 disputed essays were written by Madison, meaning Madison wrote 29 of the essays. That still leaves Hamilton with a very impressive 51.
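To see the logic of the updating (with numbers I'm inventing on the spot - nothing like Mosteller and Wallace's actual word-rate models), it looks something like this in R:

prior_odds <- 1           # even prior odds: equally likely Madison or Hamilton

# Suppose a disputed essay uses 'whilst', and we judge that usage to be
# 30 times more likely under Madison's writing habits than under Hamilton's
likelihood_ratio <- 30

posterior_odds <- prior_odds * likelihood_ratio
posterior_prob <- posterior_odds / (1 + posterior_odds)
posterior_prob            # roughly 0.97 in favor of Madison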

Overall, I highly recommend checking out The Theory that Would Not Die. I'll have a full review on Goodreads once I read the last 20 or so pages. And I think I'm ready to finally tackle learning Bayesian inference. I already have a book on the subject.

Thursday, September 28, 2017

Wear Your Best Pant Suit

File this under awesome:

Statistical Sins: Nicolas Cage Movies Are Making People Drown and More Spurious Correlations

As I posted yesterday, I attended an all-day data science conference online. I have about 11 pages of typed notes and a bunch of screenshots I need to weed through, but I'm hoping to post more about the conference, my thoughts and what I learned, in the coming days.

At work, I'm knee-deep in my Content Validation Study. More on that later as well.

In the meantime, for today's (late) Statistical Sins, here's a great demonstration of why correlation does not necessarily imply anything (let alone causation). I can't believe I didn't discover this site before now: Spurious Correlations. Here are some of my favorites:




As I mentioned in a previous post, a correlation - even a large correlation - can be obtained completely by chance. Statistics are based entirely on probabilities, and there's always a probability that we can draw the wrong conclusion. In fact, in some situations, that probability may be very high (even higher than our established Type I error rate). 
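You can watch this happen with a quick simulation in R: generate 20 variables that are pure noise, run all 190 pairwise correlation tests, and roughly 5% of them will come up "significant" anyway.

set.seed(2017)
noise <- matrix(rnorm(100 * 20), nrow = 100, ncol = 20)   # 100 cases, 20 unrelated variables

# p-values for every pairwise correlation test
pvals <- combn(20, 2, function(pair) cor.test(noise[, pair[1]], noise[, pair[2]])$p.value)

length(pvals)        # 190 tests
sum(pvals < 0.05)    # expect somewhere around 9 or 10 "significant" results by chance alone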

This is a possibility we always have to accept; we may conduct a study and find significant results completely by chance. So we never want to take a finding in isolation too seriously. It has to be further studied and replicated. This is why we have the scientific method, which encourages transparency of methods and analysis approach, critique by other scientists, and replication.

But then there are times we just run analyses willy-nilly, looking for a significant finding. When it's done for the purposes of the Spurious Correlations website, it's hilarious. But it's often done in the name of science. As demonstrated above, we must be very careful when we go fishing for relationships in the data. The analyses we use only tell us the likelihood that we would find a relationship of that size by chance (or, more specifically, if the null hypothesis is actually true). They don't tell us whether the relationship is real, no matter how small the p-value. When we knowingly cherry-pick findings and run correlations at random, we invite spurious correlations into our scientific undertaking.

This approach violates a certain kind of validity, often called statistical conclusion validity. We maximize this kind of validity when we apply the proper statistical analyses to the data and the question. Abiding by the assumptions of the statistic we apply is up to us. The statistics don't know. We're on the honor system here, as scientists. Applying a correlation or any statistic without any kind of prior justification to examine that relationship violates assumptions of the test.

So I'll admit, as interested as I am in the field of data science, I'm also a bit concerned about the high use of exploratory data analysis. I know there are some controls in place to reduce spurious conclusions, such as using separate training and test data, so I'm sure as I find out more about this field, I'll become more comfortable with some of these approaches. More on that as my understanding develops.

Wednesday, September 27, 2017

Data Science Today, Statistical Sins Tomorrow

Hi all! I'm attending an all-day data science conference, so I won't be able to post my regular Statistical Sins post. Check back tomorrow! In the meantime, here's my new favorite quote I learned through the conference:
"Statistical thinking will one day be as necessary for efficient citizenship as the ability to read and write."

--H.G. Wells

Tuesday, September 26, 2017

Banned Books Week: What Will You Be Reading?

September 24-30 is Banned Books Week.

Alison Flood of The Guardian explains:
[Banned Books Week] was launched in the US in 1982, to mark what the American Library Association (ALA) said was a sudden surge in attempts to have books removed or restricted in schools, bookshops and libraries. Since then, more than 11,300 books have been “challenged”, with last year’s most controversial title the award-winning graphic novel This One Summer by Mariko and Jillian Tamaki – “because it includes LGBT characters, drug use and profanity, and it was considered sexually explicit with mature themes,” said the ALA.

Almost all of the books on its annual list of challenged books are picture books and young adult novels, flagged because of sexual content, transgender characters or gay relationships. The only exception on this year’s list, the Little Bill series, was challenged because of the high-profile sexual harassment claims against their author, comedian Bill Cosby.
The Banned Books Week website offers many resources, including where to celebrate.

What will you be reading?

Monday, September 25, 2017

Conductor's Notes

When I'm not working or writing about statistics, I'm singing with the Apollo Chorus of Chicago. (I also serve on the board as Director of Marketing.) Last week, our PR Director sat down with Music Director, Stephen Alltop, to discuss our upcoming season.



Our first concert is at 2:30pm on Sunday, November 5th, at Holy Family Church in Chicago. Admission is FREE! So if you're in the Chicago-area, come check us out!

Sunday, September 24, 2017

Statistics Sunday: What is a Content Validation Study?

I've been a bit behind on blogging this week because we're starting up a content validation study for multiple exams at work. A content validation study is done to ensure the topics on a measure are relevant to the measure subject - basically we want to make sure we have all the content we should have (relevant content) and none of the content we shouldn't have (irrelevant content). For instance, if you were developing the new version of the SAT, you'd want to make sure that everything on the quantitative portion is relevant to the domain (math) and covers all the aspects of math that are important for a person going from high school to college.

For certification and licensing exams, the focus is on public safety. What tasks or knowledge are important for this professional to know in order to protect the public from harm? That helps narrow down the potential content. From there, we have many different approaches to find out what topics are important.

The first potential way is bringing in experts: people who have contributed to the knowledge base in that field, perhaps as an educator or researcher, or someone who has been in a particular field for a very long time. There are many potential ways to get information from them. You could interview them one-on-one, or have a focus group. You could use some of the more formal consensus-building approaches, like a Delphi panel. Or you could bring your experts in at different stages to influence and shape information obtained from another source.

Another potential way is to collect information on how people working in the field spend their time. This approach is often known as job analysis. Once again, there are many ways you can collect that information. You can shadow and observe people as they work, doing a modified version of a time-motion study. You could conduct interviews or focus groups with people working in the field. Or you could field a large survey, asking people to rate how frequently they perform a certain task and/or how essential it is to do that task correctly.

A related approach is to examine written materials, such as job descriptions, to see what sorts of things a person is expected to know or do as part of the job.

Of course, content validation studies are conducted for a variety of measures, not just exams. When I worked on a project in VA to develop a measure of comfort and tolerability of filtering respirator masks, we performed a multi-stage content validation study, using many of the approaches listed above. First, we checked the literature to see what research has been performed on comfort (or rather, discomfort) with these masks. We found that studies had shown people experienced things like sweaty faces and heat buildup, with some extreme outcomes like shortness of breath and claustrophobia. We created a list of everything we found in the literature, and wrote open-ended questions about them. Then, we used these questions to conduct 3 focus groups with healthcare workers who had to wear these masks as part of their jobs - basically anyone who works with patients in airborne isolation.

These results were used to develop a final list of symptoms and reactions people had as a result of wearing these masks, and we started writing questions about them. We brought in more people at different stages of the draft to look at what we had, provide their thoughts on the rating scales we used, and tell us whether we had all the relevant topics covered on the measure (that is, were we missing anything important or did we have topics that didn't fit?). All of these approaches help to maximize validity of the measure.

This is an aspect of psychometrics that isn't really discussed - the importance of having a good background in qualitative research methods. Conducting focus groups and interviews well takes experience, and being able to take narratives from multiple people and distill them down to key topics can be challenging. A knowledge of survey design is also important. There's certainly a methodological side to being a psychometrician - something I hope to blog about more in the future!

Saturday, September 23, 2017

Today's Google Doodle Honoring Asima Chatterjee

Today would have been the 100th birthday of Dr. Asima Chatterjee, the first Indian woman to earn a doctorate of science (Sc.D.) from an Indian university. And she's being honored in one of today's Google Doodles.

In fact, she broke the glass ceiling in many ways:
Despite resistance, Chatterjee completed her undergraduate degree in organic chemistry and went on to win many honours including India’s most prestigious science award in 1961, the annual Shanti Swarup Bhatnagar Prize for her achievements in phytomedicine. It would be another 14 years before another woman would be awarded it again.

According to the Indian Academy of Sciences, Chatterjee “successfully developed anti-epileptic drug, Ayush-56 from Marsilea minuta and the anti-malarial drug from Alstonia scholaris, Swertia chirata, Picrorhiza kurroa and Caesalpinia crista.”

Her work has contributed immensely to the development of drugs that treat epilepsy and malaria.

She was elected as the General President of the Indian Science Congress Association in 1975 – in fact, she was the first woman scientist to be elected to the organisation.

An outstanding contribution was her work on vinca alkaloids, which come from the Madagascar periwinkle plant. They are used in chemotherapy to assist in slowing down and halting cancer cells duplicating.
There are actually 2 Google Doodles today. The other celebrates Saudi Arabia National Day.

Friday, September 22, 2017

Statistical Sins Late Edition: Looking for Science in All the Wrong Places

Recently, the Pew Research Center released survey results on where people look for and how they evaluate science news.

First, some good news. About 71% of respondents said they found science news somewhat or very interesting, and 66% read science news at least a few times per month. Curiosity is the primary reason people said they attend to science news.

And more good news. People recognize that the general media is not a great place to get science news, with 62% believing it gets science information correct only about half the time (or less).

The bad news? More than half (54%) said general news sources are where they tend to get science information. And worse - most (68%) don't actively seek science news out, instead just encountering it by chance.


What does it say that the typical respondent gets their science news from a source they know can be very inaccurate and biased?

And as we've observed before with other topics, many beliefs about science and science reporting have become entangled with political ideology: people who consider themselves Republicans are less interested in scientific research on evolution, and are also more likely to believe that science researchers overstate the implications of their research.

This suggests that not only is it important to improve scientific literacy, it's also important for people with a good understanding of science to share and disseminate science news and information with the public. Most people are not going to seek out better sources, so even if they know the usual suspects are biased, they're still only seeing those sources. And it's possible that, even though they recognize the bias, the sources they're encountering are even more biased than they think. While we can't make anyone trust a source, we can at least give them the tools and sources they need to hear (and hopefully understand) the full story.

You can read the full report, with frequencies of responses, here.

Thursday, September 21, 2017

Great Minds in Statistics: Georg Rasch

Happy birthday, Georg Rasch, who would have been 116 today! Rasch was a Danish mathematician who contributed a great deal to statistics, and specifically a branch of statistics known as psychometrics. In fact, his most important contribution bears his name - the Rasch measurement model, which is a model that we use to analyze measures of ability (and later, other characteristics like personality).


Explaining everything about the Rasch model and how it's used to analyze measurement data would take far more than one blog post. (Don't worry, reader, I'm planning a series of posts.) But my goal for today is to explain the basic premise and tie it to things I've blogged about previously.

The Rasch model was originally developed for ability tests, where the outcomes are correct (1) or incorrect (0). Remember binary outcomes? This is one of those instances where the raw scale is both binary and ordinal. But you don't use items in isolation. You use them as part of a measure: the SAT, the GRE, a cognitive ability test.

So you might argue the outcome is no longer ordinal. It's the total number of correct answers. But not all items are created equal. Some items are easier than others. And for adaptive exams, which determine the next item to administer based on whether the previous item was answered correctly, you need to take into account item difficulty to figure out a person's score.

Adding a bunch of ordinal variables together doesn't necessarily mean you have a continuous variable. That final outcome could also be ordinal.

Rasch developed a logistic model that converts ordinal scales to interval scales. Each individual item now has an interval scale measure, and overall score (number of items correct) also has an interval scale. It does this by converting to a scale known as logits, which are log odds ratios. Item difficulty (the interval scale for items) and person ability (the interval scale for overall score) are placed on the same metric, so that you can easily determine how likely a person is to respond to a certain item correctly. If your ability, on the Rasch scale, is a logit of 1.5, that means an item of difficulty 1.5 on the Rasch scale is perfectly matched to your ability.

What does that mean in practice? It means you have a 50% chance of responding correctly to that item. That's how person ability is typically determined in the Rasch model; based on the way you respond to questions, your ability becomes the difficulty level where you answer about half the questions correctly.

Even better, if I throw an item of difficulty 1.3 at you, you'll have a better than 50% chance of responding correctly.

But I can be even more specific about that. Why? Because these values are log odds ratios, and because person ability and item difficulty sit on the same scale. First, I subtract item difficulty (which we symbolize as D) from your ability level (which we symbolize as B): B-D = 1.5-1.3. The resulting difference (0.2) is also a log odds ratio. It is the log transformation of the odds ratio that a person of ability B will answer an item of difficulty D correctly. I convert that back to a proportion, to get the probability that you will answer the item correctly, using this equation:

P(X = 1) = e^(B-D) / (1 + e^(B-D))
where P(X=1) refers to the probability of getting an item correct. This equation is slightly different from the one I showed you in the log odds ratio post (which was the natural number e raised to the power of the log odds ratio). Remember that equation was to convert a log odds ratio back to an odds ratio. This equation includes one additional step to convert back to a proportion.

If I plug my values into this equation, I can tell that you have a 55% chance of getting that question correct. This is one of the reasons the Rasch model does beautifully with missing data (to a point); if I know your ability level and the difficulty level of an item, I can compute how you most likely would have responded.
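
Just to make that arithmetic concrete, here's a tiny R sketch of the conversion (the ability and difficulty values are simply the ones from the example above):

```r
# Probability of a correct response under the dichotomous Rasch model:
# convert the log odds ratio (B - D) back to a proportion
rasch_prob <- function(ability, difficulty) {
  exp(ability - difficulty) / (1 + exp(ability - difficulty))
}

rasch_prob(1.5, 1.5)  # ability matches difficulty: 0.50
rasch_prob(1.5, 1.3)  # the example above: about 0.55
```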

Stay tuned for more information on Rasch later! And be sure to hug a psychometrician today!

Monday, September 18, 2017

Words, Words: From the Desk of a Psychometrician

I've decided to start writing more about my job and the types of things I'm doing as a psychometrician. Obviously, I can't share enough detail for you to know exactly what I'm working on, but I can at least discuss the types of tasks I encounter and the types of problems I'm called to solve. (And if you're curious what it takes to be a psychometrician, I'm working on a few posts on that topic as well.)

This week's puzzle: examining readability of exam items. Knowing as much as we do about the education level of our typical test-taker - and also keeping in mind that our exams are supposed to measure knowledge of a specific subject matter, as opposed to reading ability - it's important to know how readable are the exam questions. This information can be used when we revise the exam, and could also be used to update our exam item writing guides (creating a new guide is one of my ongoing projects).

Anyone who has looked at the readability statistics in Word knows how to get Flesch-Kincaid statistics: reading ease and grade level. Reading ease, developed by Rudolf Flesch, is a continuous value based on the average number of words per sentence and the average number of syllables per word; higher scores mean the text is easier to read. The US Navy, in work led by researcher J. Peter Kincaid, created grade levels based on the reading ease metric. So the grade level you receive through your analysis reflects the level of education needed to comprehend that text.
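
If you're curious how those two statistics are actually computed, here's a quick R sketch using the published formulas - the word, sentence, and syllable counts are just made-up inputs for illustration:

```r
# Flesch Reading Ease and Flesch-Kincaid Grade Level from basic counts
flesch_stats <- function(words, sentences, syllables) {
  wps <- words / sentences   # average words per sentence
  spw <- syllables / words   # average syllables per word
  list(
    reading_ease = 206.835 - 1.015 * wps - 84.6 * spw,
    grade_level  = 0.39 * wps + 11.8 * spw - 15.59
  )
}

# hypothetical exam item: 45 words, 3 sentences, 72 syllables
flesch_stats(words = 45, sentences = 3, syllables = 72)
```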

And to help put things in context, the average American reads at about a 7th grade level.

The thing about Flesch-Kincaid is that it isn't always well-suited for texts on specific subject matter, especially texts that have to use some level of jargon. In dental assisting, people will encounter words that refer to anatomy or devices used in dentistry. These multisyllabic words may not be familiar to the average person, and may result in higher Flesch-Kincaid grade levels (and lower reading ease), but for practicing dental assistants - who would learn these terms in training or on the job - they're not as difficult. And as others have pointed out, there are common multisyllabic words that aren't difficult. Many people - even people with low reading ability - probably know words like "interesting" (a 4-syllable word).

So my puzzle is to select readability statistics that are unlikely to be "tricked" by jargon, or at least to find some way to take that inflation into account. I've been reading about some of the other readability statistics - such as the Gunning FOG index, where FOG stands for (I'm not kidding) "Frequency of Gobbledygook." Gunning FOG is very similar to Flesch-Kincaid: it also takes into account average words per sentence but, instead of average syllables per word, looks at the percentage of complex (3+ syllable) words. But there are other potential readability statistics to explore. One thing I'd love to do is generate a readability index for each item in our exam pools. That information, along with the difficulty of the item and how it maps onto exam blueprints, could become part of the item metadata. But that's a long-term goal.
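
For comparison, here's the same kind of sketch for Gunning FOG (again with made-up counts; "complex" just means 3 or more syllables):

```r
# Gunning FOG index: 0.4 * (average sentence length + percent complex words)
gunning_fog <- function(words, sentences, complex_words) {
  0.4 * ((words / sentences) + 100 * (complex_words / words))
}

# same hypothetical item: 45 words, 3 sentences, 6 complex words
gunning_fog(words = 45, sentences = 3, complex_words = 6)
```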

To analyze the data, I've decided to use R (though Python and its Natural Language Processing tools are another possibility). Today I discovered the koRpus package (R package developers love capitalizing the r's in package names). And I've found the readtext package that can be used to pull in and clean text from a variety of formats (not just txt, but JSON, xml, pdf, and so on). I may have to use these tools for a text analysis side project I've been thinking of doing.
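
In case you're wondering what that workflow might look like, here's a rough sketch - the file name is hypothetical, and I'm assuming koRpus's tokenize() and readability() functions behave the way their documentation describes (I haven't put them through their paces yet):

```r
library(readtext)
library(koRpus)

# pull the text out of a (hypothetical) PDF of exam items
items <- readtext("exam_items.pdf")

# tokenize the raw text, then request a few readability indices
tokens <- tokenize(items$text, format = "obj", lang = "en")
readability(tokens, index = c("Flesch", "Flesch.Kincaid", "FOG"))
```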

Completely by coincidence, I also just started reading Nabokov's Favorite Word is Mauve, in which author Ben Blatt uses different text analysis approaches on classic and contemporary literature and popular bestsellers. In the first chapter, he explored whether avoidance of adverbs (specifically the -ly adverbs, which are despised by authors from Ernest Hemingway to Stephen King) actually translates to better writing. In subsequent chapters, he's explored differences in voice by author gender, whether great writers follow their own advice, and how patterns of word use can be used to identify authors. I'm really enjoying it.


Edit: I realized I didn't say more about getting Flesch-Kincaid information from Word. Go to Options, then Proofing, and select "Show readability statistics." You'll see a dialog box with this information after you run the Spelling and Grammar check on a document.

Sunday, September 17, 2017

Statistics Sunday: What Are Degrees of Freedom? (Part 2)

Last week, in part 1, I talked about degrees of freedom as the number of values that are free to vary. This is where the name comes from, of course, and this is still true in part 2, but there’s more to it than that, which I’ll talk about today.

In the part 1 example, I talked about why the degrees of freedom for a t-test are smaller than the sample size – 2 fewer, to be exact. It's because all but the last value in each group are free to vary. Once you get to that last value in determining the group mean, that value is already determined – from a statistical standpoint, that is. But that's not all there is to it. If that were all, we wouldn't really need a concept of degrees of freedom. We could just set up the table of t critical values by sample size instead of degrees of freedom.

And in fact, I’ve seen that suggested before. It could work in simple cases, but as many statisticians can tell you, real datasets are messy, rarely simple, and often require more complex approaches. So instead, we teach concepts that become relevant in complex cases using simple cases. A good way to get your feet wet, yes, but perhaps a poor demonstration of why these concepts are important. And confusion about these concepts - even among statistics professors - remains, because some of these concepts just aren't intuitive.

Degrees of freedom can be thought of as the number of independent values that can be used for estimation.

Statistics is all about estimation, and as statistics become more and more complex, the estimation process also becomes more complex. Doing all that estimating requires some inputs. The number of inputs places a limit on how many things we can estimate, our outputs. That’s what your degrees of freedom tells you – it’s how many things you can estimate (output) based on the amount of data you have to work with (input). It keeps us from double-dipping - you can't reuse the same information to estimate a different value. Instead, you have to slice up the data in a different way.

Degrees of freedom measures the statistical fuel available for the analysis.

For analyses like a t-test, we don’t need to be too concerned with degrees of freedom. Sure, it costs us 1 degree of freedom for each group mean we calculate, but as long as we have a reasonable sample size, the 2 degrees of freedom we lose won't cause us much worry. We need to know the degrees of freedom, of course, so we know which row to check in our table of critical values – but even that has become an unnecessary step thanks to computer analysis. Even when you’re using a different t-test approach that alters your degrees of freedom (like Welch’s t, which is used when your two groups’ variances aren’t equal – more on that test later, though I've mentioned it once before), it’s not something statisticians really pay attention to.

But when we start adding in more variables, we see our degrees of freedom decrease as we begin using those degrees of freedom to estimate values. We start using up our statistical fuel.

And if you venture into even more complex approaches, like structural equation modeling (one of my favorites), you’ll notice your degrees of freedom can get used up very quickly – in part because your input for SEM is not the individual data but a matrix derived from the data (specifically a covariance matrix, which I should also blog about sometime). That was the first time I remember being in a situation where my degrees of freedom didn't seem limitless, where I had to simplify my analysis – more than once – because I had used up all my degrees of freedom. Even very simple models could be impossible to estimate with the available degrees of freedom. I learned that degrees of freedom isn’t just some random value that comes along with my analysis.

It’s a measure of resources for estimation and those resources are limited.

For my fellow SEM nerds, I might have to start referring to saturated models – models where you’ve used up every degree of freedom – as “out of gas.”

Perhaps the best way to demonstrate degrees of freedom as statistical fuel is by showing how degrees of freedom are calculated for the analysis of variance (ANOVA). In fact, it was Ronald Fisher who came up with both the concept of degrees of freedom and the ANOVA (and the independent samples t-test referenced in part 1 and again above). Fisher also came up with the correct way to determine degrees of freedom for Pearson’s chi-square – much to the chagrin of Karl Pearson, who was using the wrong degrees of freedom for his own test.

First, remember that in ANOVA, we’re comparing our values to the grand mean (the overall mean of everyone in the sample, regardless of which group they fall in). Under the null hypothesis, this is our expected value for all groups in our analysis. That by itself uses 1 degree of freedom – the last value is no longer free to vary, as discussed in part 1 and reiterated above. (Alternatively, you could think of it as spending 1 degree of freedom to calculate that grand mean.) So our total degrees of freedom for ANOVA is N-1. That's always going to be our starting point. Now, we take that quantity and start partitioning it out to each part of our analysis.

Next, remember that in ANOVA, we’re looking for effects by partitioning variance – variance due to group differences (our between groups effect) and variance due to chance or error (our within group differences). Our degrees of freedom for looking at the between group effect is determined by how many groups we have, usually called k, minus 1.

Let’s revisit the movie theatre example from the ANOVA post.

Review all the specifics here, but the TL;DR is that you're at the movie theatre with 3 friends who argue about where to sit in the theatre: front, middle, or back. You offer to do a survey of people in these different locations to see which group best enjoyed the movie, because you're that kind of nerd.

If we want to find out who had the best movie-going experience of people sitting in the front, middle, or back of the theatre, we would use a one-way ANOVA comparing 3 groups. If k is 3, our between groups degrees of freedom is 2. (We only need two because we have the grand mean, and if we have two of the three group means - the between groups effect - we can figure out that third value.)

We subtract those 2 degrees of freedom from our total degrees of freedom. If we don’t have another variable we’re testing – another between groups effect – the remaining degrees of freedom can all go toward estimating within group differences (error). We want our error degrees of freedom to be large, because we take the within groups (error) variance and divide it by the within group degrees of freedom to get our error term. The more degrees of freedom we have here, the smaller that error term, and the more likely our statistic is to be significant.

But what if we had another variable? What if, in addition to testing the effect of seat location (front, middle, or back), we also decided to test the effect of gender? We could even test an interaction between seat location and gender to see if men and women have different preferences on where to sit in the theatre. We can do that, but adding those estimates in is going to cost us more degrees of freedom. We can't take any degrees of freedom from the seat location analysis - they're already spoken for. So we take more degrees of freedom from the leftover that goes toward error.

For gender, where k equals 2, we would need 1 degree of freedom. And for the interaction, seat location X gender, we would multiply the seat location degrees of freedom by the gender degrees of freedom, so we need 2 more degrees of freedom to estimate that effect. Whatever is left goes in the error estimate. Sure, our leftover degrees of freedom is smaller than it was before we added the new variables, but the error variance is also probably smaller. We’re paying for it with degrees of freedom, but we’re also moving more variance from the error row to the systematic row.
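
Here's the bookkeeping for that two-way design as a quick R sketch, using a made-up total sample size of 60 moviegoers:

```r
# Degrees of freedom for the 3 (seat location) x 2 (gender) ANOVA
N          <- 60      # hypothetical total sample size
k_location <- 3       # front, middle, back
k_gender   <- 2

df_total       <- N - 1                      # 1 df spent on the grand mean
df_location    <- k_location - 1             # 2
df_gender      <- k_gender - 1               # 1
df_interaction <- df_location * df_gender    # 2
df_error       <- df_total - df_location - df_gender - df_interaction  # 54

c(location = df_location, gender = df_gender,
  interaction = df_interaction, error = df_error)
```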

This is part of the trade-off we have to make in analyzing data – a trade-off between keeping the model simple and explaining as much variance as possible. In this regard, degrees of freedom become a reminder of that trade-off in action: a running tally of what you’re spending to carry out your planned analysis.

It's all fun and games until someone runs out of degrees of freedom.