Sunday, August 27, 2017

Statistics Sunday: Dealing with Missing Data

For my dissertation, participants read and completed a large packet. It included a voir dire questionnaire, abbreviated trial transcript, and post-trial questionnaire. Because I didn't have a grant or really any kind of externally contributed budget for the project, I copied and assembled the packets myself. To save paper (and money), I copied the materials two-sided. I put page numbers on the materials so that participants would (hopefully) notice the materials were front and back.

Sadly, not everyone did.

When I noticed after one of my sessions that people were not completing the back of the questionnaire, I added in arrows on the first page, to let them know there was material on the back. After that, the number of people skipping pages decreased, but still, some people would miss the back side of the pages.

Sometimes, despite your best efforts, you end up with missing data. Fortunately, there are things you can do about it.

What you can do about missing data depends in part on what kind of missingness we're talking about. There are three types of missing data:

Missing Completely at Random

In this case, missing information is not related to any other variables. It's rare to have this type of missing data - and that's actually okay, because there's not a lot you can do in this situation. Not only do you have missing data, there's no relationship between the data that is missing and the data that is not missing, meaning you can't use what data you have to fill in missing values. But you're also statistically justified in proceeding with what data you have. Your complete data is, in a sense, a random sample of all data from your group (which includes those missing values you didn't get to measure).

Missing at Random

‘Missing at random’ means that the chance a value is missing is related to observed variables, but not to the missing value itself. My dissertation data would fall in this category - at least, on the full pages that were skipped. People skipped those questions by accident, but because those questions were part of a questionnaire on a specific topic, the items are correlated with each other.

This means that I could use my complete data to fill in missing values. There are many methods for filling in missing values in this situation, though it should be kept in mind that many imputation methods, especially those that plug in a single best guess, will artificially decrease variability. You want to use this approach sparingly. I shouldn't use it to fill in entire pages' worth of questions, but could use it if a really important question or two was skipped. (By luck alone, all of the questions I had planned to include in analyses were on the front sides, and were as a result very rarely skipped.)

Missing Not at Random

The final situation occurs when the missing information is related to the missing values themselves or to another, unobserved variable. This is when people skip questions because they don't want to share their answer.

This is why I specified above that my data is only missing at random for those full pages. In those cases, people skipped the questions because they didn't realize they were there. But if I had a skipped question here and there (and I had a few), it could be because people didn't see it OR it could be because they don't want to share their answer. Without any data to justify one or the other, I have to assume it's the latter - if I'm being conservative, that is; lots of researchers with no data to justify it will assume data is missing at random and analyze away.

If I ask you about something very personal or controversial (or even illegal), you might skip that question. The people who do respond are generally the people with nothing to hide. They're going to be qualitatively different from people who don't want to share their answer. Methods to replace missing values will not be very accurate in this situation. The only thing you can do here is to try to prevent missing data from the beginning, such as with language in the consent document about how participants' data will be protected. If you can make the study completely anonymous (so that you don't even know who participated) that would be best. When that's not possible, you need strong assurances of confidentiality.
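To make the three types of missingness concrete, here is a small simulation sketch. Everything in it is made up for illustration: x stands in for a variable that is always observed, y for the item that gets skipped, and the skip probabilities are arbitrary. It compares the mean of the recorded answers with the mean you would have seen with no missing data, and then tries a simple adjustment that uses only the observed variable.

```python
import random
import statistics

random.seed(1)

n = 20000
# Hypothetical data: x is always observed; y is the item that may be skipped.
x = [random.gauss(0, 1) for _ in range(n)]
y = [0.6 * xi + random.gauss(0, 0.8) for xi in x]

def observed_mean(values, is_missing):
    """Mean of the values that were actually recorded."""
    return statistics.fmean([v for v, m in zip(values, is_missing) if not m])

def stratified_mean(values, is_missing, strata):
    """Observed mean within each stratum, weighted by the stratum's full size."""
    total = 0.0
    for level in set(strata):
        idx = [i for i, s in enumerate(strata) if s == level]
        obs = [values[i] for i in idx if not is_missing[i]]
        total += len(idx) / len(values) * statistics.fmean(obs)
    return total

# MCAR: every answer has the same 30% chance of being skipped.
mcar = [random.random() < 0.3 for _ in range(n)]
# MAR: the chance of skipping depends only on the observed variable x.
mar = [random.random() < (0.5 if xi > 0 else 0.1) for xi in x]
# MNAR: the chance of skipping depends on the unrecorded answer y itself.
mnar = [random.random() < (0.5 if yi > 0 else 0.1) for yi in y]

full_mean = statistics.fmean(y)
mcar_mean = observed_mean(y, mcar)
mar_mean = observed_mean(y, mar)
mnar_mean = observed_mean(y, mnar)

# Adjust using only what was observed: whether x is above zero.
x_positive = [xi > 0 for xi in x]
mar_adjusted = stratified_mean(y, mar, x_positive)
mnar_adjusted = stratified_mean(y, mnar, x_positive)

print(f"full {full_mean:.3f}  MCAR {mcar_mean:.3f}  MAR {mar_mean:.3f}  MNAR {mnar_mean:.3f}")
print(f"adjusted with x: MAR {mar_adjusted:.3f}  MNAR {mnar_adjusted:.3f}")
```

In this toy setup the MCAR mean lands close to the full-data mean, the raw MAR mean is off but comes back into line once the observed variable is taken into account, and the MNAR mean stays off even after the adjustment.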

How Do You Solve a Problem Like Missing Data?

First off, you can solve your missing data problems with imputation methods. Some are better than others, but I generally don't recommend these approaches because, as I said above, they artificially decrease variance. The simplest imputation method is mean replacement - you replace each missing value with the mean derived from non-missing values on that variable. This is based on the idea that "the expected value is the mean"; in fact, it's the most literal interpretation of that aspect of statistical inference. 
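Here is a minimal sketch of mean replacement, using simulated scores (the mean of 50, the standard deviation of 10, and the 30% skip rate are all arbitrary choices for illustration). It shows the variance problem directly: the mean is unchanged, but the standard deviation shrinks.

```python
import random
import statistics

random.seed(2)

# Hypothetical scores; roughly 30% are knocked out completely at random.
scores = [random.gauss(50, 10) for _ in range(1000)]
with_gaps = [s if random.random() > 0.3 else None for s in scores]  # None = skipped

observed = [s for s in with_gaps if s is not None]
observed_mean = statistics.fmean(observed)

# Mean replacement: every gap gets the same value, the mean of the observed scores.
imputed = [observed_mean if s is None else s for s in with_gaps]

sd_observed = statistics.stdev(observed)
sd_imputed = statistics.stdev(imputed)

print(f"mean: observed {observed_mean:.2f}, after imputation {statistics.fmean(imputed):.2f}")
print(f"SD:   observed {sd_observed:.2f}, after imputation {sd_imputed:.2f}")
```

Every imputed value sits exactly at the mean, so it adds a case to the sample without adding any spread.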

Another method, which is a more nuanced interpretation of "the expected value is the mean," is to use linear regression to predict scores on the variable with missingness using one or more variables with more complete data. So you conduct the analysis with people who have complete data, then use the regression equation you derived from those participants to predict what the score will be for someone with incomplete data. But regression is still built on means - it's just a more complex combination of means. Regression coefficients are simply the effect of one variable on another averaged across all participants. And outcomes are simply the mean of the y variable for people with a specific combination of scores on the x variables. Fortunately, in this case, you aren't using a one-size-fits-all approach, and you're introducing some variability into your imputed scores. But you're still artificially controlling your variance by, in a sense, creating a copy of another participant.
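A rough sketch of regression imputation with a single predictor follows. The data are simulated, and the intercept, slope, and skip rate are invented for illustration. The regression line is fit by ordinary least squares on the complete cases and then used to fill in the gaps.

```python
import random
import statistics

random.seed(3)

n = 1000
# Hypothetical data: x is fully observed; about 30% of y is missing (None).
x = [random.gauss(0, 1) for _ in range(n)]
y = [2.0 + 1.5 * xi + random.gauss(0, 1) for xi in x]
y_gaps = [yi if random.random() > 0.3 else None for yi in y]

# Fit y = intercept + slope * x using only the complete cases.
complete = [(xi, yi) for xi, yi in zip(x, y_gaps) if yi is not None]
mean_x = statistics.fmean([xi for xi, _ in complete])
mean_y = statistics.fmean([yi for _, yi in complete])
slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in complete)
         / sum((xi - mean_x) ** 2 for xi, _ in complete))
intercept = mean_y - slope * mean_x

# Fill each gap with the value the regression line predicts for that person's x.
imputed = [intercept + slope * xi if yi is None else yi
           for xi, yi in zip(x, y_gaps)]

filled_in = [v for v, yi in zip(imputed, y_gaps) if yi is None]
sd_true = statistics.stdev(y)            # what we would have seen with no gaps
sd_imputed = statistics.stdev(imputed)   # after regression imputation
sd_filled_in = statistics.stdev(filled_in)

print(f"SD of y: true {sd_true:.2f}, imputed {sd_imputed:.2f}, filled-in only {sd_filled_in:.2f}")
```

Unlike mean replacement, the filled-in values do vary from person to person, but they all fall exactly on the regression line, so the imputed variable is still less spread out than the real one would have been.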

Of course, you're better off using an analysis approach that can handle missing data. Some analyses can be set up to remove people with missing data "pairwise." This means that for a portion of analysis using two variables, the program uses anyone with complete data on those two variables. People are not removed completely if they have missing data; they're just only included in the parts of the analysis for which they have complete data and dropped from parts of the analysis where they don't. This will work for simpler analyses like correlations - it just means that your correlation matrix will be based on a varying number of people, depending on which specific pair of variables you're referring to.
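As a small illustration of pairwise deletion, the sketch below computes a correlation for each pair of items using whoever answered both. The three items and their responses are invented, and None marks a skipped item.

```python
from itertools import combinations
from math import sqrt

# Hypothetical responses from eight people; None marks a skipped item.
data = {
    "q1": [4, 5, None, 2, 3, 5, 1, None],
    "q2": [3, 5, 4, None, 3, 4, 2, 1],
    "q3": [None, 4, 5, 1, None, 5, 2, 2],
}

def pearson(pairs):
    """Pearson correlation for a list of (a, b) pairs with no missing values."""
    xs = [a for a, _ in pairs]
    ys = [b for _, b in pairs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    var_x = sum((a - mean_x) ** 2 for a in xs)
    var_y = sum((b - mean_y) ** 2 for b in ys)
    return cov / sqrt(var_x * var_y)

def pairwise_correlations(columns):
    """For each pair of variables, correlate using everyone who answered both."""
    results = {}
    for name_a, name_b in combinations(columns, 2):
        pairs = [(a, b) for a, b in zip(columns[name_a], columns[name_b])
                 if a is not None and b is not None]
        results[(name_a, name_b)] = (pearson(pairs), len(pairs))
    return results

results = pairwise_correlations(data)

# For comparison: how many people would survive dropping anyone with any gap.
listwise_n = sum(all(v is not None for v in row) for row in zip(*data.values()))

for (name_a, name_b), (r, n_used) in results.items():
    print(f"{name_a} x {name_b}: r = {r:.2f} (n = {n_used})")
print(f"complete on all three items: n = {listwise_n}")
```

Each correlation here rests on four or five people, while only three of the eight answered all three items, which is the varying sample size described above.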

More complex, iterative analyses can also handle some missing data, by changing which estimation method they use. (This is a more advanced concept, but I'm planning on writing about some of the estimation methods in the future - stay tuned!) Structural equation modeling analyses, for instance, can handle missing data, as long as the proportion of missing data in the dataset doesn't get too high.

And if you can use psychometric techniques with your data - that is, if your data examines measures of a latent variable - you're in luck, because my favorite psychometric technique, Rasch, can handle missing data beautifully. (To be fair, item response theory models can as well.) In fact, the assumption in many applications of the Rasch model is that you're going to have missing data, because it's often used on adaptive tests - adaptive meaning people are going to respond to different combinations of questions depending on their ability. 
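To give a feel for why missing responses are not a problem here, this is a simplified sketch of estimating one person's ability under the Rasch model using only the items that person actually answered. It assumes the item difficulties are already known, which is a simplification (in practice they are estimated from data), and the function names, difficulties, and response patterns are all made up for illustration.

```python
from math import exp

def rasch_probability(ability, difficulty):
    """Rasch model: chance of a correct answer given ability and item difficulty."""
    return 1.0 / (1.0 + exp(-(ability - difficulty)))

def estimate_ability(responses, difficulties, iterations=25):
    """Maximum-likelihood ability estimate based on the answered items only.

    responses: 1 = correct, 0 = incorrect, None = not answered or not given.
    Assumes at least one correct and one incorrect answer among the answered
    items; otherwise the estimate is not finite.
    """
    answered = [(r, d) for r, d in zip(responses, difficulties) if r is not None]
    ability = 0.0
    for _ in range(iterations):
        probs = [rasch_probability(ability, d) for _, d in answered]
        gradient = sum(r - p for (r, _), p in zip(answered, probs))
        information = sum(p * (1 - p) for p in probs)
        # Newton-Raphson step, capped at one logit to keep the updates stable.
        ability += max(-1.0, min(1.0, gradient / information))
    return ability

# Hypothetical item difficulties in logits, from easiest to hardest.
difficulties = [-2.0, -1.0, 0.0, 1.0, 2.0]

# Two people who each got 2 of 3 items right, but saw different items.
saw_easy_items = [1, 1, 0, None, None]
saw_hard_items = [None, None, 1, 1, 0]

ability_easy = estimate_ability(saw_easy_items, difficulties)
ability_hard = estimate_ability(saw_hard_items, difficulties)

print(f"ability (easier items): {ability_easy:.2f}")
print(f"ability (harder items): {ability_hard:.2f}")
```

Both people answered two of their three items correctly, but the person who saw the harder items ends up with the higher ability estimate, because each estimate is based on the difficulties of the items that person actually answered.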

I have a series of posts planned on Rasch, so I'll revisit this idea about missing data and adaptive tests later on. And I'm working on an article on how to determine if Rasch is right for you. The journal I'm shooting for is (I believe) open access, but I'm happy to share the article, even in draft form, with anyone who wants it. Just leave a comment below and I'll follow up with you on how to share it.


  1. Nice post. SEM and IRT models work under MAR for the same reason: They're both from a general class of models (generalized SEMs) that allow for full information maximum likelihood estimation. Missingness in exogenous variables is, however, much more problematic in an SEM, and they do not cope with MNAR situations. I'd highly recommend taking a look at the recent work by Craig Enders, who's synthesized a lot of the missing data literature and provided some new programs to do techniques like multiple imputation.

  2. Thanks, Jay, for sharing the work by Craig Enders! I'll definitely look into it. And look for a post (or perhaps a handful of posts) about estimation techniques in the future. My goal with this blog is to make these topics approachable to non-statisticians, so the struggle is in translating these topics into plain language. I hope to get to estimation techniques soon!