Wednesday, March 14, 2018

Statistical Sins: Not Creating a Codebook

I'm currently preparing for Blogging A-to-Z. It's almost a month away, but I've picked a topic that will be fun but challenging, and I want to get as many posts written early as I can. I also have a busy April lined up, so writing posts during that month would be a challenge even if I had picked an easier topic.

I decided to pull out some data I collected for my Facebook study to demonstrate an analysis technique. I knew right away where the full dataset was stored, since I keep a copy on my online backup drive. This study used a long online survey, which was composed of several published measures. I was going through the dataset, identifying the variables associated with each measure and taking stock of which ones needed to be reverse-scored, as well as which ones belonged to subscales.

I couldn't find that information in my backup folder, but I knew exactly which measures I used, so I downloaded the articles from which those measures were drawn. As I was going through one of the measures, I realized that I couldn't match up my variables with the items as listed. The variable names didn't easily match up, and it looked like I had presented the items within the measure in a different order than they were listed in the article.

Why? I have no idea. I thought for a minute that past Sara was trolling me.

I went through the measure, trying to match up the variables, which I had named as an abbreviated version of the scale name followed by a "keyword" from the item text. But the keywords didn't always match up to any item in the list. Did I use synonyms? A different (newer) version of the measure? Was I drunk when I analyzed these data?

I frantically began digging through all of my computer folders, online folders, and email messages, desperate to find something that could shed light on my variables. Thank the statistical gods, I found a codebook I had created shortly after completing the study, back when I was much more organized (i.e., had more spare time). It's a simple codebook, but man, did it solve all of my dataset problems. Here's a screenshot of one of the pages:

As you can see, it's just a simple Word document with a table that gives the variable name, the original text of the item, the rating scale used for that item, and finally what scale (and subscale) it belongs to and whether it should be reverse-scored (noted with "R," under subscale). This page displays items from the Ten-Item Personality Inventory.
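If you'd rather keep the same information next to your analysis code, here is a minimal sketch in Python of what a machine-readable version of that table might look like. The variable names (`tipi_enthus`, `tipi_quiet`) and the helper functions are hypothetical, not the ones from my dataset, and the item wording and 1-7 rating scale are as I recall them from the published measure, so check them against the original article before relying on them. The sketch also assumes a subscale score is the mean of its items after reverse-scoring.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CodebookEntry:
    """One row of the codebook table."""
    variable: str    # column name in the dataset (hypothetical here)
    item_text: str   # wording shown to participants
    scale_min: int   # lowest point on the rating scale
    scale_max: int   # highest point on the rating scale
    scale: str       # measure the item belongs to
    subscale: str    # subscale within that measure
    reverse: bool    # True if the item is reverse-scored ("R")


# Illustrative entries only; names and wording are assumptions.
CODEBOOK = [
    CodebookEntry("tipi_enthus", "Extraverted, enthusiastic", 1, 7,
                  "TIPI", "Extraversion", False),
    CodebookEntry("tipi_quiet", "Reserved, quiet", 1, 7,
                  "TIPI", "Extraversion", True),
]


def score(entry, raw):
    """Return the scored value, flipping reverse-scored items."""
    if not entry.scale_min <= raw <= entry.scale_max:
        raise ValueError(f"{entry.variable}: {raw} is outside the rating scale")
    if entry.reverse:
        return entry.scale_min + entry.scale_max - raw
    return raw


def subscale_mean(codebook, row, scale, subscale):
    """Average the scored items that the codebook assigns to one subscale."""
    entries = [e for e in codebook
               if e.scale == scale and e.subscale == subscale]
    if not entries:
        raise KeyError(f"No codebook entries for {scale}/{subscale}")
    return sum(score(e, row[e.variable]) for e in entries) / len(entries)


# One made-up respondent, keyed by variable name.
respondent = {"tipi_enthus": 6, "tipi_quiet": 2}
extraversion = subscale_mean(CODEBOOK, respondent, "TIPI", "Extraversion")
```

The point of the design is that the reverse-scoring flag and subscale membership live in one place, so the scoring code never has to remember which items are flipped.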

Sadly, I'm not sure I'd take the time to do something like this now, which is a crime, because I could very easily run into this problem again - where I have no idea how/why I ordered my variables and no way to easily piece the original source material together. And as I've pointed out before, sometimes when I'm analyzing in a hurry, I don't keep well-labeled code showing how I computed different variables.

But all of this is very important to keep track of, and should go in a study codebook. At the very least, I would recommend keeping an annotated copy of each survey (source, scale/subscale, and whether reverse-coded - information you wouldn't want to be on the copy your participants see) along with the code/syntax for all analyses. Even if your annotations are a bunch of Word comment bubbles and your code/syntax is just a bunch of commands with no additional description, you'll be a lot better off than I was with only the raw data.
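One cheap safeguard along these lines is to check that every variable in the dataset actually has a codebook entry, and vice versa. Here is a rough sketch in Python; the function name `audit_codebook` and the column names are made up for illustration, and it assumes you can get both lists of names out of whatever tools you use.

```python
def audit_codebook(dataset_columns, codebook_variables):
    """Compare dataset column names against documented variable names."""
    data = set(dataset_columns)
    book = set(codebook_variables)
    return {
        # In the data, but nobody wrote down what it is
        "undocumented": sorted(data - book),
        # Documented, but not found in the data (renamed or dropped?)
        "missing_from_data": sorted(book - data),
    }


# Hypothetical names for illustration only.
columns = ["id", "tipi_enthus", "tipi_quiet", "fb_hours"]
documented = ["id", "tipi_enthus", "tipi_quiet", "tipi_calm"]

report = audit_codebook(columns, documented)
for problem, names in report.items():
    print(f"{problem}: {', '.join(names) or 'none'}")
```

Running something like this whenever the dataset changes would have flagged my mystery variables long before I needed them.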

I recently learned there's an R package that will create a formatted codebook from your dataset. I'll do some research into that package and have a post about it, hopefully soon.

And I sincerely apologize to past Sara for thinking she was trolling me. Lucky for me, she won't read this post. Unless, of course, O'Reilly Auto Parts really starts selling this product.


  1. What's the name of the R package in question?

  2. Yes, I would also be interested in that R package. We have been designing codebooks in our lab off of a good model, but it is certainly time-consuming up-front. It of course saves you a ton of time later, especially once you've spent some time away from a project/dataset.

  3. Yes, I should have included the name of the package - dataMaid. It was designed to help with data cleaning but generates reports that can also serve as a study codebook. I'm planning on writing a post soon detailing how to do just that. Thanks for reading!