I decided to pull out some data I collected for my Facebook study to demonstrate an analysis technique. I knew right away where the full dataset was stored, since I keep a copy in my backup online drive. This study used a long online survey composed of several published measures. I was going through the dataset, identifying the variables associated with each measure and taking stock of which ones needed to be reverse-scored and which also belonged to subscales.
I couldn't find that information in my backup folder, but I knew exactly which measures I used, so I downloaded the articles those measures were drawn from. As I was going through one of the measures, I realized I couldn't match my variables to the items as listed: the variable names didn't map cleanly onto the items, and it looked like I had presented the items in a different order than they appeared in the article.
Why? I have no idea. I thought for a minute that past Sara was trolling me.
I went through the measure, trying to match up the variables, which I had named using an abbreviated version of the scale name followed by a "keyword" from the item text. But the keywords didn't always match any item in the list. Did I use synonyms? A different (newer) version of the measure? Was I drunk when I analyzed these data?
I frantically began digging through all of my computer folders, online folders, and email messages, desperate to find something that could shed light on my variables. Thank the statistical gods, I found a codebook I had created shortly after completing the study, back when I was much more organized (i.e., had more spare time). It's a simple codebook, but man, did it solve all of my dataset problems. Here's a screenshot of one of the pages:
Sadly, I'm not sure I'd take the time to do something like this now, which is a crime, because I could very easily run into this problem again: a dataset where I have no idea how or why I ordered my variables and no easy way to piece together the original source material. And as I've pointed out before, sometimes when I'm analyzing in a hurry, I don't keep well-labeled code showing how I computed different variables.
But all of this is very important to keep track of, and it should go in a study codebook. At the very least, I would recommend keeping one annotated copy of each survey (noting the source, the scale/subscale, and whether each item is reverse-coded: information you wouldn't want on the copy your participants see) along with the code/syntax for all analyses. Even if your annotations are a bunch of Word comment bubbles and your code/syntax is just a series of commands with no additional description, you'll be a lot better off than I was with only the raw data.
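To make that concrete, here's a minimal sketch of what even lightly annotated scoring syntax might look like in R. The measure, the variable names like ses_worth, and the 1-7 response scale are all hypothetical, but the naming convention is the one I described above:

```r
## Toy data standing in for a survey export: a hypothetical "Self-Esteem
## Scale" (ses), with items named as scale abbreviation + item keyword.
dat <- data.frame(
  ses_worth   = c(6, 5, 7),  # Item 1: "...satisfied with myself..." (not reversed)
  ses_failure = c(2, 3, 1)   # Item 3: "...feel that I am a failure..." (REVERSE-SCORED)
)

## Reverse-score: on a 1-7 scale, reversed = (1 + 7) - raw = 8 - raw
dat$ses_failure_r <- 8 - dat$ses_failure

## Scale score: mean of the items, substituting reversed versions where needed
dat$ses_total <- rowMeans(dat[, c("ses_worth", "ses_failure_r")], na.rm = TRUE)
```

Even comments that terse record the source item, the response scale, and the reverse-scoring, which is exactly the information I couldn't reconstruct from my raw data.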
I recently learned there's an R package that will create a formatted codebook from your dataset. I'll do some research into that package and have a post about it, hopefully soon.
And I sincerely apologize to past Sara for thinking she was trolling me. Lucky for me, she won't read this post. Unless, of course, O'Reilly Auto Parts really starts selling this product.