Thursday, January 18, 2018

Statistical Sins: Data Dictionaries and Variable Naming Conventions

Before I started at DANB, the group fielded a large survey involving thousands of participants from throughout the dental profession. My job the last few weeks has been to dig through this enormous dataset, testing some planned hypotheses.

Because SPSS is my coworkers' preferred statistical program, the data were given to me in an SPSS file. The nice thing about this is that one can easily add descriptive text for each variable, predefine missing values, and label factors. But this can also be a drawback when the descriptive text is far too long and is used to make up for nonintuitive variable names, as is the case with this dataset.

That is, in this dataset, the descriptive text is simply the full item text from the survey, copied and pasted, making for some messy output. Even worse, when the data were pulled into SPSS, each variable was named Q followed by a number. Unfortunately, many variables in the file don't correspond to survey questions, but they were still numbered in order, which makes the Q-numbering scheme meaningless. Responses to question 3 in the survey are in the variable Q5, for instance. Descriptive variable names are the best option (e.g., data from the question about gender is called "gender"). Failing that, a numbering scheme quickly becomes unwieldy unless the numbers can be linked to something, such as the item number on the survey. It's tempting to skip the step of naming each variable when working with extremely large datasets, but it's precisely when datasets are large that intuitive naming conventions are most necessary.
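To make the renaming idea concrete, here is a minimal sketch in Python rather than SPSS. The only detail taken from the dataset is that Q5 holds the responses to survey question 3. Every other name here (`RENAME_MAP`, `rename_variables`, `q03_response`, and the Q6-is-gender pairing) is a made-up placeholder for illustration.

```python
# Hypothetical mapping from opaque Q-numbers to descriptive names.
# Only "Q5 holds survey question 3" comes from the post; the rest is assumed.
RENAME_MAP = {
    "Q5": "q03_response",  # tied back to the survey item number
    "Q6": "gender",        # assumed pairing, purely illustrative
}


def rename_variables(rows, rename_map):
    """Return new rows with keys renamed; unmapped variables keep their names."""
    new_names = list(rename_map.values())
    if len(new_names) != len(set(new_names)):
        raise ValueError("two variables would end up with the same name")
    return [
        {rename_map.get(name, name): value for name, value in row.items()}
        for row in rows
    ]


# One fake respondent, standing in for a row of the real file.
rows = [{"Q1": "2017-06-01", "Q5": 4, "Q6": "F"}]
renamed = rename_variables(rows, RENAME_MAP)
```

Keeping the mapping in one place means the link between the old Q-number and the new name is written down once, instead of living in someone's head.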

I'm on a tight schedule - hence this rushed blog post - so I need to push forward with analysis. (I'm wondering if that's what happened with the last person, too, which would explain the haphazard nature of the dataset.) But I'm seriously considering stopping analysis so I can pull together a clear data dictionary that lists each variable name with shorter descriptive text, arranged in sample order rather than overall survey order. There are also a bunch of new items the previous analyst generated that don't look all that useful to me and make the dataset even more difficult to work with. At the very least, I'm probably going to pull together an abbreviated dataset that removes these vestigial variables.
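For what it's worth, a data dictionary and an abbreviated dataset don't have to be elaborate. Below is a rough sketch in Python using only the standard library. All variable names, labels, and the `keep` flag are hypothetical placeholders, not the actual contents of the survey file.

```python
import csv
import io

# Hypothetical data dictionary: one entry per variable, with a short label
# and a flag marking whether the variable belongs in the abbreviated dataset.
DATA_DICTIONARY = [
    {"variable": "gender", "source": "Q6", "label": "Respondent gender", "keep": True},
    {"variable": "q03_response", "source": "Q5", "label": "Survey question 3", "keep": True},
    {"variable": "Q41", "source": "Q41", "label": "Item added by previous analyst", "keep": False},
]


def dictionary_to_csv(entries):
    """Render the data dictionary as CSV text so it can be shared or saved."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["variable", "source", "label", "keep"],
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(entries)
    return buffer.getvalue()


def abbreviate(rows, entries):
    """Keep only flagged variables, in the order the dictionary lists them."""
    keep = [entry["variable"] for entry in entries if entry["keep"]]
    return [{name: row[name] for name in keep if name in row} for row in rows]


# One fake respondent, standing in for a row of the real file.
rows = [{"gender": "F", "q03_response": 4, "Q41": 0.5}]
short_rows = abbreviate(rows, DATA_DICTIONARY)
dictionary_csv = dictionary_to_csv(DATA_DICTIONARY)
```

Because the dictionary also drives which variables are kept, the documentation and the trimmed dataset can't drift apart.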
