We're now 4 posts into Blogging A to Z, and I haven't yet talked much about the assumptions of the Rasch model. Like any statistical model, Rasch has certain assumptions that must be met for the results to be valid. One of the key assumptions is that the items all measure the same thing - the same latent variable. Put another way, your measure should be unidimensional: it should assess only one dimension.
This is because, in order for the resulting item difficulties and person abilities to be valid - and comparable to each other - the items all have to be assessing the same thing. It wouldn't make sense to compare a score on a math test to a score on a reading test. And it wouldn't make sense to combine math and reading items into the same test; what exactly would that exam measure?
But the assumption of unidimensionality goes even further than that. You also want to be careful that each individual item measures only one thing. This is harder than it sounds. While a certain reading ability is needed for just about any test question, and some items will need to use jargon, a math item written at an unnecessarily high reading level is actually measuring two dimensions: math ability and reading ability. The same is true for poorly written items that give clues to the correct answer, or trick questions that trip people up even when they have the knowledge; your test then measures not only ability on the test topic, but a second ability: test savviness. The only thing that should impact whether a person gets an item correct is their ability level in that domain - that's an assumption we have to make for a measurement to be valid. If a person with high math ability gets an item wrong because of low reading ability, we've violated that assumption. And if a person with low ability gets an item right because there was an "all of the above" option, we've violated that assumption.
How do we assess dimensionality? One way is with principal components analysis (PCA), which I've blogged about before. As a demonstration, I combined two separate measures (the Satisfaction with Life Scale and the Center for Epidemiologic Studies Depression measure) and ran them through a Rasch analysis as though they were a single measure. The results couldn't have been more perfect if I'd tried. The PCA results showed 2 clear factors - one made up of the 5 SWLS items and one made up of the 16 CESD items. Here's some of the output Winsteps gave me for the PCA. The first part looks at explained variance and eigenvalues:
There are two things to look at when running a PCA for a Rasch measure:

1. The variance explained by the measures should be at least 40%.
2. The eigenvalue for the first contrast should be less than 2.

As you can see, the eigenvalue right next to the text "Unexplned variance in 1st contrast" is 7.76, while the values for the 2nd contrast (and on) are less than 2. That means we have two dimensions in the data - and those dimensions are separated out when we scroll down to look at the results of the first contrast. Here are those results in visual form - the letters in the plot refer to the items.
You're looking for two clusters of items, which we have in the upper left-hand corner and the lower right-hand corner. The next part identifies which items the letters refer to and gives the factor loadings:
The 5 items on the left side of the table are from the SWLS. The 16 on the right are all from the CESD. Like I said, perfect results.
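If you're not a Winsteps user and want to experiment with the idea, here's a minimal Python sketch of a residual PCA. To be clear, this is not the Winsteps algorithm: it simulates dichotomous items (the real SWLS and CESD use rating scales), approximates the Rasch parameters with quick log-odds estimates rather than full JMLE, and then inspects the eigenvalues of the standardized-residual correlations. Every name and number in it is made up for the demo.

```python
# Sketch: PCA of standardized Rasch residuals to flag a second dimension.
# Two latent traits drive two item clusters, mimicking the SWLS+CESD demo.
import numpy as np

rng = np.random.default_rng(42)
n_persons = 500
n_items_a, n_items_b = 5, 16          # mimic a 5-item and a 16-item scale

# Two latent traits, modestly (negatively) correlated,
# like satisfaction and depression
traits = rng.multivariate_normal([0, 0], [[1, -0.4], [-0.4, 1]], n_persons)

# Item difficulties; each item is driven by its own cluster's trait
b = np.concatenate([rng.normal(0, 1, n_items_a), rng.normal(0, 1, n_items_b)])
theta_per_item = np.concatenate(
    [np.repeat(traits[:, [0]], n_items_a, axis=1),
     np.repeat(traits[:, [1]], n_items_b, axis=1)], axis=1)

p_true = 1 / (1 + np.exp(-(theta_per_item - b)))
X = (rng.random(p_true.shape) < p_true).astype(float)   # 0/1 response matrix

# Quick log-odds approximations of ability and difficulty (not full JMLE)
person_p = X.mean(axis=1).clip(0.01, 0.99)
item_p = X.mean(axis=0).clip(0.01, 0.99)
theta_hat = np.log(person_p / (1 - person_p))
b_hat = np.log((1 - item_p) / item_p)
b_hat -= b_hat.mean()                                   # center difficulties

# Expected scores under a single Rasch dimension, then standardized residuals
p_hat = 1 / (1 + np.exp(-(theta_hat[:, None] - b_hat[None, :])))
z = (X - p_hat) / np.sqrt(p_hat * (1 - p_hat))

# PCA of the residual correlations: eigenvalues of the contrasts
eigvals = np.linalg.eigvalsh(np.corrcoef(z, rowvar=False))[::-1]
print("First five contrast eigenvalues:", np.round(eigvals[:5], 2))
```

With two distinct simulated traits like this, the first contrast eigenvalue should come out well above the 2.0 cutoff, echoing the 7.76 Winsteps reported for the combined SWLS+CESD "measure" above; if you rerun it with a single trait driving all 21 items, it should drop below 2.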
But remember that PCA is entirely data-driven, so it could identify additional factors that don't have any meaning beyond that specific sample. So the PCA shouldn't be the only piece of evidence for unidimensionality, and there might be cases where you forgo that analysis altogether. If you've conducted a content validation study, that can be considered evidence that these items all belong on the same measure and assess the same thing. That's because content validation studies are often based on expert feedback, surveys of people who work in the topic area being assessed, job descriptions, and textbooks. Combining and cross-validating these different sources can be much stronger evidence than an analysis that is entirely data dependent.
But the other question is, how do we ensure all of our items are unidimensional? Have clear rules for how items should be written. Avoid "all of the above" or "none of the above" as answer options - they're freebie points, because when they appear, they're usually the right answer. And if you just want to give away freebie points, why have a test at all? Also set clear rules about what terms and jargon can be used in the test, and run a readability analysis on your items - ignore those required terms and jargon, and try to make the reading level of everything else as consistent across the exam as possible (and as low as you can get away with). When I was creating trait measures in past research, the guideline from our IRB was usually that measures should be written at a 6th-8th grade reading level. Unless a measure is specifically assessing reading ability, that's probably a good guideline to follow.
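Here's a minimal sketch of what that readability check might look like in Python. The jargon list and example items are hypothetical, made up for the demo; the Flesch-Kincaid grade-level formula itself is standard, but the syllable counter is a crude vowel-group heuristic, so for real test development you'd want a proper readability tool.

```python
# Sketch: strip required jargon, then score each item's reading grade level
# so items written at an unnecessarily high level stand out.
import re

REQUIRED_JARGON = {"hypotenuse", "quadratic"}   # hypothetical allowed terms

def count_syllables(word: str) -> int:
    # Count groups of consecutive vowels as syllables (rough heuristic)
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fk_grade(text: str) -> float:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = [w for w in re.findall(r"[A-Za-z']+", text)
             if w.lower() not in REQUIRED_JARGON]  # ignore required jargon
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    # Standard Flesch-Kincaid grade-level formula
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words) - 15.59)

items = [
    "Find the length of the hypotenuse of a right triangle with legs 3 and 4.",
    "Utilizing the aforementioned methodology, ascertain the quadratic solution.",
]
for item in items:
    print(f"{fk_grade(item):5.1f}  {item}")
```

Items scoring well above your target band (say, grades 6-8) are candidates for rewriting in plainer language.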
Tomorrow, we'll talk about equating test items!