Deeply Trivial: I is for Item Fit

Rasch gives you lots of item-level data. Not only difficulties, but Rasch analysis will also produce fit indices, for both items and persons. Just like the log-likelihood chi-square statistic that tells you how well your data fit the Rasch model, you also receive item fit indices, which compare observed to expected (based on the Rasch model) responses. These indices are also based on chi-square statistics.

There are two types of fit indices: INFIT and OUTFIT.

OUTFIT is sensitive to Outliers. They are responses that fall outside of the targeted ability level, such as a high ability respondent missing an item targeted to their ability level, or a low ability respondent getting a difficult item correct. This could reflect a problem with the item - perhaps it's poorly worded and is throwing off people who actually know the information. Or perhaps there's a cue that is leading people to the correct answer who wouldn't otherwise get it right. These statistics can cue you in to problems with the item.

INFIT (Information weighted) is sensitive to responses that are too predictable. These items don't tell you anything you don't already know from other items. Every item should contribute to the estimate. More items is not necessarily better - this is one way Rasch differs from Classical Test Theory, where adding more items increases reliability. The more items you give a candidate, the greater your risk of fatigue, which will lead reliability (and validity) to go down. Every item should contribute meaningful, and unique, data. These statistics cue you in on items that might not be necessary.

The expected value for both of these statistics is 1.0. Any items that deviate from that value might be problematic. Linacre recommends a cut-off of 2.0, where any items that have an INFIT or OUTFIT of 2.0 or greater should be dropped from the measure. Test developers will sometimes adopt their own cut-off values, such as 1.5 or 1.7. If you have a large bank, you can probably afford to be more conservative and drop items above 1.5. If you're developing a brand new test or measure, you might want to be more lenient and use the 2.0 cut-off. Whatever you do, just be consistent and cite the literature whenever you can to support your selected cut-off.

Though this post is about item fit, these same statistics also exist for each person in your dataset. A misfitting person means the measure is not functioning the same for them as it does for others. This could mean the candidate got lazy and just responded at random. Or it could mean the measure isn't valid for them for some reason. (Or it could just be chance.) Many Rasch purists see no issue with dropping people who don't fit the model, but as I've discovered when writing up the results of Rasch analysis for publication, reviewers don't take kindly to dropping people unless you have other evidence to support it. (And since Rasch is still not a well-known approach, they mean evidence outside of Rasch analysis, like failing a manipulation check.)

The best approach I've seen is once again recommended by Linacre: persons with very high OUTFIT statistics are removed and ability estimates from the smaller sample are cross-plotted against the estimates from the full sample. If removal of these persons has little effect on the final estimates, these persons can be retained, because they don't appear to have any impact on the results. That is, they're not driving the results.

If there is a difference, Linacre recommends next examining persons with smaller (but still greater than 2.0) OUTFIT statistics and cross-plotting again. Though there is little guidance on how to define very high and high, in my research, I frequently use an OUTFIT of 3.0 for ‘very high’ and 2.0 for ‘high.’ In my experience, the results of such sensitivity analysis never shows any problem, and I'm able to justify keeping everyone in the sample. This seems to make both reviewers and Rasch purists happy.

Deeply Trivial

Wednesday, April 10, 2019

I is for Item Fit

No comments:

Post a Comment