Wednesday, April 3, 2019

C is for Category Function

Up to now, I’ve been talking mostly about Rasch with correct/incorrect or yes/no data. But Rasch can also be used with measures using rating scales or where multiple points can be awarded for an answer. If all of your items have the same scale – that is, they all use a Likert scale of Strongly Disagree to Strongly Agree or they’re all worth 5 points – you can use the Rasch Rating Scale Model.

Note: If your items have differing scales, you could use a Partial Credit Model, which fits each item separately, or if you have sets of items worth the same number of points, you could use a Grouped Rating Scale model, which is sort of a hybrid of the RSM and PCM. I’ll try to touch on these topics later.

Again, in Rasch, every item has a difficulty and every person an ability. But for items worth multiple points or with rating scales, there’s a third thing, which is on the same scale as item difficulty and person ability – the difficulty level for each point on the rating scale. How much ability does a person need to earn all 5 points on a math item? (Or 4 points? Or 3 points? …) How much of the trait is needed to select “Strongly Agree” on a satisfaction with life item? (Or Agree? Neutral? ...) Each point on the scale is given a difficulty. When you examine these values, you’re looking at Category Function.
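To make the idea of a difficulty for each scale point a little more concrete, here is a minimal Python sketch of how the Rating Scale Model combines a person's ability, an item's difficulty, and a shared set of thresholds into a probability for each category. The function name and every number in it are made up for illustration; this is not Winsteps code or output.

```python
import math

def rsm_category_probs(theta, delta, thresholds):
    """Probability of each category (0..m) under the Rating Scale Model.

    theta      -- person ability, in logits
    delta      -- item difficulty, in logits
    thresholds -- the m thresholds shared by all items, in logits
    """
    # Running sums of (theta - delta - tau); the lowest category gets 0.
    cumulative = [0.0]
    for tau in thresholds:
        cumulative.append(cumulative[-1] + (theta - delta - tau))
    weights = [math.exp(c) for c in cumulative]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical 4-category item: ability, difficulty, and thresholds are invented.
for p in rsm_category_probs(theta=0.5, delta=0.0, thresholds=[-1.5, 0.0, 1.5]):
    print(round(p, 3))
```

One property worth noticing: when a person's ability minus the item difficulty lands exactly on a threshold, the two categories on either side of that threshold are equally likely. Above the threshold, the higher category becomes the more likely of the two.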

When you look at these category difficulties, you want to examine two things. First, you want to make certain that higher points on the scale require more ability or more of the trait. Your category difficulties should stairstep up. When your scale points do this, we say the scale is “monotonic” (or “proceeds monotonically”).
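If you have copied the category values out of your output, the stairstep check is easy to sketch in Python. The helper names and the two lists of values below are invented for illustration and are not taken from either measure discussed in this post.

```python
def is_monotonic(values):
    """True if every value is strictly greater than the one before it."""
    return all(later > earlier for earlier, later in zip(values, values[1:]))

def disordered_positions(values):
    """Positions (0-based) whose value fails to exceed the previous one."""
    return [i for i in range(1, len(values)) if values[i] <= values[i - 1]]

# Hypothetical values in logits, listed from the lowest category to the highest.
ordered = [-2.1, -0.8, 0.3, 1.1, 2.4]
disordered = [-1.9, -0.2, -0.6, 0.9, 1.8]

print(is_monotonic(ordered))
print(is_monotonic(disordered))
print(disordered_positions(disordered))
```

The same check works for observed measures and for thresholds, since both are expected to climb as you move up the scale.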

Let’s start by looking at a scale that does not proceed monotonically, where the category difficulties are disordered. There are two types of category function data you’ll look at. The first is the “observed measure,” which is the average ability of the people who selected that category. The second is the set of category thresholds: how much more of the trait is needed to select that particular category.

When I did my Facebook study, I used the Facebook Questionnaire (Ross et al., 2009), which is a 4-item measure assessing intensity of use and attitudes toward Facebook. All 4 items use a 7-point scale from Strongly Disagree to Strongly Agree. Just for fun, I decided to run this measure through a Rasch analysis in Winsteps and see how the categories function. Specifically, I looked at the thresholds. (I also looked at the observed measures, and they were monotonic, which is good. The thresholds were not; it can happen that one looks good and the other looks bad.) Because these are thresholds between categories, there isn’t one for the first category, Strongly Disagree. But there is one for each category after that, which reflects how much ability, or how much of the trait, a person needs to be more likely to select that category than the one below it. Here’s what those look like for the Facebook Questionnaire.

The threshold for the neutral category is lower than the one for Slightly Disagree. People are not using that category as I intended; perhaps they’re using it when they generally have no opinion, for instance, rather than when they’re caught directly between agreement and disagreement. If I were developing this measure, I might question whether to drop this category, or perhaps find a better descriptor for it. Regardless, I would probably collapse this category into another one (I usually choose which one based on frequencies), or possibly drop it, and rerun my analysis with a new 6-point scale to see if category function improves.
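Collapsing a category is just a recode of the raw responses before the analysis is rerun. Here is a rough Python sketch, assuming responses are coded 1 through 7 and that the category is merged into one of its immediate neighbors. The function name and the response data are hypothetical, and the choice of neighbor is left as an argument because it depends on your own frequencies.

```python
from collections import Counter

def collapse_category(responses, category, into):
    """Merge `category` into the neighboring category `into`, then renumber
    so the remaining codes stay consecutive."""
    if abs(category - into) != 1:
        raise ValueError("this sketch only merges neighboring categories")
    merged = [into if r == category else r for r in responses]
    # Shift every code above the removed category down by one.
    return [r - 1 if r > category else r for r in merged]

# Invented responses on a 7-point scale, where 4 is the neutral category.
responses = [1, 2, 3, 4, 4, 5, 6, 7, 7, 3]
print(Counter(responses))  # look at the frequencies before choosing a neighbor

recoded = collapse_category(responses, category=4, into=3)
print(recoded)             # now a 6-point scale, coded 1 through 6
```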

The second thing you want to look for is a good spread on those thresholds; you want them to be at least a certain number of logits apart. When you have more options on a rating scale, this adds additional cognitive effort to answer the question. So you want to make sure that each additional point on the rating scale actually gives you useful information: information that allows you to differentiate between people at one point on the ability scale and people at another. If two categories have basically the same threshold, it means people are having trouble differentiating the two; maybe they’re having trouble parsing the difference between “much of the time” and “most of the time,” leading people of approximately the same ability level to select these two categories about equally.

I’ve heard different guidelines on how big a “spread” is needed. Linacre, who created Winsteps, recommends 1.4 logits, and recommends collapsing categories until you’re able to attain this spread. That’s not always possible. I’ve also heard smaller values, such as 0.5 logits. But either way, you definitely don’t want two categories to have the exact same observed measure or category threshold.
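Whichever guideline you go with, the check itself is simple subtraction. Here is a small Python sketch that lists the adjacent thresholds sitting closer together than a chosen minimum. The function name and the threshold values are made up for illustration; the two cutoffs are the 1.4 and 0.5 logit guidelines mentioned above.

```python
def narrow_gaps(thresholds, min_gap):
    """Return (lower position, upper position, gap) for each pair of adjacent
    thresholds that sits less than `min_gap` logits apart."""
    flagged = []
    for i, (lower, upper) in enumerate(zip(thresholds, thresholds[1:])):
        gap = upper - lower
        if gap < min_gap:
            flagged.append((i, i + 1, gap))
    return flagged

# Hypothetical thresholds in logits, lowest to highest.
thresholds = [-2.0, -0.5, 0.0, 0.25, 2.0]

print(narrow_gaps(thresholds, min_gap=1.4))  # the stricter guideline
print(narrow_gaps(thresholds, min_gap=0.5))  # the looser guideline
```

A disordered pair shows up here as a negative gap, so it gets flagged under any positive cutoff.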

Also as part of the Facebook study, I administered the 5-item Satisfaction with Life Scale (Diener et al., 1985). Like the Facebook Questionnaire, this measure uses a 7-point scale (Strongly Disagree to Strongly Agree).

The middle categories are all closer together, and certainly don’t meet Linacre’s 1.4 logits guideline. I’m not as concerned about that, but I am concerned that Neither Agree nor Disagree and Slightly Agree are so close together. Just like above, where the category thresholds didn’t advance, there might be some confusion about what this “neutral” category really means. Perhaps this measure doesn’t need a 7-point scale. Perhaps it doesn’t need a neutral option. These are some issues to explore with the measure.

As a quick note, I don’t want it to appear I’m criticizing either measure. They were not developed with Rasch and this idea of category function is a Rasch-specific one. It might not be as important for these measures. But if you’re using the Rasch approach to measurement, these are ideas you need to consider. And clearly, these category function statistics can tell you a lot about whether there seems to be confusion about how a point on a rating scale is used or what it means. If you’re developing a scale, it can help you figure out what categories to combine or even drop.

Tomorrow’s post – dimensionality!


Diener, E., Emmons, R. A., Larsen, R. J., & Griffin, S. (1985). The Satisfaction with Life Scale. Journal of Personality Assessment, 49, 71-75.

Ross, C., Orr, E. S., Sisic, M., Arseneault, J. M., Simmering, M. G., & Orr, R. R. (2009). Personality and motivations associated with Facebook use. Computers in Human Behavior, 25, 578-586.


  1. Thanks Sara, have been following this series with interest. I found the threshold tables a bit hard to interpret in this post. Can you give an example of what "good" thresholds would look like in each?

