Deeply Trivial: A is for Ability

Welcome to April A to Z, where I'll go through the A to Z of Rasch! This is the measurement model I use most frequently at work, and it's a great way to develop a variety of measures. It started off in educational measurement, and many Rasch programs are housed in educational psychology and statistics departments. But Rasch is slowly making its way into other disciplines as well, and I hope to see more people using this measurement model. It provides some very powerful and useful statistics on measures and isn't really that difficult to learn.

A few disclaimers:

There are entire courses on Rasch measurement, and in fact, some programs divide Rasch up across several courses because there's much that can be learned on the topic. This blog series is really an introduction to the concepts, to help you get started and decide if Rasch is right for you. I won't get into the math behind it as much, but will try to use some data examples to demonstrate the concepts.
I try to be very careful about what data I present on the blog. My data examples are usually: data I personally own (that is, collected as part of my master's or doctoral study), publicly available data, or (most often) simulated data. None of the data I present will come from any of the exams I work on at my current position or past positions as a psychometrician. I don't own those data and can't share them.
Finally, these posts are part of a series in which I build on past posts to introduce new concepts. I won't always be able to say everything I want on a topic for a given post, because it ties into another post better. So keep reading - I'll try to make it clear when I'll get back to something later.

For my first post in the series, ability!

When we use measurement with people, we're trying to learn more about some underlying quality about them. We call this underlying quality we want to measure a "latent variable." It's something that can't be captured directly, but rather through proxies. In psychometrics, those proxies are the items. We go through multiple steps to design and test those items to ensure they get as close as we can to tapping into that latent variable. In science/research terms, the items are how we operationalize the latent variable: define it in a way that it can be measured.

In the Rasch approach to psychometrics, we tend to use the term "ability" rather than "latent variable." First, Rasch was originally designed for educational assessment - tests that had clear right and wrong answers - so it makes sense to frame performance on these tests as ability. Second, Rasch deals with probabilities that a person is able to answer a certain question correctly or endorse a certain answer on a test. So even for measures of traits, like attitudes and beliefs or personality inventories, people with more of a trait have a different ability to respond to different questions. Certain answers are easier for them to give because of that underlying trait.

In Rasch, we calibrate items in terms of difficulty (either how hard the question is or, for trait measures, how much of the trait is needed to respond in a certain way) and people in terms of ability. These two concepts are calibrated on the same scale, so once we have a person's ability, we can immediately determine how they're likely to respond to a given item. That scale is in a unit called a "logit," short for log odds unit: the natural log of the odds of a particular response. This conversion gives us a scale that is nearly linear. (Check out the graphs in the linked log odds ratio post to see what I mean.) Typically, when you look at your distribution of person ability measures, you'll see both negative and positive numbers. And while logits can theoretically range from negative infinity to positive infinity, you're more likely to see values from roughly -2 to +2.
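To make the shared scale concrete, here is a minimal sketch of the dichotomous (right/wrong) Rasch model, where the log odds of a correct response equal ability minus difficulty. The function names are my own, chosen for illustration, and not taken from any particular Rasch software.

```python
import math

def rasch_probability(ability, difficulty):
    """Probability of a correct (or endorsed) response under the
    dichotomous Rasch model. Both arguments are in logits."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def logit(p):
    """Convert a probability to log odds."""
    return math.log(p / (1.0 - p))

# A person whose ability matches the item's difficulty has a 50% chance.
print(rasch_probability(ability=0.0, difficulty=0.0))

# A person one logit above the item's difficulty does better than 50%.
print(round(rasch_probability(ability=1.0, difficulty=0.0), 3))
```

Because only the difference between ability and difficulty matters, a person at +1 facing an item at 0 has the same chance of success as a person at 0 facing an item at -1.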

That ability metric is calculated based on the items the person responded to (and especially how they responded) - for exams with right and wrong answers, their ability is the difficulty of the item at which they have a 50% chance of responding correctly. The actual analysis involved in creating these ability estimates uses some form of maximum likelihood estimation (there are different types, like Joint Maximum Likelihood Estimation, JMLE, or Pairwise Maximum Likelihood Estimation, PMLE, to name a couple), so it goes back and forth with different values until it gets estimates that best fit the data. This is both a strength and weakness of Rasch: it's an incredibly powerful analysis technique that makes full use of the data available, and handles missing data beautifully, but it couldn't possibly be done without a computer and capable software. In fact, Rasch has been around for decades (Georg Rasch published the model in 1960) - it just couldn't really be widely adopted until technology caught up.
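To show what "going back and forth with different values" can look like, here is a simplified sketch that estimates one person's ability by maximum likelihood, assuming the item difficulties are already known. This is not JMLE or PMLE (those estimate items and persons together); it is just a Newton-Raphson loop on a single person, and the function names and example difficulties are made up for illustration.

```python
import math

def rasch_probability(ability, difficulty):
    """Probability of a correct response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def estimate_ability(responses, difficulties, max_iter=50, tol=1e-6):
    """Maximum likelihood ability estimate (in logits) for one person.

    responses: 1 = correct, 0 = incorrect, None = missing (skipped).
    difficulties: known item difficulties in logits (assumed fixed).
    """
    # Missing responses are simply left out of the likelihood.
    answered = [(x, d) for x, d in zip(responses, difficulties) if x is not None]
    raw_score = sum(x for x, _ in answered)
    if raw_score == 0 or raw_score == len(answered):
        # All-wrong or all-right strings have no finite ML estimate.
        raise ValueError("no finite estimate for an extreme score")

    theta = 0.0  # starting guess
    for _ in range(max_iter):
        probs = [rasch_probability(theta, d) for _, d in answered]
        gradient = raw_score - sum(probs)              # observed minus expected score
        information = sum(p * (1.0 - p) for p in probs)
        step = gradient / information
        theta += step
        if abs(step) < tol:
            break
    return theta

# Hypothetical three-item test: easy, medium, hard.
print(estimate_ability([1, 1, 0], [-1.0, 0.0, 1.0]))
```

The loop stops when the person's expected score at the current ability matches their observed score. Skipped items just drop out of the calculation, which is one reason missing data are less of a problem here than with raw total scores.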

I'll come back to this concept later this month, but before I close on this post, I want to talk about one more thing: scores. When you administer a Rasch exam and compute person ability scores, you have logits that are very useful for psychometric purposes. But rarely would you ever show those logit scores to someone else, especially an examinee. A score of -2 on a licensing exam or the SAT isn't going to make a lot of sense to an examinee. So we convert those values to a scaled score, using some form of linear equation. That precise equation differs by exam. The SAT, for instance, ranges from 400 to 1600, the ACT from 1 to 36, and the CPA exam from 0 to 99. Each of these organizations has an equation that takes their ability scores and converts them to their scaled score metric. (And in case you're wondering if there's a way to back convert your scaled score on one of these exams to the logit, you'd need to know the slope and constant of the equation they use - not something they're willing to share, nor something you could determine from a single data point. They don't really want you looking under the hood on these types of things.)
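As a small illustration of that linear conversion, the sketch below applies a slope and constant to a logit. The numbers are invented for the example and are not the equation any real testing organization uses.

```python
def to_scaled_score(logit_ability, slope, constant):
    """Linear conversion from a logit ability to a scaled score."""
    return slope * logit_ability + constant

# Two different hypothetical equations...
equation_a = {"slope": 100, "constant": 500}
equation_b = {"slope": 50, "constant": 550}

# ...that give the same scaled score for a person at +1 logit,
print(to_scaled_score(1.0, **equation_a))
print(to_scaled_score(1.0, **equation_b))

# but different scaled scores for a person at -2 logits.
print(to_scaled_score(-2.0, **equation_a))
print(to_scaled_score(-2.0, **equation_b))
```

This is why a single data point isn't enough: with two unknowns (slope and constant), many different equations pass through the same point.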

Some questions you may have:

The linear equation mentioned above takes into account the possible range of scaled scores (such as 400 to 1600 on the SAT), as well as the actual abilities of the pilot or pretest sample you used. What if you administer the test (with the set difficulties of the items from pretesting), and you get someone outside of the range of abilities of the sample you used to create the equation? Could you get someone with a score below 400 or above 1600? Yes, you could. This is part of why you want a broad range of abilities in your pilot sample, to make sure you're getting all possible outcomes and get the most accurate equation you can. This is also where simulated data could be used, which we do use in psychometrics. Once you have those item difficulties set with pilot testing, you could then create cases that get, for instance, all questions wrong or all questions right. This would give you the lowest possible logit and the highest possible logit. You can then include those values when setting the equation. As long as the difficulties are based on real performance, it's okay to use simulated cases. Finally, that range of possible scores sets the minimum and maximum for reporting. Even if an ability score would translate to an SAT score below 400, the reported score will be automatically set at the minimum.
What about passing and failing for licensing and credentialing exams? How do ability scores figure into that? Shouldn't you just use questions people should know and forget about all this item difficulty and person ability nonsense? This is one way that Rasch and similar measurement models differ from other test approaches. The purpose of a Rasch test is to measure the person's ability, so you need a good range of item difficulties to determine what a person's true ability is. These values are not intended to tell you whether a person should pass or fail - that's a separate part of the psychometric process called standard setting, where a committee of subject matter experts determines how much ability is necessary to say a person has the required knowledge to be licensed or credentialed. This might be why Rasch makes more sense to people when you talk about it in the educational assessment sense: you care about what the person's ability is. But Rasch is just as useful in pass/fail exams, because of the wealth of information it gives you about your exam and items. You just need that extra step of setting a standard to give pass/fail information. And even in those educational assessments, there are generally standards, such as what a person's ability should be at their age or grade level. Rasch measurement gives you a range of abilities in your sample, which you can then use to make determinations about what those abilities should be.
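For the first question above, here is one simple way the scaling equation and the reporting limits could fit together. It assumes the lowest and highest logits are mapped exactly onto the ends of the reporting range, which is a simplification; the logit values of -4 and +4 are invented for the example, and `build_scaling` is a hypothetical helper, not something from a Rasch package.

```python
def build_scaling(min_logit, max_logit, min_scaled, max_scaled):
    """Build a linear logit-to-scaled-score conversion that maps the lowest
    and highest logits onto the ends of the reporting range."""
    slope = (max_scaled - min_scaled) / (max_logit - min_logit)
    constant = min_scaled - slope * min_logit

    def scale(logit_ability):
        raw = slope * logit_ability + constant
        # Reported scores never go outside the published range.
        return max(min_scaled, min(max_scaled, round(raw)))

    return slope, constant, scale

# Hypothetical extremes from simulated all-wrong and all-right cases.
slope, constant, scale = build_scaling(-4.0, 4.0, 400, 1600)

print(slope, constant)
print(scale(0.0))    # middle of the logit range
print(scale(-5.0))   # below the range used to set the equation
print(scale(5.0))    # above the range used to set the equation
```

Anyone whose ability falls outside the range used to set the equation still gets an ability estimate in logits; only the reported score is held at the minimum or maximum.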

Tomorrow, we'll dig more into item difficulties when we discuss item banks!


Monday, April 1, 2019
