Monday, April 15, 2019

J is for Journal of Applied Measurement (and Other Resources)

As with many fields and subfields, Rasch has its own journal - the Journal of Applied Measurement, which publishes a variety of articles either using or describing how to use Rasch measurement. You can read the table of contents for JAM going back to its inaugural issue here.

But JAM isn't the only resource available for Rasch users. First off, JAM Press publishes multiple books on different aspects of Rasch measurement.

But the most useful resource by far is Rasch Measurement Transactions, which goes back to 1987 and is freely available. These are shorter articles dealing with hands-on topics. If I don't know how to do something in Rasch, this is a great place to check. And you can always find the articles on a certain topic via Google, by setting the site to search as "".

Finally, there is a Rasch message board, which is still active, where you can post questions (and answer them if you feel so inclined!).

As you can see, I'm a bit behind on A to Z posts. I'll be playing catch up this week!

Wednesday, April 10, 2019

I is for Item Fit

Rasch gives you lots of item-level data. In addition to difficulties, Rasch analysis produces fit indices, for both items and persons. Just like the log-likelihood chi-square statistic that tells you how well your data fit the Rasch model overall, item fit indices compare observed responses to the responses expected under the Rasch model. These indices are also based on chi-square statistics.

There are two types of fit indices: INFIT and OUTFIT.

OUTFIT is sensitive to Outliers: unexpected responses on items far from the respondent's ability level, such as a high-ability respondent missing a very easy item, or a low-ability respondent getting a very difficult item correct. This could reflect a problem with the item - perhaps it's poorly worded and is throwing off people who actually know the information. Or perhaps there's a cue that leads people to the correct answer who wouldn't otherwise get it right. These statistics can cue you in to problems with the item.

INFIT (Information-weighted) is sensitive to responses that are too predictable. These items don't tell you anything you don't already know from other items. Every item should contribute to the estimate, and more items is not necessarily better - this is one way Rasch differs from Classical Test Theory, where adding items increases reliability. The more items you give a candidate, the greater the risk of fatigue, which drags both reliability and validity down. Every item should contribute meaningful, unique data. These statistics cue you in on items that might not be necessary.

The expected value for both of these statistics is 1.0. Any items that deviate from that value might be problematic. Linacre recommends a cut-off of 2.0, where any items that have an INFIT or OUTFIT of 2.0 or greater should be dropped from the measure. Test developers will sometimes adopt their own cut-off values, such as 1.5 or 1.7. If you have a large bank, you can probably afford to be more conservative and drop items above 1.5. If you're developing a brand new test or measure, you might want to be more lenient and use the 2.0 cut-off. Whatever you do, just be consistent and cite the literature whenever you can to support your selected cut-off.
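
As a sketch of how such a screen might look in practice (the item IDs, fit values, and screening code are all hypothetical - programs like Winsteps report these statistics in their item tables):

```python
# Hypothetical item fit table: (item ID, INFIT mean-square, OUTFIT mean-square).
# The expected value of both statistics is 1.0; items at or above the chosen
# cut-off get flagged for review or removal.
item_fit = [
    ("ITEM01", 0.95, 1.02),
    ("ITEM02", 1.60, 1.85),
    ("ITEM03", 1.10, 2.40),
]

CUTOFF = 2.0  # Linacre's recommendation; test developers sometimes use 1.5 or 1.7

flagged = [item for item, infit, outfit in item_fit
           if infit >= CUTOFF or outfit >= CUTOFF]
print(flagged)  # ['ITEM03']
```

With a 1.5 cut-off, ITEM02 would be flagged as well - which is exactly the bank-size trade-off described above.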

Though this post is about item fit, these same statistics also exist for each person in your dataset. A misfitting person means the measure is not functioning the same for them as it does for others. This could mean the candidate got lazy and just responded at random. Or it could mean the measure isn't valid for them for some reason. (Or it could just be chance.) Many Rasch purists see no issue with dropping people who don't fit the model, but as I've discovered when writing up the results of Rasch analysis for publication, reviewers don't take kindly to dropping people unless you have other evidence to support it. (And since Rasch is still not a well-known approach, they mean evidence outside of Rasch analysis, like failing a manipulation check.)

The best approach I've seen is once again recommended by Linacre: persons with very high OUTFIT statistics are removed and ability estimates from the smaller sample are cross-plotted against the estimates from the full sample. If removal of these persons has little effect on the final estimates, these persons can be retained, because they don't appear to have any impact on the results. That is, they're not driving the results. 

If there is a difference, Linacre recommends next examining persons with smaller (but still greater than 2.0) OUTFIT statistics and cross-plotting again. Though there is little guidance on how to define very high and high, in my research I frequently use an OUTFIT of 3.0 for ‘very high’ and 2.0 for ‘high.’ In my experience, such a sensitivity analysis never shows a problem, and I'm able to justify keeping everyone in the sample. This seems to make both reviewers and Rasch purists happy.
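
In software you'd literally cross-plot the two sets of ability estimates; a minimal numeric version of the same check might look like this (the person IDs and estimates are made up):

```python
# Sketch of Linacre's sensitivity check: re-estimate abilities after dropping
# persons with very high OUTFIT (e.g., >= 3.0), then compare the estimates for
# the retained persons. If the estimates barely move, the misfitting persons
# aren't driving the results and can be kept.
full_sample = {"P01": -0.42, "P02": 1.10, "P03": 0.35}  # estimates, full sample
reduced     = {"P01": -0.40, "P02": 1.12, "P03": 0.33}  # after dropping misfits

max_shift = max(abs(full_sample[p] - reduced[p]) for p in reduced)
print(round(max_shift, 2))  # 0.02 logits - removal has little effect
```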

Tuesday, April 9, 2019

H is for How to Set Up Your Data File

The exact way you set up your data depends, of course, on the software you use. But my focus today is on things to think about when setting up your data for Rasch analysis.

First, know how your software needs you to format missing values. Many programs will let you simply leave a blank space or cell. Winsteps is fine with a blank space to denote a missing value or skipped question. Facets, on the other hand, will flip out at a blank space and needs a missing-value code set up (I usually use 9).

Second, the ordering of the file is very important, especially if you're working with data from a computer adaptive test, which also makes handling missing values important. When someone takes a computer adaptive test, their first item is drawn at random from a set of moderately difficult items. The difficulty of the next item depends on how they did on the first, but even then, the item is randomly drawn from a set or range of items. So when you set up your data file, you need to be certain that all people who responded to a specific item have that response in the same column (not necessarily the position where the item appeared in their exam).

This is why you need to be meticulously organized with your item bank and give each item an identifier. When you assemble responses from computer adaptive tests, you'll need to reorder people's responses. That is, you'll set up an order for every item in the bank by identifier. When data are compiled, each person's responses are put in that order, and if a particular item in the bank wasn't administered, there will be a space or missing value in its place.
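
Here's a minimal sketch of that reordering step (the person IDs, item IDs, and missing-value code are hypothetical):

```python
# Hypothetical CAT response log: (person, item bank ID, score). The item's
# bank ID - not the order it was administered in - determines its column.
log = [
    ("P1", "MATH003", 1),
    ("P1", "MATH001", 0),
    ("P2", "MATH002", 1),
    ("P2", "MATH003", 1),
]

bank_order = ["MATH001", "MATH002", "MATH003"]
MISSING = 9  # a Facets-style missing-value code; Winsteps accepts a blank

# Build one fixed-width row per person, in bank order, padding unseen items.
responses = {}
for person, item, score in log:
    responses.setdefault(person, {})[item] = score

rows = {p: [seen.get(item, MISSING) for item in bank_order]
        for p, seen in responses.items()}
print(rows)  # {'P1': [0, 9, 1], 'P2': [9, 1, 1]}
```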

Third, be sure you differentiate between item variables and other variables, like person identifiers, demographics, and so on. Once again, know your software. You may find that a piece of software just runs an entire dataset as though all variables are items, meaning you'll get weird results if you have a demographic variable mixed in. Others might let you select certain variables for the analysis and/or categorize variables as items and non-items.

I tend to keep a version of my item set in Excel, with a single variable at the beginning for participant ID. Excel is easy to import into most software, and I can simply delete the first column if a particular program doesn't allow non-item variables. If I drop any items (which I'll talk more about tomorrow), I do it from this dataset. A larger dataset, with all items, demographic variables, and so on, is usually kept in SPSS, since that's the preferred software at my company (I'm primarily an R user, but I'm the only one, and R can read SPSS files directly), in case I ever need to pull in additional variables for group comparisons. This dataset is essentially the master, and any smaller files I need are built from it.

Monday, April 8, 2019

G is for Global Fit Statistics

One of the challenges of Blogging A to Z is that the posts have to go in alphabetical order, even if it would make sense to start with a topic from the middle of the alphabet. It's more like creating the pieces of a puzzle, that could (and maybe should) be put together in a different order than they were created. But like I said, it's a challenge!

So now that we're on letter G, it's time to talk about a topic that I probably would have started with otherwise: what exactly is the Rasch measurement model? Yes, it is a model that allows you to order people according to their abilities (see letter A) and items according to their difficulties. But more than that, it's a prescriptive model of measurement - it contains within it the mathematical properties of a good measure (or at least, one definition of a good measure). This is how Rasch differs from Item Response Theory (IRT) models, though Rasch is often grouped into IRT despite its differences. You see, mathematically, Rasch is not very different from the IRT 1-parameter model, which focuses on item difficulty (and by extension, person ability). But philosophically, it is very different, because while Rasch is prescriptive, IRT is descriptive. If an IRT 1-parameter model doesn't adequately describe the data, you could just select a different IRT model. But Rasch says that the data must fit its model, and it gives you statistics to tell you how well they do. If your data don't fit the model, the deficiency is with the data (and your measure), not the model itself.
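
For dichotomous items, the model the data must fit is remarkably simple: the log-odds of a correct response equal person ability minus item difficulty. A minimal sketch (the abilities and difficulties here are made up):

```python
import math

def rasch_probability(ability, difficulty):
    """Probability of a correct response under the dichotomous Rasch model:
    P(X = 1) = exp(B - D) / (1 + exp(B - D)), with person ability B and
    item difficulty D both expressed in logits."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# When ability equals difficulty, the probability is exactly 0.5.
print(rasch_probability(1.0, 1.0))  # 0.5
# A person 1 logit above the item's difficulty succeeds about 73% of the time.
print(round(rasch_probability(2.0, 1.0), 2))  # 0.73
```

Note there is no discrimination or guessing parameter anywhere in the formula - which is exactly the point made in the note below about how Rasch differs from 2- and 3-parameter IRT.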

Note: This is outside the scope of this blog series, but in IRT, the second parameter is item discrimination (how well the item differentiates between high- and low-ability candidates) and the third is the pseudo-guessing parameter (the likelihood you'd get an answer correct based on chance alone). The Rasch model assumes that the item discrimination for all items is 1.0 and does not make any corrections for potential guessing. You know how the SAT penalizes you for wrong answers? It's to discourage guessing. They don't want you answering a question if you don't know the answer; a lucky guess is not a valid measure. What can I say, man? We psychometricians are a**holes.

When you use Rasch, you're saying that you have a good measure when it gives you data that fits the Rasch model. Poor fit to the Rasch model means you need to rework your measure - perhaps dropping items, collapsing response scales, or noting inconsistencies in person scores that mean their data might not be valid (and could be dropped from the analysis).

For Blogging A to Z 2017, I went through the alphabet of statistics, and for the letter G, I talked about goodness of fit. In Rasch, we look at our global fit statistics to see how well our data fit the prescribed Rasch model. If our data don't fit, we start looking at why and retooling our measure so it does.

The primary global fit statistic to look at is the log-likelihood chi-square, which, as the name implies, is based on the chi-square distribution. A significant chi-square statistic in this case means the data significantly differ from the model. Just like in structural equation modeling, it is a measure of absolute fit.

There are other fit statistics you can look at, such as the Akaike Information Criterion (AIC) and the Schwarz Bayesian Information Criterion (BIC). These statistics are used for model comparison (relative fit), where you might test different Rasch approaches to see which best describes the data (such as a Rating Scale Model versus a Partial Credit Model) or see whether changes to the measure (like dropping items) result in better fit. These values are derived from the log-likelihood statistic and either the degrees of freedom for the AIC or the number of non-extreme cases (in Rasch, extreme cases are those that got every item right or every item wrong) for the BIC. (You can find details and the formulas for AIC and BIC here.)
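
The standard formulas are AIC = -2LL + 2p and BIC = -2LL + p·ln(n), where p is the number of parameters; following the convention described above, n is counted as the number of non-extreme cases. A small sketch (the log-likelihood and counts are made up):

```python
import math

def aic(log_likelihood, n_params):
    """Akaike Information Criterion: -2*LL + 2*p."""
    return -2 * log_likelihood + 2 * n_params

def bic(log_likelihood, n_params, n_nonextreme):
    """Bayesian Information Criterion: -2*LL + p*ln(n), with n counted
    here as the number of non-extreme cases."""
    return -2 * log_likelihood + n_params * math.log(n_nonextreme)

# Hypothetical comparison of two models fit to the same data (say, a Rating
# Scale Model vs. a Partial Credit Model). Lower values = better relative fit.
print(round(aic(-1200.0, 25), 1))       # 2450.0
print(round(bic(-1200.0, 25, 300), 1))  # 2542.6
```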

BIC seems to be the preferred metric here, since it accounts for extreme cases; a measure with lots of extreme cases is not as informative as a measure with few extreme cases, so this metric can help you determine if dropping too easy or too difficult items improves your measure (it probably will, but this lets you quantify that).

Tomorrow, I'll talk about setting up your data for Rasch analysis.

Sunday, April 7, 2019

F is for Facets

So far this month, I've talked about the different things that affect the outcome of measurement - that is, that determine how someone will respond. Those things so far are item difficulty and person ability. How much ability a person has, or how much of the trait they possess, affects how they respond to items of varying difficulty. Each of the things that interact to affect the outcome of Rasch measurement is called a "facet."

But these don't have to be the only facets in a Rasch analysis. You can have additional facets, making for a more complex model. In our content validation studies, we administer a job analysis survey asking people to rate different job-related tasks. As is becoming standard in this industry, we use two different scales for each item, one rating how frequently this task is performed (so we can place more weight on the more frequently performed items) and one rating how critical it is to perform this task competently to protect the public (so we can place more weight on the highly critical items). In this model, we can differentiate between the two scales and see how the scale used changes how people respond. This means that rating scale also becomes a facet, one with two levels: frequency scale and criticality scale.

When conducting a more complex model like this, we need software that can handle these complexities. The people who brought us Winsteps, the software I use primarily for Rasch analysis, also have a program called Facets, which can handle these more complex models.

In a previous blog post, I talked about a facets model I was working with, one with four facets: person ability, item difficulty, rating scale, and timing (whether they received frequency first or second). But one could use a facets model for other types of data, like judge rating data. The great thing about using facets to examine judge data is that one can also partial out concepts like judge leniency; that is, some judges go "easier" on people than others, and a facets model lets you model that leniency. You would just need to have your judges rate more than one person in the set and have some overlap with other judges, similar to the overlap I introduced in the equating post.

This is what I love about Rasch measurement: it's a unique approach to measurement that can expand in complexity to fit whatever measurement situation you're presented with. It's all based on the Rasch measurement model, a mathematical model that represents the characteristics a "good measure" should possess - that's what we'll talk about tomorrow when we examine global fit statistics!

Friday, April 5, 2019

E is for Equating

In the course of any exam, new items and even new forms have to be written. The knowledge base changes and new topics become essential. How do we make sure these new items contribute to the overall test score? Through a process called equating.

Typically in Rasch, we equate test items by sprinkling new items in with old items. When we run our Rasch analysis on items and people, we "anchor" the old items to their already established item difficulties. When item difficulties are calculated for the new items, they are now on the exact same metric as the previous items, and new difficulties are established relative to the ones that have already been set through pretesting.

It's not even necessary for everyone to receive all pretest items - or even all of the old items. You just need enough overlap to create links between old and new items. In fact, when you run data from a computer adaptive test, there are a lot of "holes" in the data, creating a sparse matrix.

In a dataset like this, few examinees receive the exact same combination of items, but with enough overlap (and enough examinees, of course), we can estimate item difficulties for the new items in the set.

But you may also ask about the "old" items - do we always stick with the same difficulties or do we change those from time to time? Whenever you anchor items to existing item difficulties, the program will still estimate fresh item difficulties and let you examine something called "displacement": how much the newly estimated difficulty differs from the anchored one. You want to look at these and make sure you're not experiencing what's called "item drift," which happens when an item becomes easier or harder over time.
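
A displacement check could be sketched like this (the item IDs, difficulties, and the 0.5-logit screening value are all hypothetical - conventions for how much displacement is tolerable vary):

```python
# Displacement = freshly estimated difficulty minus the anchored value.
# Large absolute displacements suggest an item may be drifting easier or
# harder over time and should be reviewed.
anchored  = {"ITEM01": -1.20, "ITEM02": 0.40, "ITEM03": 1.10}
estimated = {"ITEM01": -1.15, "ITEM02": 1.05, "ITEM03": 1.08}

SCREEN = 0.5  # illustrative screening value, not a universal cut-off
drifting = [item for item in anchored
            if abs(estimated[item] - anchored[item]) >= SCREEN]
print(drifting)  # ['ITEM02'] - it now looks much harder than its anchor
```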

This definitely happens. Knowledge that might have previously been considered difficult can become easier over time. Math is a great example. Many of the math classes my mom took in high school and college (and that may have been electives) were offered to me in middle school or junior high (and were mandatory). Advances in teaching, as well as better understanding of how these concepts work, can make certain concepts easier.

On the flipside, some items could get more difficult over time. My mom was required to take Latin in her college studies. I know a little "church Latin" but little else. Items about Latin would have been easier for my mom's generation, when the language was still taught, than mine. And if we could hop in a time machine and go back to when scientific writing was in Latin (and therefore, to be a scientist, you had to speak/write Latin fluently), these items would be even easier for people.

Essentially, even though Rasch offers an objective method of measurement and a way to quantify item difficulty, difficulty is still a relative concept and can change over time. Equating is just one part of the type of testing you must regularly do with your items to maintain a test or measure.

Thursday, April 4, 2019

D is for Dimensionality

We're now 4 posts into Blogging A to Z, and I haven't really talked much about the assumptions of Rasch. That is, like any statistical test, there are certain assumptions you must meet in order for results to be valid. The same is true of Rasch. One of the key assumptions of Rasch is that the items all measure the same thing - the same latent variable. Put another way, your measure should be unidimensional: only assessing one dimension.

This is because, in order for the resulting item difficulties and person abilities to be valid - and comparable to each other - they have to be assessing the same thing. It wouldn't make sense to compare one person's math test score to a reading test score. And it wouldn't make sense to combine math and reading items into the same test; what exactly does the exam measure?

But the assumption of unidimensionality goes even further than that. You also want to be careful that each individual item is only measuring one thing. This is harder than it sounds. While a certain reading ability is needed for just about any test question, and some items will need to use jargon, a math item written at an unnecessarily high reading level is actually measuring two dimensions: math ability and reading ability. The same is true for poorly written items that give clues to the correct answer, or trick questions that trip people up even when they have the knowledge. Your test then measures not only ability on the test topic, but a second ability: test savviness. The only thing that should affect whether a person gets an item correct is their ability level in that domain. That is an assumption we have to make for a measurement to be valid. If a person with high math ability gets an item incorrect because of low reading ability, we've violated that assumption. And if a person with low ability gets an item correct because there was an "all of the above" option, we've violated that assumption.

How do we assess dimensionality? One way is with principal components analysis (PCA), which I've blogged about before. As a demonstration, I combined two separate measures (the Satisfaction with Life Scale and the Center for Epidemiologic Studies Depression measure) and ran them through a Rasch analysis as though they were a single measure. The results couldn't have been more perfect if I tried. The PCA results showed 2 clear factors - one made up of the 5 SWLS items and one made up of the 16 CESD items. Here's some of the output Winsteps gave me for the PCA. The first looks at explained variance and Eigenvalues:

There are two things to look at when running a PCA for a Rasch measure: the variance explained by the measures should be at least 40%, and the Eigenvalue for the first contrast should be less than 2. As you can see, the Eigenvalue next to the text "Unexplned variance in 1st contrast" is 7.76, while the values for the 2nd contrast (and on) are less than 2. That means we have two dimensions in the data - and those dimensions are separated out when we scroll down to look at the results of the first contrast. Here are those results in visual form - the letters in the plot refer to the items.

You're looking for two clusters of items, which we have in the upper left-hand corner and the lower right-hand corner. The next part identifies which items the letters refer to and gives the factor loadings:

The 5 items on the left side of the table are from the SWLS. The 16 on the right are all from the CESD. Like I said, perfect results.
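
To show the idea behind the check (this is not Winsteps's actual computation - Winsteps runs the PCA on standardized Rasch residuals), here's a toy sketch that builds two artificial item clusters, like the SWLS/CESD combination above, and extracts Eigenvalues from the item correlation matrix:

```python
import numpy as np

# All data here are simulated. Two clusters of items, each driven by its own
# latent factor plus noise - a deliberately two-dimensional "measure."
rng = np.random.default_rng(1)
cluster_a = rng.normal(size=(100, 1)) + rng.normal(scale=0.3, size=(100, 5))
cluster_b = rng.normal(size=(100, 1)) + rng.normal(scale=0.3, size=(100, 16))
data = np.hstack([cluster_a, cluster_b])

# Eigen-decomposition of the item correlation matrix, sorted largest first.
corr = np.corrcoef(data, rowvar=False)
eigenvalues = np.sort(np.linalg.eigvalsh(corr))[::-1]
print(np.round(eigenvalues[:3], 1))
# Two Eigenvalues well above 2 reflect the two artificial clusters;
# everything below 2 is noise - the same logic as the first-contrast rule.
```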

But remember that PCA is entirely data-driven, so it could identify additional factors that don't have any meaning beyond that specific sample. So the PCA shouldn't be the only piece of evidence for unidimensionality, and there might be cases where you forgo that analysis altogether. If you've conducted a content validation study, that can be considered evidence that these items all belong on the same measure and assess the same thing. That's because content validation studies are often based on expert feedback, surveys of people who work in the topic area being assessed, job descriptions, and textbooks. Combining and cross-validating these different sources can be much stronger evidence than an analysis that is entirely data dependent.

But the other question is: how do we ensure all of our items are unidimensional? Have clear rules for how items should be written. Avoid "all of the above" or "none of the above" as answers - they're freebie points, because when they're an option, they're usually the right one. And if you just want to give away freebie points, why have a test at all? Also make clear rules about what terms and jargon can be used in the test, and run readability analyses on your items - setting aside those required terms and jargon, try to make the reading level of everything else as consistent across the exam as possible (and as low as you can get away with). When I was working on creating trait measures in past research, the guideline from our IRB was usually that measures should be written at a 6th-8th grade reading level. And unless a measure is of reading level itself, that's probably a good guideline to follow.

Tomorrow, we'll talk about equating test items!

Wednesday, April 3, 2019

C is for Category Function

Up to now, I’ve been talking mostly about Rasch with correct/incorrect or yes/no data. But Rasch can also be used with measures using rating scales or where multiple points can be awarded for an answer. If all of your items have the same scale – that is, they all use a Likert scale of Strongly Disagree to Strongly Agree or they’re all worth 5 points – you can use the Rasch Rating Scale Model.

Note: If your items have differing scales, you could use a Partial Credit Model, which fits each item separately, or if you have sets of items worth the same number of points, you could use a Grouped Rating Scale model, which is sort of a hybrid of the RSM and PCM. I’ll try to touch on these topics later.

Again, in Rasch, every item has a difficulty and every person an ability. But for items worth multiple points or with rating scales, there’s a third thing, which is on the same scale as item difficulty and person ability – the difficulty level for each point on the rating scale. How much ability does a person need to earn all 5 points on a math item? (Or 4 points? Or 3 points? …) How much of the trait is needed to select “Strongly Agree” on a satisfaction with life item? (Or Agree? Neutral? ...) Each point on the scale is given a difficulty. When you examine these values, you’re looking at Category Function.

When you look at these category difficulties, you want to examine two things. First, you want to make certain that higher points on the scale require more ability or more of the trait. Your category difficulties should stairstep up. When your scale points do this, we say the scale is “monotonic” (or “proceeds monotonically”).

Let’s start by looking at a scale that does not proceed monotonically, where the category difficulties are disordered. There are two types of category function data you’ll look at. The first is the “observed measure”: the average ability of the people who selected that category. The second is the category threshold: how much more of the trait is needed to select that particular category. When I did my Facebook study, I used the Facebook Questionnaire (Ross et al., 2009), a 4-item measure assessing intensity of use and attitudes toward Facebook. All 4 items use a 7-point scale from Strongly Disagree to Strongly Agree. Just for fun, I decided to run this measure through a Rasch analysis in Winsteps and see how the categories function. Specifically, I looked at the thresholds. (I also looked at the observed measures, which were monotonic - good. But the thresholds were not. That can happen, where one looks fine and the other doesn’t.) Because these are thresholds between categories, there isn’t one for the first category, Strongly Disagree. But there is one for each category after that, reflecting how much ability or how much of the trait a person needs to be more likely to select that category than the one below it. Here’s what those look like for the Facebook Questionnaire.

The threshold for the neutral category is lower than for slightly disagree. People are not using that category as I intended them to – perhaps they’re using it when they generally have no opinion, for instance, rather than when they’re caught directly between agreement and disagreement. If I were developing this measure, I might question whether to drop this category, or perhaps find a better descriptor for it. Regardless, I would probably collapse this category into another one (which I usually determine based on frequencies), or possibly drop it, and rerun my analysis with a new 6-point scale to see if category function improves.

The second thing you want to look for is a good spread on those thresholds; you want them to be at least a certain number of logits apart. When you have more options on a rating scale, this adds additional cognitive effort to answer the question. So you want to make sure that each additional point on the rating scale actually gives you useful information – information that allows you to differentiate between people at one point on the ability scale and others. If two categories have basically the same threshold, it means people are having trouble differentiating the two; maybe they’re having trouble parsing the difference between “much of the time” and “most of the time,” leading people of approximately the same ability level to select these two categories about equally.

I’ve heard different guidelines on how big a “spread” is needed. Linacre, who created Winsteps, recommends 1.4 logits, and recommends collapsing categories until you’re able to attain this spread. That’s not always possible. I’ve also heard smaller, such as 0.5 logits. But either way, you definitely don’t want two categories to have the exact same observed measure or category threshold.
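
Both checks are easy to script once you have the thresholds. A sketch (the threshold values are invented to mirror the kind of disordering described above):

```python
# Hypothetical thresholds for a 7-point scale, in logits. There are 6 of
# them because the lowest category has no threshold; thresholds[0] is the
# threshold into category 2, thresholds[1] into category 3, and so on.
thresholds = [-2.10, -1.30, -1.55, -0.20, 1.40, 2.90]

# 1. Monotonicity: each threshold should be higher than the one before it.
disordered = [i + 2 for i in range(1, len(thresholds))
              if thresholds[i] <= thresholds[i - 1]]
print(disordered)  # [4] - category 4's threshold sits below category 3's

# 2. Spread: adjacent thresholds should be a minimum distance apart
# (Linacre suggests 1.4 logits; smaller values like 0.5 are also used).
MIN_SPREAD = 1.4
narrow = [(i + 1, i + 2) for i in range(1, len(thresholds))
          if thresholds[i] - thresholds[i - 1] < MIN_SPREAD]
print(narrow)  # [(2, 3), (3, 4), (4, 5)] - middle categories crowd together
```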

Also as part of the Facebook study, I administered the 5-item Satisfaction with Life Scale (Diener et al., 1985). Like the Facebook Questionnaire, this measure uses a 7-point scale (Strongly Disagree to Strongly Agree).

The middle categories are all closer together, and certainly don’t meet Linacre’s 1.4 logits guideline. I’m not as concerned about that, but I am concerned that Neither Agree nor Disagree and Slightly Agree are so close together. Just like above, where the category thresholds didn’t advance, there might be some confusion about what this “neutral” category really means. Perhaps this measure doesn’t need a 7-point scale. Perhaps it doesn’t need a neutral option. These are some issues to explore with the measure.

As a quick note, I don’t want it to appear that I’m criticizing either measure. Neither was developed with Rasch, and this idea of category function is a Rasch-specific one; it might not be as important for these measures. But if you’re using the Rasch approach to measurement, these are ideas you need to consider. And clearly, these category function statistics can tell you a lot about whether there’s confusion over how a point on a rating scale is used or what it means. If you’re developing a scale, they can help you figure out which categories to combine or even drop.

Tomorrow’s post – dimensionality!


Diener, E., Emmons, R. A., Larsen, R. J., & Griffin, S. (1985). The Satisfaction with Life Scale. Journal of Personality Assessment, 49, 71-75.

Ross, C., Orr, E. S., Sisic, M., Arseneault, J. M., Simmering, M. G., & Orr, R. R. (2009). Personality and motivations associated with Facebook use. Computers in Human Behavior, 25, 578-586.

Tuesday, April 2, 2019

B is for Bank

As I alluded to yesterday, in Rasch, every item gets a difficulty, and every person taking (some set of) those items gets an ability. They're both on the same scale, so you can compare the two: you can determine which item is right for a person based on their ability, and estimate a person's ability based on how they perform on the items. Rasch uses a form of maximum likelihood estimation to create these values, iterating over different values until it arrives at a set of item difficulties and person abilities that fit the data.

Once you have a set of items that have been thoroughly tested and have item difficulties, you can begin building an item bank. A bank could be all the items on a given test (so the measure itself is the bank), but usually, when we refer to a bank, we're talking about a large pool of items from which to draw for a particular administration. This is how computer adaptive tests work. No person is going to see every single item in the bank.

Maintaining an item bank is a little like being a project manager. You want to be very organized and make sure you include all of the important information about the items. Item difficulty is one of the statistics you'll want to save about the items in your bank. When you administer a computer adaptive test, the difficulty of the item the person receives next is based on whether they got the previous item correct or not. If they got the item right, they get a harder item. If they got it wrong, they get an easier item. They keep going back and forth like this until we've administered enough items to be able to estimate that person's ability.
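
A real CAT draws randomly from a range of suitable items, but the core selection logic can be sketched as picking the unseen item closest in difficulty to the current ability estimate, which is where a Rasch item is most informative (the bank contents and estimates here are hypothetical):

```python
# Hypothetical bank: item ID -> Rasch difficulty in logits.
bank = {"ITEM01": -1.5, "ITEM02": -0.3, "ITEM03": 0.4, "ITEM04": 1.8}

def next_item(ability_estimate, administered):
    """Pick the not-yet-seen item whose difficulty best matches the
    current ability estimate."""
    candidates = {i: d for i, d in bank.items() if i not in administered}
    return min(candidates, key=lambda i: abs(candidates[i] - ability_estimate))

# After a correct answer the ability estimate rises, so a harder item is drawn.
print(next_item(0.0, {"ITEM02"}))            # ITEM03
print(next_item(1.0, {"ITEM02", "ITEM03"}))  # ITEM04
```

A production CAT would also respect the enemy-item and passage-grouping constraints discussed below, and add randomization for item security.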

You want to maintain the text of the item: the stem, the response options, and which option is correct (the key). The bank should also include which topics each item covers - different items will cover different topics. On ability tests, those are the different areas people should know within the domain. An algebra test might have topics like linear equations, quadratic equations, word problems, and so on.

With adaptive tests, you'll want to note which items are enemies of each other. These are items that are so similar (they cover the same topic and may even have the same response options) that you'd never want to administer both in the same test. Not only do enemy items give you no new information about a person's ability to answer that type of question, they may even provide clues about each other that lead someone to the correct answer. This is bad, because the ability estimate based on this performance won't be valid - the ability you get becomes a measure of how savvy the test taker is, rather than their actual ability on the topic of the test.

On the flip side, there might be items that go together, such as a set of items that all deal with the same reading passage. So if an examinee randomly receives a certain passage of text, you'd want to make sure they then receive all items associated with that passage.
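Pulling the last few paragraphs together, a bank record needs to carry the item text, key, topic, difficulty, enemy list, and any passage grouping. Here's one way to sketch such a record; all field names are illustrative assumptions, not a standard from any banking system.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class BankItem:
    item_id: str
    stem: str
    options: list                 # response options
    key: str                      # the correct option
    topic: str                    # e.g., "quadratic equations"
    difficulty: float             # Rasch difficulty, in logits
    enemies: set = field(default_factory=set)  # item_ids never co-administered
    passage_id: Optional[str] = None           # items sharing a passage go together

def violates_enemy_rule(form, candidate):
    """True if candidate is an enemy of any item already assembled on the form."""
    return any(candidate.item_id in item.enemies or item.item_id in candidate.enemies
               for item in form)
```

A form-assembly routine would check `violates_enemy_rule` before adding each item, and would pull every item sharing a `passage_id` as a set rather than individually.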

Your bank might also include items you're testing out. While some test developers will pretest items through a special event, in which examinees receive only new items that don't yet have difficulty estimates, once a test has been developed, new items are usually mixed in with old items to gather item difficulty data, and to calibrate the difficulties of the new items based on difficulties of known items. (I'll talk more about this when I talk about equating.) The new items are pretest items and unscored - you don't want performance on them to affect the person's score. The old items are operational and scored. Some tests put all pretest items in a single section; this is how the GRE does it. Others will mix them in throughout the test.

Over time, items are seen more and more. You'll want to track how long an item has been in the bank and how many times examinees have seen it. You might reach a point where an item has been seen so much that you have to assume it's been compromised - shared between examinees. At that point, you'll retire the item. You'll often keep retired items in the bank, just in case you want to reuse or revisit them later. Of course, if an item tests something that is no longer relevant to practice, it may be removed from the bank completely.
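Exposure tracking like this is simple bookkeeping. A minimal sketch, assuming a program-specific exposure threshold (the limit below is a made-up number; real programs set their own):

```python
# Hypothetical exposure bookkeeping: count administrations per item and
# flag items for retirement once they hit an exposure threshold.

EXPOSURE_LIMIT = 5000  # assumed threshold; varies by testing program

def record_administration(stats, item_id):
    """Increment the seen-count for an item after each administration."""
    stats[item_id] = stats.get(item_id, 0) + 1

def items_to_retire(stats, limit=EXPOSURE_LIMIT):
    """Return the item_ids whose exposure count has reached the limit."""
    return {item_id for item_id, count in stats.items() if count >= limit}
```

In practice you'd also store the date each item entered the bank, so retirement can consider time in circulation as well as raw counts.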

Basically, your bank holds all items that could or should appear on a particular test. If you've ever been a teacher or professor and received supplementary materials with a test book, you may have gotten items that go along with the text. These are item banks. As you look through them, you might notice similarly worded items (enemy items), which topic in the textbook each covers, and so on. Sadly, these types of banks don't tend to have item difficulties attached to them. (Wouldn't that be nice, though? Even if you're not going to Rasch grade your students, you could make sure you have a mix of easy, medium, and hard items. But these banks don't tend to have been pretested, a fact I learned when I applied for a job to write many of these supplementary materials for textbooks.)

If you're in the licensing or credentialing world, you probably also need to track references for each item. And this will usually need to be a very specific reference, down to the page of a textbook or industry document. You may be called upon - possibly in a deposition or court - to prove that a particular item is part of practice, meaning you have to demonstrate where this item comes from and how it is required knowledge for the field. Believe me, people do challenge particular items. In the comments section at the end of a computer adaptive test, people will often reference specific items. It's rare to have to defend an item to someone, but when you do, you want to make certain you have all the information you need in one place.

Most of the examples I've used in this post have been for ability tests, and that's traditionally where we've seen item banks. But there have been some movements in testing to use banking for trait and similar tests. If I'm administering a measure of mobility to a person with a spinal cord injury, I may want to select certain items based on where that person falls on mobility. Spinal cord injury can range in severity, so while one person with a spinal cord injury might be wheelchair-bound, another might be able to walk short distances with the help of a walker (especially if they have a partial injury, meaning some signals are still getting through). You could have an item bank so that these two patients get different items; the person who is entirely wheelchair-bound wouldn't get any items about their ability to walk, while the person with the partial injury would. The computer adaptive test would just need a starting item to figure out approximately where the person falls on the continuum of mobility, then direct them to the questions most relevant to their area of the continuum.

Tomorrow, we'll talk more about the rating scale model of Rasch, and how we look at category function!

Monday, April 1, 2019

A is for Ability

Welcome to April A to Z, where I'll go through the A to Z of Rasch! This is the measurement model I use most frequently at work, and is a great way to develop a variety of measures. It started off in educational measurement, and many Rasch measurement programs are housed in educational psychology and statistics departments. But Rasch is slowly making its way into other disciplines as well, and I hope to see more people using this measurement model. It provides some very powerful and useful statistics on measures and isn't really that difficult to learn.

A few disclaimers:

  1. There are entire courses on Rasch measurement, and in fact, some programs divide Rasch up across several courses because there's much that can be learned on the topic. This blog series is really an introduction to the concepts, to help you get started and decide if Rasch is right for you. I won't get into the math behind it as much, but will try to use some data examples to demonstrate the concepts. 
  2. I try to be very careful about what data I present on the blog. My data examples are usually: data I personally own (that is, collected as part of my masters or doctoral study), publicly available data, or (most often) simulated data. None of the data I present will come from any of the exams I work on at my current position or past positions as a psychometrician. I don't own those data and can't share them. 
  3. Finally, these posts are part of a series in which I build on past posts to introduce new concepts. I won't always be able to say everything I want on a topic for a given post, because it ties into another post better. So keep reading - I'll try to make it clear when I'll get back to something later.

For my first post in the series, ability!

When we use measurement with people, we're trying to learn more about some underlying quality about them. We call this underlying quality we want to measure a "latent variable." It's something that can't be captured directly, but rather through proxies. In psychometrics, those proxies are the items. We go through multiple steps to design and test those items to ensure they get as close as we can to tapping into that latent variable. In science/research terms, the items are how we operationalize the latent variable: define it in a way that it can be measured.

In the Rasch approach to psychometrics, we tend to use the term "ability" rather than "latent variable." First, Rasch was originally designed for educational assessment - tests that had clear right and wrong answers - so it makes sense to frame performance on these tests as ability. Second, Rasch deals with probabilities that a person is able to answer a certain question correctly or endorse a certain answer on a test. So even for measures of traits, like attitudes and beliefs or personality inventories, people with more of a trait have a different ability to respond to different questions. Certain answers are easier for them to give because of that underlying trait.

In Rasch, we calibrate items in terms of difficulty (either how hard the question is or, for trait measures, how much of the trait is needed to respond in a certain way) and people in terms of ability. These two concepts are calibrated on the same scale, so once we have a person's ability, we can immediately determine how they're likely to respond to a given item. That scale is in a unit called a "logit," or a log odds ratio. This conversion gives us a distribution that is nearly linear. (Check out the graphs in the linked log odds ratio post to see what I mean.) Typically, when you look at your distribution of person ability measures, you'll see numbers ranging from positive to negative. And while, theoretically, your logits can range from negative infinity to positive infinity, more likely, you'll see values from -2 to +2 or something like that.

That ability metric is calculated based on the items the person responded to (and especially how they responded) - for exams with right and wrong answers, their ability is the difficulty of the item at which they have a 50% chance of responding correctly. The actual analysis involved in creating these ability estimates uses some form of maximum likelihood estimation (there are different types, like Joint Maximum Likelihood Estimation, JMLE, or Pairwise Maximum Likelihood Estimation, PMLE, to name a couple), so it goes back and forth with different values until it gets estimates that best fit the data. This is both a strength and weakness of Rasch: it's an incredibly powerful analysis technique that makes full use of the data available, and handles missing data beautifully, but it couldn't possibly be done without a computer and capable software. In fact, Rasch has been around almost as long as classical test theory approaches - it just couldn't really be adopted until technology caught up.
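To make the "going back and forth" concrete, here is a simplified sketch of maximum likelihood ability estimation for one person, assuming the item difficulties are already fixed. This is a bare Newton-Raphson loop for illustration, not JMLE, PMLE, or any particular software's algorithm:

```python
import math

def rasch_p(theta, d):
    """Probability of a correct response given ability theta and difficulty d."""
    return 1 / (1 + math.exp(-(theta - d)))

def estimate_ability(responses, difficulties, theta=0.0, iters=25):
    """Newton-Raphson maximum likelihood estimate of ability.

    responses: 1 = correct, 0 = incorrect, one per item.
    Assumes a mixed score: an all-correct or all-wrong response string
    has no finite maximum likelihood estimate.
    """
    for _ in range(iters):
        ps = [rasch_p(theta, d) for d in difficulties]
        gradient = sum(responses) - sum(ps)        # observed minus expected score
        information = sum(p * (1 - p) for p in ps) # test information at theta
        theta += gradient / information            # Newton step
    return theta
```

At convergence, the expected score under the model equals the person's observed score, which is exactly the "best fit" the estimation loop is searching for.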

I'll come back to this concept later this month, but before I close on this post, I want to talk about one more thing: scores. When you administer a Rasch exam and compute person ability scores, you have logits that are very useful for psychometric purposes. But rarely would you ever show those logit scores to someone else, especially an examinee. A score of -2 on a licensing exam or the SAT isn't going to make a lot of sense to an examinee. So we convert those values to a scaled score, using some form of linear equation. That precise equation differs by exam. The SAT, for instance, ranges from 400 to 1600, the ACT from 1 to 36, and the CPA exam from 0 to 99. Each of these organizations has an equation that takes their ability scores and converts them to their scaled score metric. (And in case you're wondering if there's a way to back convert your scaled score on one of these exams to the logit, you'd need to know the slope and constant of the equation they use - not something they're willing to share, nor something you could determine from a single data point. They don't really want you looking under the hood on these types of things.)
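A sketch of that conversion, with a made-up slope and intercept (real programs keep these private, as noted above), including the clamping to the reportable range:

```python
# Hypothetical linear conversion from logits to a reporting scale.
# The slope, intercept, and range here are invented for illustration.

def scale_score(logit, slope=100.0, intercept=1000.0, lo=400, hi=1600):
    raw = slope * logit + intercept
    return max(lo, min(hi, round(raw)))  # clamp to the reportable range

scale_score(0.0)    # 1000
scale_score(-8.0)   # clamped up to 400
scale_score(10.0)   # clamped down to 1600
```

The clamping is why extreme abilities still report at the published minimum or maximum, a point that comes up again below.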

Some questions you may have:

  1. The linear equation mentioned above takes into account the possible range of scaled scores (such as 400 to 1600 on the SAT), as well as the actual abilities of the pilot or pretest sample you used. What if you administer the test (with the set difficulties of the items from pretesting), and you get someone outside of the range of abilities of the sample you used to create the equation? Could you get someone with a score below 400 or above 1600? Yes, you could. This is part of why you want a broad range of abilities in your pilot sample, to make sure you're getting all possible outcomes and get the most accurate equation you can. This is also where simulated data could be used, which we do use in psychometrics. Once you have those item difficulties set with pilot testing, you could then create cases that get, for instance, all questions wrong or all questions right. This would give you the lowest possible logit and the highest possible logit. You can then include those values when setting the equation. As long as the difficulties are based on real performance, it's okay to use simulated cases. Finally, that range of possible scores sets the minimum and maximum for reporting. Even if a person's ability estimate converts to a value below 400, their reported score is automatically set at the minimum.
  2. What about passing and failing for licensing and credentialing exams? How do ability scores figure into that? Shouldn't you just use questions people should know and forget about all this item difficulty and person ability nonsense? This is one way that Rasch and similar measurement models differ from other test approaches. The purpose of a Rasch test is to measure the person's ability, so you need a good range of item difficulties to determine what a person's true ability is. These values are not intended to tell you whether a person should pass or fail - that's a separate part of the psychometric process called standard setting, where a committee of subject matter experts determines how much ability is necessary to say a person has the required knowledge to be licensed or credentialed. This might be why Rasch makes more sense to people when you talk about it in the educational assessment sense: you care about what the person's ability is. But Rasch is just as useful in pass/fail exams, because of the wealth of information it gives you about your exam and items. You just need that extra step of setting a standard to give pass/fail information. And even in those educational assessments, there are generally standards, such as what a person's ability should be at their age or grade level. Rasch measurement gives you a range of abilities in your sample, which you can then use to make determinations about what those abilities should be.

Tomorrow, we'll dig more into item difficulties when we discuss item banks!