Monday, April 15, 2019

J is for Journal of Applied Measurement (and Other Resources)

As with many fields and subfields, Rasch has its own journal - the Journal of Applied Measurement, which publishes a variety of articles either using or describing how to use Rasch measurement. You can read the table of contents for JAM going back to its inaugural issue here.

But JAM isn't the only resource available for Rasch users. First off, JAM Press publishes multiple books on different aspects of Rasch measurement.

But the most useful resource by far is Rasch Measurement Transactions, which goes back to 1987 and is freely available. These are shorter articles dealing with hands-on topics. If I don't know how to do something in Rasch, this is a great place to check. And you can always find articles on a particular topic via Google by adding site:rasch.org to your search.

Finally, there is a Rasch message board, which is still active, where you can post questions (and answer them if you feel so inclined!).

As you can see, I'm a bit behind on A to Z posts. I'll be playing catch up this week!


Wednesday, April 10, 2019

I is for Item Fit

Rasch gives you lots of item-level data: not only difficulties, but also fit indices, for both items and persons. Just as the log-likelihood chi-square statistic tells you how well your data fit the Rasch model overall, item fit indices compare observed responses to the responses expected under the Rasch model. These indices are also based on chi-square statistics.

There are two types of fit indices: INFIT and OUTFIT.

OUTFIT is sensitive to Outliers: responses that fall far outside the targeted ability level, such as a high-ability respondent missing an easy item, or a low-ability respondent getting a difficult item correct. This could reflect a problem with the item - perhaps it's poorly worded and is throwing off people who actually know the information. Or perhaps there's a cue that is leading people to the correct answer who wouldn't otherwise get it right. These statistics can cue you in to problems with the item.

INFIT (Information-weighted) gives more weight to responses on items targeted near the respondent's ability level. Values well below 1.0 flag items whose responses are too predictable - items that don't tell you anything you don't already know from other items. Every item should contribute to the estimate. More items are not necessarily better - this is one way Rasch differs from Classical Test Theory, where adding more items increases reliability. The more items you give a candidate, the greater the risk of fatigue, which will drag reliability (and validity) down. Every item should contribute meaningful, and unique, data. These statistics cue you in on items that might not be necessary.

The expected value for both of these statistics is 1.0. Items that deviate substantially from that value might be problematic. Linacre recommends a cut-off of 2.0, where any item with an INFIT or OUTFIT of 2.0 or greater should be dropped from the measure. Test developers will sometimes adopt their own cut-off values, such as 1.5 or 1.7. If you have a large bank, you can probably afford to be more conservative and drop items above 1.5. If you're developing a brand-new test or measure, you might want to be more lenient and use the 2.0 cut-off. Whatever you do, just be consistent and cite the literature whenever you can to support your selected cut-off.
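If you're curious about the arithmetic behind these indices, here's a minimal base-R sketch. Winsteps computes all of this for you; the matrices `obs` and `expected` are hypothetical stand-ins for your observed 0/1 responses and the Rasch-model probabilities of a correct response.

```r
# Minimal sketch of item INFIT/OUTFIT mean-squares, assuming 'obs' is a
# persons-by-items matrix of 0/1 responses and 'expected' is a matrix of
# Rasch-model probabilities of a correct response (same dimensions).
item_fit <- function(obs, expected, cutoff = 2.0) {
  resid    <- obs - expected             # raw residuals
  variance <- expected * (1 - expected)  # model variance of each response
  z2       <- resid^2 / variance         # squared standardized residuals

  # OUTFIT: unweighted mean-square -- every residual counts equally, so
  # unexpected responses far from a person's ability level dominate.
  outfit <- colMeans(z2, na.rm = TRUE)

  # INFIT: information-weighted mean-square -- residuals are weighted by
  # the model variance, emphasizing well-targeted items.
  infit <- colSums(resid^2, na.rm = TRUE) / colSums(variance, na.rm = TRUE)

  data.frame(item = seq_len(ncol(obs)), infit = infit, outfit = outfit,
             flag = infit >= cutoff | outfit >= cutoff)
}
```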

Though this post is about item fit, these same statistics also exist for each person in your dataset. A misfitting person means the measure is not functioning the same for them as it does for others. This could mean the candidate got lazy and just responded at random. Or it could mean the measure isn't valid for them for some reason. (Or it could just be chance.) Many Rasch purists see no issue with dropping people who don't fit the model, but as I've discovered when writing up the results of Rasch analysis for publication, reviewers don't take kindly to dropping people unless you have other evidence to support it. (And since Rasch is still not a well-known approach, they mean evidence outside of Rasch analysis, like failing a manipulation check.)

The best approach I've seen is once again recommended by Linacre: persons with very high OUTFIT statistics are removed and ability estimates from the smaller sample are cross-plotted against the estimates from the full sample. If removal of these persons has little effect on the final estimates, these persons can be retained, because they don't appear to have any impact on the results. That is, they're not driving the results. 

If there is a difference, Linacre recommends next examining persons with smaller (but still greater than 2.0) OUTFIT statistics and cross-plotting again. Though there is little guidance on how to define "very high" and "high," in my research I frequently use an OUTFIT of 3.0 for "very high" and 2.0 for "high." In my experience, such sensitivity analyses have never shown a problem, and I've been able to justify keeping everyone in the sample. This seems to make both reviewers and Rasch purists happy.
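In practice, this check is just a scatterplot. Here's a quick base-R sketch, assuming `theta_full` and `theta_trimmed` are hypothetical named vectors of person ability estimates from the full run and from the run with high-OUTFIT persons removed (e.g., person measures exported from your software).

```r
# Cross-plot of ability estimates before and after removing misfitting persons.
common <- intersect(names(theta_full), names(theta_trimmed))

plot(theta_full[common], theta_trimmed[common],
     xlab = "Ability estimate, full sample",
     ylab = "Ability estimate, high-OUTFIT persons removed")
abline(0, 1, lty = 2)  # identity line: points hugging it = removal had little effect

# How far did individual estimates move?
summary(abs(theta_full[common] - theta_trimmed[common]))
```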


Tuesday, April 9, 2019

H is for How to Set Up Your Data File

The exact way you set up your data depends, of course, on the software you use. But my focus today is on what to think about when setting up your data for Rasch analysis.

First, know how your software needs missing values formatted. Many programs will let you simply leave a blank space or cell. Winsteps is fine with a blank space to denote a missing value or skipped question. Facets, on the other hand, will flip out at a blank space and needs a missing-value code (I usually use 9).
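For example, here's a hypothetical base-R snippet for recoding blanks before handing a file to Facets (the file name and column setup are assumptions, not a real export):

```r
# Recode blank or NA responses to a numeric missing-value code (9) for Facets.
# 'responses' is assumed to be a data frame containing only item columns.
responses <- read.csv("raw_export.csv", stringsAsFactors = FALSE)
responses[responses == ""] <- NA   # treat blank cells as missing
responses[is.na(responses)] <- 9   # Facets-friendly missing-value code
write.csv(responses, "responses_for_facets.csv", row.names = FALSE)
```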

Second, the ordering of the file is very important, especially if you're working with data from a computer adaptive test - which also means handling missing values is important. When someone takes a computer adaptive test, their first item is drawn at random from a set of moderately difficult items. The difficulty of the next item depends on how they did on the first item, but even so, the item is randomly drawn from a set or range of items. So when you set up your data file, you need to be certain that all people who responded to a specific item have that response in the same column (not necessarily the position at which the item was administered in the exam).

This is why you need to be meticulously organized with your item bank and give each item an identifier. When you assemble responses from computer adaptive tests, you'll need to reorder people's responses. That is, you set up an order for every item in the bank by identifier; when the data are compiled, each person's responses are placed in that order, and any item that wasn't administered to them gets a blank or missing value.
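Here's a rough base-R sketch of that reordering, assuming a long-format export with one row per administered item (the file name and column names are hypothetical):

```r
# Turn long-format CAT results into a person-by-item matrix in bank order.
long <- read.csv("cat_results_long.csv")  # columns: person_id, item_id, response

bank_order <- sort(unique(long$item_id))  # one fixed ordering for the whole bank
persons    <- unique(long$person_id)

# Start with an all-missing matrix, then fill in the responses we have;
# anything never administered to a person stays NA (or recode to 9 for Facets).
wide <- matrix(NA, nrow = length(persons), ncol = length(bank_order),
               dimnames = list(persons, bank_order))
wide[cbind(match(long$person_id, persons),
           match(long$item_id, bank_order))] <- long$response
```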

Third, be sure you differentiate between item variables and other variables, like person identifiers, demographics, and so on. Once again, know your software. You may find that a piece of software just runs an entire dataset as though all variables are items, meaning you'll get weird results if you have a demographic variable mixed in. Others might let you select certain variables for the analysis and/or categorize variables as items and non-items.

I tend to keep a version of my item set in Excel, with a single column at the beginning for the participant ID number. Excel is really easy to import into most software, and I can simply delete the first column if a particular program doesn't allow non-item variables. If I drop any items (which I'll talk more about tomorrow), I do it from this dataset. A larger dataset, with all items, demographic variables, and so on, is usually kept in SPSS, since that's the preferred software at my company (I'm primarily an R user, but I'm the only one, and R can read SPSS files directly), in case I ever need to pull in any additional variables for group comparisons. That dataset is essentially the master, and any smaller files I need are built from it.


Monday, April 8, 2019

G is for Global Fit Statistics

One of the challenges of Blogging A to Z is that the posts have to go in alphabetical order, even if it would make sense to start with a topic from the middle of the alphabet. It's more like creating the pieces of a puzzle that could (and maybe should) be put together in a different order than the one in which they were created. But like I said, it's a challenge!

So now that we're on letter G, it's time to talk about a topic I probably would have started with otherwise: what exactly is the Rasch measurement model? Yes, it is a model that allows you to order people according to their abilities (see letter A) and items according to their difficulties. But more than that, it's a prescriptive model of measurement - it contains within it the mathematical properties of a good measure (or at least, one definition of a good measure). This is how Rasch differs from Item Response Theory (IRT) models, though Rasch is often grouped into IRT despite its differences. You see, mathematically, Rasch is not very different from the IRT 1-parameter model, which focuses on item difficulty (and by extension, person ability). But philosophically, it is very different, because while Rasch is prescriptive, IRT is descriptive. If an IRT 1-parameter model doesn't adequately describe the data, you could just select a different IRT model. But Rasch says that the data must fit its model, and it gives you statistics to tell you how well they do. If your data don't fit the model, the deficiency is with the data (and your measure), not the model itself.

Note: This is outside the scope of this blog series, but in IRT, the second parameter is item discrimination (how well the item differentiates between high- and low-ability candidates) and the third is the pseudo-guessing parameter (the likelihood you'd get an answer correct based on chance alone). The Rasch model assumes that the discrimination for all items is 1.0 and does not make any correction for potential guessing. You know how the SAT used to penalize you for wrong answers? It was to discourage guessing. They didn't want you answering a question if you didn't know the answer; a lucky guess is not a valid measure. What can I say, man? We psychometricians are a**holes.
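To put the contrast in concrete terms, the dichotomous Rasch model boils down to a single logistic function of the difference between person ability and item difficulty (both in logits), with discrimination fixed at 1 and no guessing parameter. A tiny base-R illustration:

```r
# Probability of a correct response under the dichotomous Rasch model.
rasch_prob <- function(theta, b) {
  exp(theta - b) / (1 + exp(theta - b))
}

rasch_prob(theta = 0, b = 0)   # ability matches difficulty: 0.50
rasch_prob(theta = 1, b = 0)   # one logit above the item:   ~0.73
rasch_prob(theta = -1, b = 0)  # one logit below the item:   ~0.27
```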

When you use Rasch, you're saying that you have a good measure when it gives you data that fits the Rasch model. Poor fit to the Rasch model means you need to rework your measure - perhaps dropping items, collapsing response scales, or noting inconsistencies in person scores that mean their data might not be valid (and could be dropped from the analysis).

For Blogging A to Z 2017, I went through the alphabet of statistics, and for the letter G, I talked about goodness of fit. In Rasch, we look at our global fit statistics to see how well our data fit the prescribed Rasch model. If our data don't fit, we start looking at why and retooling our measure so it does.

The primary global fit statistic to look at is the log-likelihood chi-square, which, as the name implies, is based on the chi-square distribution. A significant chi-square statistic in this case means the data significantly differ from the model. Just like in structural equation modeling, it is a measure of absolute fit.

There are other fit statistics you can look at, such as the Akaike Information Criterion (AIC) and Schwarz Bayesian Information Criterion (BIC). These statistics are used for model comparison (relative fit), where you might test out different Rasch approaches to see which best describes the data (such as a Rating Scale Model versus a Partial Credit Model) or see whether changes to the measure (like dropping items) result in better fit. These values are derived from the log-likelihood statistic and the model degrees of freedom, with the BIC also factoring in the number of non-extreme cases (in Rasch, extreme cases are those that got every item right or every item wrong). (You can find details and the formulas for AIC and BIC here.)

BIC seems to be the preferred metric here, since it also accounts for the number of non-extreme cases; a measure with lots of extreme cases is not as informative as a measure with few extreme cases, so this metric can help you determine whether dropping too-easy or too-difficult items improves your measure (it probably will, but this lets you quantify it).
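As a rough sketch, the usual formulas can be computed directly from the log-likelihood chi-square your software reports; the numbers below are invented, just to show the comparison.

```r
# AIC = chi-square + 2k; BIC = chi-square + k * ln(n), where k is the number
# of estimated parameters (model degrees of freedom) and n is taken here,
# following the post, as the number of non-extreme persons.
rasch_ic <- function(chi_square, k, n_nonextreme) {
  c(AIC = chi_square + 2 * k,
    BIC = chi_square + k * log(n_nonextreme))
}

# Hypothetical comparison of a Rating Scale vs. a Partial Credit model fit to
# the same data: lower values indicate better relative fit.
rasch_ic(chi_square = 4215.3, k = 60, n_nonextreme = 480)
rasch_ic(chi_square = 4101.8, k = 96, n_nonextreme = 480)
```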

Tomorrow, I'll talk about setting up your data for Rasch analysis.

Sunday, April 7, 2019

F is for Facets

So far this month, I've talked about the different things that affect the outcome of measurement - that is, that determine how someone will respond. Those things so far have been item difficulty and person ability: how much ability a person has, or how much of the trait they possess, affects how they respond to items of varying difficulty. Each of the things that interact to affect the outcome of Rasch measurement is called a "facet."

But these don't have to be the only facets in a Rasch analysis. You can have additional facets, making for a more complex model. In our content validation studies, we administer a job analysis survey asking people to rate different job-related tasks. As is becoming standard in this industry, we use two different scales for each item, one rating how frequently this task is performed (so we can place more weight on the more frequently performed items) and one rating how critical it is to perform this task competently to protect the public (so we can place more weight on the highly critical items). In this model, we can differentiate between the two scales and see how the scale used changes how people respond. This means that rating scale also becomes a facet, one with two levels: frequency scale and criticality scale.

When fitting a more complex model like this, we need software that can handle these complexities. The people who brought us Winsteps, the software I use primarily for Rasch analysis, also have a program called Facets, which can handle these more complex models.

In a previous blog post, I talked about a facets model I was working with, one with four facets: person ability, item difficulty, rating scale, and timing (whether they received the frequency scale first or second). But one could use a facets model for other types of data, like judge rating data. The great thing about using facets to examine judge data is that one can also partial out concepts like judge leniency; that is, some judges go "easier" on people than others, and a facets model lets you model that leniency. You would just need to have your judges rate more than one person in the set and have some overlap with other judges, similar to the overlap I introduced in the equating post.
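Conceptually, adding a facet just adds another term to the log-odds. Here's a hedged sketch of the idea for a judged, dichotomous outcome (the values are invented, and real facets models usually involve rating scales rather than simple pass/fail):

```r
# In a facets model with judges, the log-odds of success is person ability
# minus item (or task) difficulty minus judge severity, so a harsher judge
# lowers everyone's chances by the same amount.
facets_prob <- function(theta, difficulty, severity) {
  logit <- theta - difficulty - severity
  exp(logit) / (1 + exp(logit))
}

facets_prob(theta = 1.0, difficulty = 0.5, severity = -0.3)  # lenient judge
facets_prob(theta = 1.0, difficulty = 0.5, severity =  0.4)  # severe judge
```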

This is the thing I love about Rasch measurement: it's a unique approach that can expand in complexity to fit whatever measurement situation you're presented with. It's all based on the Rasch measurement model, a mathematical model that represents the characteristics a "good measure" should possess - and that's what we'll talk about tomorrow when we examine global fit statistics!


Friday, April 5, 2019

E is for Equating

Over the life of any exam, new items and even new forms have to be written. The knowledge base changes and new topics become essential. How do we make sure these new items contribute to the overall test score? Through a process called equating.

Typically in Rasch, we equate test items by sprinkling new items in with old items. When we run our Rasch analysis on items and people, we "anchor" the old items to their already established item difficulties. When item difficulties are calculated for the new items, they are now on the exact same metric as the previous items, and new difficulties are established relative to the ones that have already been set through pretesting.

It's not even necessary for everyone to receive all of the pretest items - or even all of the old items. You just need enough overlap to create links between old and new items. In fact, when you run data from a computer adaptive test, there are a lot of "holes" in the data, creating a sparse matrix.
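Here's a toy illustration of what that sparse matrix can look like (invented responses; rows are examinees, columns are items in bank order, NA means the item was never administered to that person):

```r
# Overlapping items (here ITM3 and ITM4) provide the links that let old and
# new items be placed on the same metric.
responses <- rbind(
  P1 = c( 1,  0,  1, NA, NA, NA),
  P2 = c(NA,  1,  1,  0, NA, NA),
  P3 = c(NA, NA,  1,  1,  0, NA),
  P4 = c(NA, NA, NA,  1,  1,  0)
)
colnames(responses) <- paste0("ITM", 1:6)
responses
```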


In the example above, few examinees received the exact same combination of items, but with an entire dataset that looks like this (and more examinees, of course), we could estimate item difficulties for new items in the set.

But you may also ask about the "old" items - do we always stick with the same difficulties or do we change those from time to time? Whenever you anchor items to existing item difficulties, the program will still estimate fresh item difficulties and let you examine something called "displacement": how much the newly estimated difficulty differs from the anchored one. You want to look at these and make sure you're not experiencing what's called "item drift," which happens when an item becomes easier or harder over time.
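A quick sketch of that check, assuming `anchored` and `freely_estimated` are hypothetical named vectors of the same items' difficulties; the half-logit screen below is just one common rule of thumb, not a hard rule.

```r
# Displacement = freely estimated difficulty minus the anchored value.
displacement <- freely_estimated - anchored

data.frame(item           = names(anchored),
           anchored       = anchored,
           current        = freely_estimated,
           displacement   = displacement,
           possible_drift = abs(displacement) > 0.5)  # flag ~half a logit or more
```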

This definitely happens. Knowledge that might have previously been considered difficult can become easier over time. Math is a great example. Many of the math classes my mom took in high school and college (and that may have been electives) were offered to me in middle school or junior high (and were mandatory). Advances in teaching, as well as better understanding of how these concepts work, can make certain concepts easier.

On the flipside, some items could get more difficult over time. My mom was required to take Latin in her college studies. I know a little "church Latin" but little else. Items about Latin would have been easier for my mom's generation, when the language was still taught, than for mine. And if we could hop in a time machine and go back to when scientific writing was in Latin (and therefore, to be a scientist, you had to speak and write Latin fluently), these items would be even easier still.

Essentially, even though Rasch offers an objective method of measurement and a way to quantify item difficulty, difficulty is still a relative concept and can change over time. Equating is just one part of the regular upkeep you must do on your items to maintain a test or measure.

Thursday, April 4, 2019

D is for Dimensionality

We're now 4 posts into Blogging A to Z, and I haven't really talked much about the assumptions of Rasch. Like any statistical model, Rasch has certain assumptions that must be met in order for the results to be valid. One of the key assumptions is that the items all measure the same thing - the same latent variable. Put another way, your measure should be unidimensional: it should assess only one dimension.

This is because, in order for the resulting item difficulties and person abilities to be valid - and comparable to each other - they have to be assessing the same thing. It wouldn't make sense to compare one person's math test score to a reading test score. And it wouldn't make sense to combine math and reading items into the same test; what exactly does the exam measure?

But the assumption of unidimensionality goes even further than that. You also want to be careful that each individual item is only measuring one thing. This is harder than it sounds. While a certain reading ability is needed for just about any test question, and some items will need to use jargon, a math item written at an unnecessarily high reading level is actually measuring two dimensions: math ability and reading ability. The same is true for poorly written items that give clues to the correct answer, or trick questions that trip people up even when they have the knowledge: your test then measures not only ability on the test topic, but a second ability - test savviness. The only thing that should impact whether a person gets an item correct is their ability level in that domain; that is an assumption we have to make for a measurement to be valid. If a person with high math ability gets an item incorrect because of low reading ability, we've violated that assumption. And if a person with low ability gets an item correct because there was an "all of the above" option, we've violated that assumption.

How do we assess dimensionality? One way is with principal components analysis (PCA), which I've blogged about before. As a demonstration, I combined two separate measures (the Satisfaction with Life Scale and the Center for Epidemiologic Studies Depression measure) and ran them through a Rasch analysis as though they were a single measure. The results couldn't have been more perfect if I tried. The PCA results showed two clear factors - one made up of the 5 SWLS items and one made up of the 16 CESD items. Here's some of the output Winsteps gave me for the PCA. The first table shows explained variance and eigenvalues:


There are two things to look at when running a PCA for a Rasch measure: (1) the variance explained by the measures should be at least 40%, and (2) the eigenvalue for the first contrast should be less than 2. As you can see, the eigenvalue right next to the text "Unexplned variance in 1st contrast" is 7.76, while the values for the 2nd contrast (and on) are less than 2. That means we have two dimensions in the data - and those dimensions are separated out when we scroll down to look at the results of the first contrast. Here are those results in visual form - the letters in the plot refer to the items.


You're looking for two clusters of items, which we have in the upper left-hand corner and the lower right-hand corner. The next part identifies which items the letters refer to and gives the factor loadings:


The 5 items on the left side of the table are from the SWLS. The 16 on the right are all from the CESD. Like I said, perfect results.
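Winsteps does this analysis for you, but conceptually it's a principal components analysis of the standardized residuals - what's left over after the model's single dimension has been removed. Here's a rough base-R sketch, assuming complete data and hypothetical matrices `obs` (0/1 responses) and `expected` (Rasch-model probabilities); the scaling won't match Winsteps output exactly, but the pattern of contrasts is the same idea.

```r
# PCA of standardized Rasch residuals (complete data assumed).
std_resid <- (obs - expected) / sqrt(expected * (1 - expected))

pca <- prcomp(std_resid, center = TRUE, scale. = FALSE)

eigenvalues <- pca$sdev^2        # size of each contrast
head(round(eigenvalues, 2))      # a large first value suggests a second dimension

round(pca$rotation[, 1], 2)      # item loadings on the first contrast:
                                 # look for two opposing clusters of items
```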

But remember that PCA is entirely data-driven, so it could identify additional factors that don't have any meaning beyond that specific sample. So the PCA shouldn't be the only piece of evidence for unidimensionality, and there might be cases where you forgo that analysis altogether. If you've conducted a content validation study, that can be considered evidence that these items all belong on the same measure and assess the same thing. That's because content validation studies are often based on expert feedback, surveys of people who work in the topic area being assessed, job descriptions, and textbooks. Combining and cross-validating these different sources can be much stronger evidence than an analysis that is entirely data dependent.

But the other question is, how do we ensure all of our items are unidimensional? Have clear rules for how items should be written. Avoid "all of the above" or "none of the above" as answers - they're freebie points, because if they're an option, they're usually the right one. And if you just want to give away freebie points, why even have a test at all? Also make clear rules about what terms and jargon can be used in the test, and run readability analyses on your items - ignoring those required terms and jargon, try to make the reading level of everything else as consistent across the exam as possible (and as low as you can get away with). When I was working on creating trait measures in past research, the guidelines from our IRB were usually that measures should be written at a 6th-8th grade reading level. And unless your measure is itself a measure of reading level, that's probably a good guideline to follow.

Tomorrow, we'll talk about equating test items!