Friday, May 24, 2019

I'm More Sad About This Show Ending than Game of Thrones

Like many, I eagerly waited to see how the game of thrones would end. I tore through the books available at the time shortly before the first season of Game of Thrones aired, and I look forward to reading how George R.R. Martin himself will write the ending of the story.

And like many, I was disappointed in the turns taken by Game of Thrones that felt inauthentic to the characters. In particular, this was a show that failed many of its female characters. They took Brienne, who we watched grow into a strong, independent, and honorable knight, and reduced her to Carrie F***ing Bradshaw. They justified the horrible things that had happened to Sansa as character-building. (No one can make you be someone you're not. Sansa, the strength was inside you all the time. Littlefinger and Ramsay don't get credit for that. If anyone does, it's the strong women in your life, like Brienne and Arya.)

But while I'm disappointed in how the show ended, and a little sad that it's gone, I'm honestly more sad that this show is over:

Who would have guessed that a musical comedy TV show would take on some very important issues with such authenticity? Here are just a few of them (some spoilers ahead, so read on only if you've watched the show or don't care about being spoiled):

Women's Issues
Just as a short list, this show tackled periods, abortion, women's sexuality, motherhood, and body image in a way that never felt cheap, judgmental, or cliché. It was the first network show to use the word "clitoris." The relationships between the women on the show felt real and the conversations were about more than simply the men in their lives. It didn't glamorize women's bodies - in fact, it pulled back the curtain on many issues related to women's appearance and projection of themselves to the world.

Men's Issues
The show didn't just represent women authentically - the men were fully realized characters too, and never props or plot devices. Crazy Ex-Girlfriend explored men's relationships, fatherhood, and toxic masculinity and how it affects men.

Mental Health
I could probably write an entire blog post just on how this show represents mental health issues. The main character, Rebecca Bunch, is diagnosed with borderline personality disorder in season 3. And in fact, the show was building up to and establishing that diagnosis from the very beginning. The show constantly made us rethink the word "crazy" and helped to normalize many mental health issues - and when I say normalize, I mean show us that these issues are common and experienced by many people, while still encouraging those struggling with mental health issues to seek help.

The show also tackled issues like low self-esteem, self-hatred, suicide, and alcoholism, without ever glamorizing them. Instead, it encouraged us to take better care of ourselves, and recognize when we have a problem we can't handle ourselves.

When bisexuals show up in other movies or TV shows, they're often portrayed as promiscuous - people who are bi because they want to have sex with everyone. Either that, or bisexuality, especially among men, is portrayed as a cover for someone who is actually gay but not comfortable with coming fully out of the closet. Not Crazy Ex-Girlfriend.

Race and Ethnicity
This show has a diverse cast. And unlike many shows with "diversity," none of the characters are tokens. In fact, race and ethnicity aren't referenced so much as heritage. Further, the show pokes fun at the token concept. One great episode deals with Heather's ethnicity. Her boss, Kevin, encourages her to join a management training program because she is "diverse." Later, he gives her a gift to apologize for his insensitivity: a sari, because he assumes she is Indian. She corrects him; her father is African-American and her mother is White. The extra layer here is that the actress who plays Heather, Vella Lovell, has been mistakenly called Indian in the media, when she, like her character, is African-American and White. So this episode not only makes fun of the concept of the token, it also makes fun of the media trying so hard to ascertain and define an actor by her race.

Crazy Ex-Girlfriend, I'm really going to miss you.

Wednesday, May 22, 2019

New Color Palette for R

As I was preparing some graphics for a presentation recently, I started digging into some of the different color palette options. My motivation was entirely about creating graphics that weren't too visually overwhelming, which I found the default "rainbow" palette to be.

But as the creators of the viridis R package point out, we also need to think about how people with colorblindness might struggle with understanding graphics. If you create figures in R, I highly recommend checking it out at the link above!

Monday, April 15, 2019

J is for Journal of Applied Measurement (and Other Resources)

As with many fields and subfields, Rasch has its own journal - the Journal of Applied Measurement, which publishes a variety of articles either using or describing how to use Rasch measurement. You can read the table of contents for JAM going back to its inaugural issue here.

But JAM isn't the only resource available for Rasch users. First off, JAM Press publishes multiple books on different aspects of Rasch measurement.

But the most useful resource by far is Rasch Measurement Transactions, which goes back to 1987 and is freely available. These are shorter articles dealing with hands-on topics. If I don't know how to do something regarding Rasch, this is a great place to check. And you can always find those articles on a certain topic via Google, by setting the site to search as "".

Finally, there is a Rasch message board, which is still active, where you can post questions (and answer them if you feel so inclined!).

As you can see, I'm a bit behind on A to Z posts. I'll be playing catch up this week!

Wednesday, April 10, 2019

I is for Item Fit

Rasch gives you lots of item-level data. In addition to difficulties, Rasch analysis produces fit indices for both items and persons. Just like the log-likelihood chi-square statistic that tells you how well your data fit the Rasch model, item fit indices compare observed responses to the responses expected under the Rasch model. These indices are also based on chi-square statistics.

There are two types of fit indices: INFIT and OUTFIT.

OUTFIT is sensitive to Outliers. These are unexpected responses to items that fall well outside the respondent's ability level, such as a high ability respondent missing an easy item, or a low ability respondent getting a difficult item correct. This could reflect a problem with the item - perhaps it's poorly worded and is throwing off people who actually know the information. Or perhaps there's a cue that is leading people to the correct answer who wouldn't otherwise get it right. These statistics can cue you in to problems with the item.

INFIT (Information weighted) is sensitive to responses that are too predictable. These items don't tell you anything you don't already know from other items. Every item should contribute to the estimate. More items is not necessarily better - this is one way Rasch differs from Classical Test Theory, where adding more items increases reliability. The more items you give a candidate, the greater your risk of fatigue, which will lead reliability (and validity) to go down. Every item should contribute meaningful, and unique, data. These statistics cue you in on items that might not be necessary.

The expected value for both of these statistics is 1.0. Any items that deviate from that value might be problematic. Linacre recommends a cut-off of 2.0, where any items that have an INFIT or OUTFIT of 2.0 or greater should be dropped from the measure. Test developers will sometimes adopt their own cut-off values, such as 1.5 or 1.7. If you have a large bank, you can probably afford to be more conservative and drop items above 1.5. If you're developing a brand new test or measure, you might want to be more lenient and use the 2.0 cut-off. Whatever you do, just be consistent and cite the literature whenever you can to support your selected cut-off.
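To make the two statistics concrete, here is a minimal sketch of how INFIT and OUTFIT mean squares can be computed for a single right/wrong item, using the standard mean-square definitions for the dichotomous Rasch model. It assumes you already have person ability estimates and an item difficulty in logits from your Rasch software. The function names and the example numbers are made up for illustration and are not taken from Winsteps or any other program.

```python
import math

def rasch_probability(ability, difficulty):
    """Probability of a correct response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def item_fit(responses, abilities, difficulty):
    """Return (infit, outfit) mean squares for one item.

    responses: 0/1 scores, with None where the item was not administered
    abilities: person ability estimates (logits), in the same order
    """
    sq_resid_sum = 0.0  # sum of squared residuals (observed - expected)
    variance_sum = 0.0  # sum of model variances, i.e. the information
    z_sq_sum = 0.0      # sum of squared standardized residuals
    n = 0
    for x, theta in zip(responses, abilities):
        if x is None:
            continue  # skip people who never saw the item
        p = rasch_probability(theta, difficulty)
        variance = p * (1.0 - p)
        sq_resid = (x - p) ** 2
        sq_resid_sum += sq_resid
        variance_sum += variance
        z_sq_sum += sq_resid / variance
        n += 1
    infit = sq_resid_sum / variance_sum  # information-weighted mean square
    outfit = z_sq_sum / n                # unweighted mean square
    return infit, outfit

def flag_items(fit_by_item, cutoff=2.0):
    """List items whose INFIT or OUTFIT meets or exceeds the chosen cut-off."""
    return sorted(
        item for item, (infit, outfit) in fit_by_item.items()
        if infit >= cutoff or outfit >= cutoff
    )

# Hypothetical data: five people, two items that both have difficulty 0.
abilities = [-2.0, -1.0, 0.0, 1.0, 2.0]
fit_by_item = {
    "orderly": item_fit([0, 0, 1, 1, 1], abilities, 0.0),
    "reversed": item_fit([1, 1, 0, 0, 0], abilities, 0.0),
}
print(flag_items(fit_by_item, cutoff=2.0))
```

Because OUTFIT is unweighted, a single very unexpected response from someone far from the item's difficulty moves it more than it moves INFIT. The cut-off is a parameter here so you can apply whichever value you have chosen and cited.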

Though this post is about item fit, these same statistics also exist for each person in your dataset. A misfitting person means the measure is not functioning the same for them as it does for others. This could mean the candidate got lazy and just responded at random. Or it could mean the measure isn't valid for them for some reason. (Or it could just be chance.) Many Rasch purists see no issue with dropping people who don't fit the model, but as I've discovered when writing up the results of Rasch analysis for publication, reviewers don't take kindly to dropping people unless you have other evidence to support it. (And since Rasch is still not a well-known approach, they mean evidence outside of Rasch analysis, like failing a manipulation check.)

The best approach I've seen is once again recommended by Linacre: persons with very high OUTFIT statistics are removed and ability estimates from the smaller sample are cross-plotted against the estimates from the full sample. If removal of these persons has little effect on the final estimates, these persons can be retained, because they don't appear to have any impact on the results. That is, they're not driving the results. 

If there is a difference, Linacre recommends next examining persons with smaller (but still greater than 2.0) OUTFIT statistics and cross-plotting again. Though there is little guidance on how to define very high and high, in my research, I frequently use an OUTFIT of 3.0 for ‘very high’ and 2.0 for ‘high.’ In my experience, the results of such sensitivity analyses never show any problem, and I'm able to justify keeping everyone in the sample. This seems to make both reviewers and Rasch purists happy.

Tuesday, April 9, 2019

H is for How to Set Up Your Data File

The exact way you set up your data of course depends on the software you use. But my focus today is to give you things to think about when setting up your data for Rasch analysis.

First, know how your software needs you to format missing values. Many programs will let you simply leave a blank space or cell. Winsteps is fine with a blank space to denote a missing value or skipped question. Facets, on the other hand, will flip out at a blank space and needs a missing value code set up (usually I use 9).

Second, the ordering of the file is very important, especially if you're working with data from a computer adaptive test, which is also where missing values come into play. When someone takes a computer adaptive test, their first item is drawn at random from a set of moderately difficult items. The difficulty of the next item depends on how they did on the first item, but even so, the item is randomly drawn from a set or range of items. So when you set up your data file, you need to be certain that all people who responded to a specific item have that response in the same column (not necessarily the position in which the item was administered in the exam).

This is why you need to be meticulously organized with your item bank and give each item an identifier. When you assemble responses for computer adaptive tests, you'll need to reorder people's responses. That is, you'll set up an order for every item in the bank by identifier. When data are compiled, each person's responses are put in that order, and if a particular item in the bank wasn't administered, there would be a space or missing value there.
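As a rough sketch of that reordering step, the snippet below takes each person's responses in the order the items were administered and lays them out in one fixed column order defined by the item bank. The item identifiers, person IDs, and function name are hypothetical, and the missing code of 9 simply follows the convention mentioned above; use whatever your software expects.

```python
MISSING = "9"  # assumed missing-value code; change to " " or whatever your software needs

def build_response_matrix(bank_order, administrations, missing=MISSING):
    """Put every person's responses into the same item-by-item column order.

    bank_order: list of item identifiers defining the column order
    administrations: {person_id: [(item_id, response), ...]} in the order
        the items were actually administered
    """
    known_items = set(bank_order)
    rows = []
    for person_id, answered in administrations.items():
        by_item = dict(answered)
        unknown = set(by_item) - known_items
        if unknown:
            # An identifier that isn't in the bank usually means a typo or a stale bank.
            raise ValueError(f"{person_id} has items not in the bank: {sorted(unknown)}")
        # One column per bank item; items never administered get the missing code.
        row = [person_id] + [str(by_item.get(item_id, missing)) for item_id in bank_order]
        rows.append(row)
    return rows

# Hypothetical bank of four items and two test-takers who saw different items.
bank_order = ["A01", "A02", "A03", "A04"]
administrations = {
    "P1": [("A03", 1), ("A01", 0)],
    "P2": [("A02", 1), ("A03", 0), ("A04", 1)],
}
for row in build_response_matrix(bank_order, administrations):
    print(",".join(row))
```

Keeping the person ID as the first column makes it easy to drop or ignore that column for software that treats every variable as an item.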

Third, be sure you differentiate between item variables and other variables, like person identifiers, demographics, and so on. Once again, know your software. You may find that a piece of software just runs an entire dataset as though all variables are items, meaning you'll get weird results if you have a demographic variable mixed in. Others might let you select certain variables for the analysis and/or categorize variables as items and non-items.

I tend to keep a version of my item set in Excel, with a single variable at the beginning with participant ID number. Excel is really easy to import into most software, and I can simply delete the first column if a particular program doesn't allow non-item variables. If I drop any items (which I'll talk more about tomorrow), I do it from this dataset. A larger dataset, with all items, demographic variables, and so on is kept usually in SPSS, since that's the preferred software at my company (I'm primarily an R user, but I'm the only one and R can read SPSS files directly) in case I ever need to pull in any additional variables for group comparisons. This dataset is essentially the master and any smaller files I need are built from it.

Monday, April 8, 2019

G is for Global Fit Statistics

One of the challenges of Blogging A to Z is that the posts have to go in alphabetical order, even if it would make sense to start with a topic from the middle of the alphabet. It's more like creating the pieces of a puzzle that could (and maybe should) be put together in a different order than they were created. But like I said, it's a challenge!

So now that we're on letter G, it's time to talk about a topic that I probably would have started with otherwise: what exactly is the Rasch measurement model? Yes, it is a model that allows you to order people according to their abilities (see letter A) and items according to their difficulties. But more than that, it's a prescriptive model of measurement - it contains within it the mathematical properties of a good measure (or at least, one definition of a good measure). This is how Rasch differs from Item Response Theory (IRT) models, though Rasch is often grouped into IRT despite its differences. You see, mathematically, Rasch is not very different from the IRT 1-parameter model, which focuses on item difficulty (and by extension, person ability). But philosophically, it is very different, because while Rasch is prescriptive, IRT is descriptive. If an IRT 1-parameter model doesn't adequately describe the data, you could just select a different IRT model. But Rasch says that the data must fit its model, and it gives you statistics to tell you how well it does. If your data don't fit the model, the deficiency is with the data (and your measure), not the model itself.

Note: This is outside of the scope of this blog series, but in IRT, the second parameter is item discrimination (how well the item differentiates between high and low ability candidates) and the third is the pseudo-guessing parameter (the likelihood you'd get an answer correct based on chance alone). The Rasch model assumes that the item discrimination for all items is 1.0 and does not make any corrections for potential guessing. You know how the SAT penalizes you for wrong answers? It's to discourage guessing. They don't want you answering a question if you don't know the answer; a lucky guess is not a valid measure. What can I say, man? We psychometricians are a**holes.
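If it helps to see how the parameters relate, here is a small sketch of the logistic item response function, using the common textbook form of the 3-parameter model. With discrimination fixed at 1.0 and guessing fixed at 0, it reduces to the Rasch form. The function and argument names are my own for illustration, and the sketch leaves out the 1.7 scaling constant that some IRT presentations include.

```python
import math

def irt_probability(ability, difficulty, discrimination=1.0, guessing=0.0):
    """Probability of a correct response under a logistic IRT model.

    The defaults (discrimination 1.0, guessing 0.0) give the Rasch form.
    Freeing discrimination gives a 2-parameter model; freeing guessing
    as well gives a 3-parameter model.
    """
    logistic = 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))
    return guessing + (1.0 - guessing) * logistic

# A person whose ability equals the item's difficulty.
print(irt_probability(0.0, 0.0))                 # Rasch form
print(irt_probability(0.0, 0.0, guessing=0.25))  # e.g. a four-option multiple choice item
```

With no guessing, a person whose ability matches the item's difficulty has an even chance of answering correctly; a nonzero guessing parameter raises that floor.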

When you use Rasch, you're saying that you have a good measure when it gives you data that fits the Rasch model. Poor fit to the Rasch model means you need to rework your measure - perhaps dropping items, collapsing response scales, or noting inconsistencies in person scores that mean their data might not be valid (and could be dropped from the analysis).

For Blogging A to Z 2017, I went through the alphabet of statistics, and for the letter G, I talked about goodness of fit. In Rasch, we look at our global fit statistics to see how well our data fit the prescribed Rasch model. If our data don't fit, we start looking at why and retooling our measure so it does.

The primary global fit statistic we should look at is the log-likelihood chi-square statistic, which, as the name implies, is based on the chi-square distribution. A significant chi-square statistic in this case means the data significantly differ from the model. Just like in structural equation modeling, it is a measure of absolute fit.

There are other fit statistics you can look at, such as the Akaike Information Criterion (AIC) and Schwarz Bayesian Information Criterion (BIC). These statistics are used for model comparison (relative fit), where you might test out different Rasch approaches to see what best describes the data (such as a Rating Scale Model versus a Partial Credit Model) or see if changes to the measure (like dropping items) result in better fit. These values are derived from the log-likelihood statistic and either degrees of freedom for the AIC or number of non-extreme cases (in Rasch, extreme cases would be those that got every item right or every item wrong) for the BIC. (You can find details and the formulas for AIC and BIC here.)

BIC seems to be the preferred metric here, since it accounts for extreme cases; a measure with lots of extreme cases is not as informative as a measure with few extreme cases, so this metric can help you determine if dropping too easy or too difficult items improves your measure (it probably will, but this lets you quantify that).
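For a sense of how the comparison works, here is a sketch using the general textbook definitions of the two criteria. Your Rasch software may compute or report them slightly differently, so treat this as an illustration and not as a description of any particular program. The log-likelihoods, parameter counts, and sample size below are made-up numbers, and the number of cases is taken to be the count of non-extreme persons, as described above.

```python
import math

def aic(log_likelihood, n_parameters):
    """Akaike Information Criterion (textbook form); lower is better."""
    return -2.0 * log_likelihood + 2.0 * n_parameters

def bic(log_likelihood, n_parameters, n_cases):
    """Bayesian Information Criterion (textbook form); lower is better.

    n_cases is assumed here to be the number of non-extreme persons.
    """
    return -2.0 * log_likelihood + n_parameters * math.log(n_cases)

# Hypothetical comparison of two models fit to the same 400 non-extreme persons.
models = {
    "rating scale": {"log_likelihood": -5120.0, "n_parameters": 23},
    "partial credit": {"log_likelihood": -5085.0, "n_parameters": 80},
}
for name, m in models.items():
    print(
        name,
        round(aic(m["log_likelihood"], m["n_parameters"]), 1),
        round(bic(m["log_likelihood"], m["n_parameters"], n_cases=400), 1),
    )
```

Both criteria reward a higher log-likelihood and penalize extra parameters, with the BIC penalty growing with the number of cases, so the two can disagree about which model to prefer.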

Tomorrow, I'll talk about setting up your data for Rasch analysis.

Sunday, April 7, 2019

F is for Facets

So far this month, I've talked about the different things that affect the outcome of measurement - that is, the things that determine how someone will respond. Those things so far would be item difficulty and person ability. How much ability a person has or how much of the trait they possess will affect how they respond to items of varying difficulty. Each of the things that interact to affect the outcome of Rasch measurement is called a "facet."

But these don't have to be the only facets in a Rasch analysis. You can have additional facets, making for a more complex model. In our content validation studies, we administer a job analysis survey asking people to rate different job-related tasks. As is becoming standard in this industry, we use two different scales for each item, one rating how frequently this task is performed (so we can place more weight on the more frequently performed items) and one rating how critical it is to perform this task competently to protect the public (so we can place more weight on the highly critical items). In this model, we can differentiate between the two scales and see how the scale used changes how people respond. This means that rating scale also becomes a facet, one with two levels: frequency scale and criticality scale.

When conducting a more complex model like this, we need software that can handle these complexities. The people who brought us Winsteps, the software I use primarily for Rasch analysis, also have a program called Facets, which can handle these more complex models.

In a previous blog post, I talked about a facets model I was working with, one with four facets: person ability, item difficulty, rating scale, and timing (whether they received frequency first or second). But one could use a facets model for other types of data, like judge rating data. The great thing about using facets to examine judge data is that one can also partial out concepts like judge leniency; that is, some judges go "easier" on people than others, and a facets model lets you model that leniency. You would just need to have your judges rate more than one person in the set and have some overlap with other judges, similar to the overlap I introduced in the equating post.
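To show what adding a judge facet does to the model, here is a deliberately simplified sketch for a pass/fail rating, where judge severity is just one more term subtracted on the logit scale. Real judge data usually involve rating scales with more than two categories, which the Facets program handles and this sketch does not. The function name and the numbers are hypothetical.

```python
import math

def facets_probability(ability, difficulty, judge_severity=0.0):
    """Probability of a 'pass' rating in a simplified three-facet model.

    Log-odds of success = ability - difficulty - judge severity, all in logits.
    A lenient judge has negative severity; a harsh judge has positive severity.
    """
    logit = ability - difficulty - judge_severity
    return 1.0 / (1.0 + math.exp(-logit))

# The same person and task, rated by a lenient judge and a severe judge.
lenient = facets_probability(ability=0.5, difficulty=0.0, judge_severity=-1.0)
severe = facets_probability(ability=0.5, difficulty=0.0, judge_severity=1.0)
print(round(lenient, 3), round(severe, 3))
```

Because severity sits on the same logit scale as ability and difficulty, it can be estimated and then removed from the person measures, which is why the overlap between judges matters: without it, there is no way to tell a harsh judge from a weak group of candidates.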

This is the thing I love about Rasch measurement, in that it is a unique approach to measurement that can expand in complexity to whatever measurement situation you're presented with. It's all based on the Rasch measurement model, a mathematical model that represents the characteristics a "good measure" should possess - that's what we'll talk about tomorrow when we examine global fit statistics!