
Saturday, April 18, 2020

P is for percent

We've used ggplots throughout this blog series, but today, I want to introduce another package that helps you customize scales on your ggplots - the scales package. I use this package most frequently to format scales as percent. There aren't a lot of good ways to use percents with my dataset, but one example would be to calculate the percentage each book contributes to the total pages I read in 2019.
library(tidyverse)
## -- Attaching packages ------------------------------------------- tidyverse 1.3.0 --
## ✓ ggplot2 3.2.1     ✓ purrr   0.3.3
## ✓ tibble  2.1.3     ✓ dplyr   0.8.3
## ✓ tidyr   1.0.0     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.4.0
## -- Conflicts ---------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
reads2019 <- read_csv("~/Downloads/Blogging A to Z/SaraReads2019_allrated.csv",
                      col_names = TRUE)
## Parsed with column specification:
## cols(
##   Title = col_character(),
##   Pages = col_double(),
##   date_started = col_character(),
##   date_read = col_character(),
##   Book.ID = col_double(),
##   Author = col_character(),
##   AdditionalAuthors = col_character(),
##   AverageRating = col_double(),
##   OriginalPublicationYear = col_double(),
##   read_time = col_double(),
##   MyRating = col_double(),
##   Gender = col_double(),
##   Fiction = col_double(),
##   Childrens = col_double(),
##   Fantasy = col_double(),
##   SciFi = col_double(),
##   Mystery = col_double(),
##   SelfHelp = col_double()
## )
reads2019 <- reads2019 %>%
  mutate(perpage = Pages/sum(Pages))
The new variable, perpage, is a proportion. But if I display those data with a figure, I want them to be percentages instead. Here's how to do that. (If you don't already have the scales package, add install.packages("scales") at the beginning of this code.)
library(scales)
## 
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
## 
##     discard
## The following object is masked from 'package:readr':
## 
##     col_factor
reads2019 %>%
  ggplot(aes(perpage)) +
  geom_histogram() +
  scale_x_continuous(labels = percent, breaks = seq(0,.05,.005)) +
  xlab("Percentage of Total Pages Read") +
  ylab("Books")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
You need to make sure you load the scales package before you use the labels = percent argument, or you'll get an error message. Alternatively, you can tell R to use the scales package just for that argument by adding scales:: before percent. This trick becomes useful when you have lots of packages loaded that share function names: R uses the version from the most recently loaded package, which masks the same-named function from any other package, so the explicit prefix removes the ambiguity.
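For example, this should produce the same histogram without loading scales at all - the only change from the code above is the scales:: prefix:

reads2019 %>%
  ggplot(aes(perpage)) +
  geom_histogram() +
  scale_x_continuous(labels = scales::percent, breaks = seq(0,.05,.005)) +
  xlab("Percentage of Total Pages Read") +
  ylab("Books")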

This post also seems like a great opportunity to hop on my statistical high horse and talk about the difference between a histogram and a bar chart. Why is this important? With everything going on in the world - pandemics, political elections, etc. - I've seen lots of comments on others' intelligence, many of which show a misunderstanding of the most well-known histogram: the standard normal curve. You see, raw data, even from a huge number of people and even on a standardized test, like a cognitive ability (aka IQ) test, is never as clean or pretty as it appears in a histogram.

Histograms use a process called "binning", where ranges of scores are combined to form one of the bars. The bins can be made bigger (including a larger range of scores) or smaller, and smaller bins will start showing the jagged nature of most data, even so-called normally distributed data.

As one example, let's show what my percent figure would look like as a bar chart instead of a histogram (like the one above).
reads2019 %>%
  ggplot(aes(perpage)) +
  geom_bar() +
  scale_x_continuous(labels = percent, breaks = seq(0,.05,.005)) +
  xlab("Percentage of Total Pages Read") +
  ylab("Books")
As you can see, lots of books were binned together for the histogram. I can customize the number of bins in my histogram, but unless I set it to give one bin to each x value, the result will be much cleaner than the bar chart. The same is true for cognitive ability scores. Each bar is a bin, and that bin contains a range of values. So when we talk about scores on a standardized test, we're really referring to a range of scores.

Now, my reading dataset is small - only 87 observations. What happens if I generate a large, random dataset?
set.seed(42)

test <- tibble(ID = c(1:10000),
               value = rnorm(10000))

test %>%
  ggplot(aes(value)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
See that "stat_bin()" warning message? It's telling me that there are 30 bins, so R divided up the range of scores into 30 equally sized bins. What happens when I increase the number of bins? Let's go really crazy and have it create one bin for each score value.
library(magrittr)
## 
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
## 
##     set_names
## The following object is masked from 'package:tidyr':
## 
##     extract
test %$% n_distinct(value)
## [1] 10000
test %>%
  ggplot(aes(value)) +
  geom_histogram(bins = 10000)
Not nearly so pretty, is it? Mind you, this is 10,000 values randomly generated to follow the normal distribution. When you give each value a bin, it doesn't look very normally distributed.

How about if we mimic cognitive ability scores, with a mean of 100 and a standard deviation of 15? I'll even force it to have whole numbers, so we don't have decimal places to deal with.
CogAbil <- tibble(Person = c(1:10000),
                  Ability = rnorm(10000, mean = 100, sd = 15))

CogAbil <- CogAbil %>%
  mutate(Ability = round(Ability, digits = 0))

CogAbil %$%
  n_distinct(Ability)
## [1] 103
CogAbil %>%
  ggplot(aes(Ability)) +
  geom_histogram() +
  labs(title = "With 30 bins") +
  theme(plot.title = element_text(hjust = 0.5))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
CogAbil %>%
  ggplot(aes(Ability)) +
  geom_histogram(bins = 103) +
  labs(title = "With 1 bin per whole-point score") +
  theme(plot.title = element_text(hjust = 0.5))
(Now, there's more that goes into developing a cognitive ability test, because the original scale of the test (raw scores) differs from the standardized scale that turns raw scores into a metric with a mean of 100 and a standard deviation of 15. That's where an entire field's worth of knowledge (psychometrics) comes in.)

This is not to say histograms lie - they simplify. And they're not really meant to be used the way many people try to use them.

Tuesday, January 14, 2020

Updates

New year, new job, new blog post describing it all. On January 6, I started working as a Data Analyst at the American Board of Medical Specialties, which oversees certification and maintenance of certification activities for 24 Member Boards (such as the American Board of Dermatology, American Board of Nuclear Medicine, and so on).

The main part of my job will be doing analysis, research, and program evaluation of the CertLink program, which is a really cool online system that tests physician knowledge in their certification area, provides feedback and introduces new information to improve over time, and measures the relevance of items to their practice, so that their maintenance of certification assessments can become more targeted to the population and types of cases they encounter in their practice. We're hoping that this kind of system will become the future of medical specialty certification, so that rather than taking a high-stakes exam every 10 years, medical specialists can maintain their certifications through targeted, longitudinal assessment and continuing education. And we're hoping to show this approach works by tying it to long-term quality-of-care outcomes, like prescribing patterns. I'll share more as I learn more about the company and my role, to the degree that I can based on data privacy. But I'm so excited to be involved with this, using my psychometrics and statistics skills for the data I'm working with, and my research/program evaluation skills to show (how) the system works. I also finally get to use my SQL knowledge as part of my job, and will be using my R and Python programming skills pretty regularly as well.

Zeppelin is adjusting well to me working again. He adores his dog walker, who he sees three times a week, and has made many new friends in the doggy daycare he attends twice a week. He also has a huge crush on Mona, who can be found at Uncharted Books, stopping to stare longingly at her every time we walk by the shop. As is the case with so many crushes, this love seems to be unrequited; Mona tolerates Zeppelin but doesn't like the way he drinks out of her water bowl when we stop in.

On the blogging front, I'm working on an analysis of the 88 books I read last year, and might even do some long-term analysis of my last few years of reading data. Stay tuned for that.

Wednesday, April 10, 2019

I is for Item Fit

Rasch gives you lots of item-level data: not only difficulties, but also fit indices, for both items and persons. In addition to the log-likelihood chi-square statistic that tells you how well your data fit the Rasch model overall, you receive item fit indices, which compare observed responses to the responses expected under the Rasch model. These indices are also based on chi-square statistics.

There are two types of fit indices: INFIT and OUTFIT.

OUTFIT is sensitive to Outliers: responses that fall outside of the targeted ability level, such as a high-ability respondent missing an item targeted to their ability level, or a low-ability respondent getting a difficult item correct. This could reflect a problem with the item - perhaps it's poorly worded and is throwing off people who actually know the information. Or perhaps there's a cue that is leading people to the correct answer who wouldn't otherwise get it right. These statistics can cue you in to problems with the item.

INFIT (Information weighted) is sensitive to responses that are too predictable. These items don't tell you anything you don't already know from other items. Every item should contribute to the estimate. More items is not necessarily better - this is one way Rasch differs from Classical Test Theory, where adding more items increases reliability. The more items you give a candidate, the greater your risk of fatigue, which will lead reliability (and validity) to go down. Every item should contribute meaningful, and unique, data. These statistics cue you in on items that might not be necessary.

The expected value for both of these statistics is 1.0. Any items that deviate from that value might be problematic. Linacre recommends a cut-off of 2.0, where any items that have an INFIT or OUTFIT of 2.0 or greater should be dropped from the measure. Test developers will sometimes adopt their own cut-off values, such as 1.5 or 1.7. If you have a large bank, you can probably afford to be more conservative and drop items above 1.5. If you're developing a brand new test or measure, you might want to be more lenient and use the 2.0 cut-off. Whatever you do, just be consistent and cite the literature whenever you can to support your selected cut-off.
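If you want to poke at these statistics yourself in R, here's a rough sketch using the eRm package and simulated data - this isn't my Winsteps workflow, just an illustration, and you'd need install.packages("eRm") first:

library(eRm)

set.seed(42)
resp <- sim.rasch(500, 20)   # 500 simulated persons, 20 dichotomous items

# Drop any perfect or zero scorers up front so row indices line up across objects
resp <- resp[rowSums(resp) > 0 & rowSums(resp) < ncol(resp), ]

rasch_mod <- RM(resp)                # fit the dichotomous Rasch model
pp <- person.parameter(rasch_mod)    # person ability estimates
fit <- itemfit(pp)                   # item INFIT and OUTFIT mean-squares

# Items at or above the chosen cut-off (2.0 here) are candidates for review
which(fit$i.infitMSQ >= 2 | fit$i.outfitMSQ >= 2)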

Though this post is about item fit, these same statistics also exist for each person in your dataset. A misfitting person means the measure is not functioning the same for them as it does for others. This could mean the candidate got lazy and just responded at random. Or it could mean the measure isn't valid for them for some reason. (Or it could just be chance.) Many Rasch purists see no issue with dropping people who don't fit the model, but as I've discovered when writing up the results of Rasch analysis for publication, reviewers don't take kindly to dropping people unless you have other evidence to support it. (And since Rasch is still not a well-known approach, they mean evidence outside of Rasch analysis, like failing a manipulation check.)

The best approach I've seen is once again recommended by Linacre: persons with very high OUTFIT statistics are removed and ability estimates from the smaller sample are cross-plotted against the estimates from the full sample. If removal of these persons has little effect on the final estimates, these persons can be retained, because they don't appear to have any impact on the results. That is, they're not driving the results. 

If there is a difference, Linacre recommends next examining persons with smaller (but still greater than 2.0) OUTFIT statistics and cross-plotting again. Though there is little guidance on how to define very high and high, in my research, I frequently use an OUTFIT of 3.0 for ‘very high’ and 2.0 for ‘high.’ In my experience, such a sensitivity analysis never shows any problem, and I'm able to justify keeping everyone in the sample. This seems to make both reviewers and Rasch purists happy.
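Continuing the eRm sketch from above, the cross-plot check might look something like this - the 3.0 threshold is my own choice, not a package default:

pfit <- personfit(pp)                # person OUTFIT/INFIT mean-squares
keep <- which(pfit$p.outfitMSQ < 3)  # retain persons without very high OUTFIT

pp_reduced <- person.parameter(RM(resp[keep, ]))

# Cross-plot the retained persons' abilities from the full and reduced runs
theta_full    <- pp$theta.table[keep, "Person Parameter"]
theta_reduced <- pp_reduced$theta.table[, "Person Parameter"]

plot(theta_full, theta_reduced,
     xlab = "Ability (full sample)", ylab = "Ability (misfits removed)")
abline(0, 1)   # points hugging this line mean the misfitting persons aren't driving results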


Tuesday, April 9, 2019

H is for How to Set Up Your Data File

The exact way you set up your data depends, of course, on the software you use. But my focus today is to give you some things to think about when setting up your data for Rasch analysis.

First, know how your software needs you to format missing values. Many programs will let you simply leave a blank space or cell. Winsteps is fine with a blank space to notate a missing value or skipped question. Facets, on the other hand, will flip out at a blank space and needs a missing value code set up (I usually use 9).

Second, ordering of the file is very important, especially if you're working with data from a computer adaptive test, which means handling missing values is also important. When someone takes a computer adaptive test, their first item is drawn at random from a set of moderately difficult items. The difficulty of the next item depends on how they did on the first item, but even so, the item is randomly drawn from a set or range of items. So when you set up your data file, you need to be certain that all people who responded to a specific item have that response in the same column (not necessarily the position at which the item was administered in the exam).

This is why you need to be meticulously organized with your item bank and give each item an identifier. When you assemble responses from computer adaptive tests, you'll need to reorder people's responses. That is, you'll set up an order for every item in the bank by identifier. When data are compiled, each person's responses are put in that order, and if a particular item in the bank wasn't administered, there will be a space or missing value there.
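Here's a hedged illustration of that reordering with tidyverse tools - the person IDs, item identifiers, and column names are all made up:

library(tidyverse)

responses_long <- tribble(
  ~person, ~item_id, ~response,
  1, "ITM001", 1,
  1, "ITM003", 0,
  2, "ITM002", 1,
  2, "ITM003", 1
)

bank_order <- c("ITM001", "ITM002", "ITM003")

# One column per item in the bank, in bank order; NA where an item wasn't
# administered (swap in 9 for software, like Facets, that needs a code)
responses_wide <- responses_long %>%
  complete(person, item_id = bank_order) %>%
  arrange(person, match(item_id, bank_order)) %>%
  pivot_wider(names_from = item_id, values_from = response)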

Third, be sure you differentiate between item variables and other variables, like person identifiers, demographics, and so on. Once again, know your software. You may find that a piece of software just runs an entire dataset as though all variables are items, meaning you'll get weird results if you have a demographic variable mixed in. Others might let you select certain variables for the analysis and/or categorize variables as items and non-items.

I tend to keep a version of my item set in Excel, with a single variable at the beginning with participant ID number. Excel is really easy to import into most software, and I can simply delete the first column if a particular program doesn't allow non-item variables. If I drop any items (which I'll talk more about tomorrow), I do it from this dataset. A larger dataset, with all items, demographic variables, and so on, is usually kept in SPSS, since that's the preferred software at my company (I'm primarily an R user, but I'm the only one, and R can read SPSS files directly), in case I ever need to pull in any additional variables for group comparisons. This dataset is essentially the master, and any smaller files I need are built from it.


Monday, April 8, 2019

G is for Global Fit Statistics

One of the challenges of Blogging A to Z is that the posts have to go in alphabetical order, even if it would make sense to start with a topic from the middle of the alphabet. It's more like creating the pieces of a puzzle that could (and maybe should) be put together in a different order than they were created. But like I said, it's a challenge!

So now that we're on letter G, it's time to talk about a topic that I probably would have started with otherwise: what exactly is the Rasch measurement model? Yes, it is a model that allows you to order people according to their abilities (see letter A) and items according to their difficulties. But more than that, it's a prescriptive model of measurement - it contains within it the mathematical properties of a good measure (or at least, one definition of a good measure). This is how Rasch differs from Item Response Theory (IRT) models, though Rasch is often grouped into IRT despite its differences. You see, mathematically, Rasch is not very different from the IRT 1-parameter model, which focuses on item difficulty (and by extension, person ability). But philosophically, it is very different, because while Rasch is prescriptive, IRT is descriptive. If an IRT 1-parameter model doesn't adequately describe the data, you could just select a different IRT model. But Rasch says that the data must fit its model, and it gives you statistics to tell you how well they do. If your data don't fit the model, the deficiency is with the data (and your measure), not the model itself.

Note: This is outside of the scope of this blog series, but in IRT, the second parameter is item discrimination (how well the item differentiates between high and low ability candidates) and the third is the pseudo-guessing parameter (the likelihood you'd get an answer correct based on chance alone). The Rasch model assumes that the item discrimination for all items is 1.0 and does not make any corrections for potential guessing. You know how the SAT used to penalize you for wrong answers? It was to discourage guessing. They don't want you answering a question if you don't know the answer; a lucky guess is not a valid measure. What can I say, man? We psychometricians are a**holes.

When you use Rasch, you're saying that you have a good measure when it gives you data that fits the Rasch model. Poor fit to the Rasch model means you need to rework your measure - perhaps dropping items, collapsing response scales, or noting inconsistencies in person scores that mean their data might not be valid (and could be dropped from the analysis).

For Blogging A to Z 2017, I went through the alphabet of statistics, and for the letter G, I talked about goodness of fit. In Rasch, we look at our global fit statistics to see how well our data fit the prescribed Rasch model. If our data don't fit, we start looking at why and retooling our measure so it does.

The primary global fit statistic we should look at is the log-likelihood chi-square statistic, which, as the name implies, is based on the chi-square distribution. A significant chi-square statistic in this case means the data differ significantly from the model. Just like in structural equation modeling, it is a measure of absolute fit.

There are other fit statistics you can look at, such as the Akaike Information Criterion (AIC) and Schwarz Bayesian Information Criterion (BIC). These statistics are used for model comparison (relative fit), where you might test out different Rasch approaches to see what best describes the data (such as a Rating Scale Model versus a Partial Credit Model) or see whether changes to the measure (like dropping items) result in better fit. These values are derived from the log-likelihood statistic and either the degrees of freedom for the AIC or the number of non-extreme cases (in Rasch, extreme cases would be those that got every item right or every item wrong) for the BIC. (You can find details and the formulas for AIC and BIC here.)
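For reference, the standard formulas are easy to compute by hand once you have the log-likelihood - this is a generic sketch, so double-check the linked source for the exact Rasch conventions:

# df = number of estimated parameters; n_nonextreme = number of non-extreme cases
aic <- function(loglik, df) -2 * loglik + 2 * df
bic <- function(loglik, df, n_nonextreme) -2 * loglik + df * log(n_nonextreme)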

BIC seems to be the preferred metric here, since it accounts for extreme cases; a measure with lots of extreme cases is not as informative as a measure with few extreme cases, so this metric can help you determine if dropping too easy or too difficult items improves your measure (it probably will, but this lets you quantify that).

Tomorrow, I'll talk about setting up your data for Rasch analysis.

Sunday, April 7, 2019

F is for Facets

So far this month, I've talked about the different things that affect the outcome of measurement - that is, that determine how someone will respond. Those things so far have been item difficulty and person ability. How much ability a person has, or how much of the trait they possess, will affect how they respond to items of varying difficulty. Each of the things that interact to affect the outcome of Rasch measurement is called a "facet."

But these don't have to be the only facets in a Rasch analysis. You can have additional facets, making for a more complex model. In our content validation studies, we administer a job analysis survey asking people to rate different job-related tasks. As is becoming standard in this industry, we use two different scales for each item, one rating how frequently this task is performed (so we can place more weight on the more frequently performed items) and one rating how critical it is to perform this task competently to protect the public (so we can place more weight on the highly critical items). In this model, we can differentiate between the two scales and see how the scale used changes how people respond. This means that rating scale also becomes a facet, one with two levels: frequency scale and criticality scale.

When conducting a more complex model like this, we need software that can handle these complexities. The people who brought us Winsteps, the software I use primarily for Rasch analysis, also have a program called Facets, which can handle these more complex models.

In a previous blog post, I talked about a facets model I was working with, one with four facets: person ability, item difficulty, rating scale, and timing (whether they received the frequency scale first or second). But one could use a facets model for other types of data, like judge rating data. The great thing about using facets to examine judge data is that one can also partial out concepts like judge leniency; that is, some judges go "easier" on people than others, and a facets model lets you model that leniency. You would just need to have your judges rate more than one person in the set and have some overlap with other judges, similar to the overlap I introduced in the equating post.

This is the thing I love about Rasch measurement: it's an approach that can expand in complexity to match whatever measurement situation you're presented with. It's all based on the Rasch measurement model, a mathematical model that represents the characteristics a "good measure" should possess - that's what we'll talk about tomorrow when we examine global fit statistics!


Friday, April 5, 2019

E is for Equating

In the course of any exam, new items and even new forms have to be written. The knowledge base changes and new topics become essential. How do we make sure these new items contribute to the overall test score? Through a process called equating.

Typically in Rasch, we equate test items by sprinkling new items in with old items. When we run our Rasch analysis on items and people, we "anchor" the old items to their already established item difficulties. When item difficulties are calculated for the new items, they are now on the exact same metric as the previous items, and new difficulties are established relative to the ones that have already been set through pretesting.

It's not even necessary for everyone to receive all pretest items - or even all of the old items. You just need enough overlap to create links between old and new items. In fact, when you run data from a computer adaptive test, there are a lot of "holes" in the data, creating a sparse matrix.
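Here's a toy version of the kind of matrix I mean - the item labels and responses are hypothetical, and NA marks items an examinee never saw:

# Each row is an examinee, each column an item; the partial overlap across
# examinees is what links old and new items together
sparse_resp <- rbind(
  c( 1, NA,  0, NA,  1),
  c(NA,  1,  1, NA, NA),
  c( 1,  0, NA,  1, NA),
  c(NA, NA,  1,  1,  0)
)
colnames(sparse_resp) <- c("ITM001", "ITM002", "ITM003", "ITM004", "ITM005")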


In the example above, few examinees received the exact same combination of items, but with an entire dataset that looks like this (and more examinees, of course), we could estimate item difficulties for new items in the set.

But you may also ask about the "old" items - do we always stick with the same difficulties or do we change those from time to time? Whenever you anchor items to existing item difficulties, the program will still estimate fresh item difficulties and let you examine something called "displacement": how much the newly estimated difficulty differs from the anchored one. You want to look at these and make sure you're not experiencing what's called "item drift," which happens when an item becomes easier or harder over time.

This definitely happens. Knowledge that might have previously been considered difficult can become easier over time. Math is a great example. Many of the math classes my mom took in high school and college (and that may have been electives) were offered to me in middle school or junior high (and were mandatory). Advances in teaching, as well as better understanding of how these concepts work, can make certain concepts easier.

On the flipside, some items could get more difficult over time. My mom was required to take Latin in her college studies. I know a little "church Latin" but little else. Items about Latin would have been easier for my mom's generation, when the language was still taught, than mine. And if we could hop in a time machine and go back to when scientific writing was in Latin (and therefore, to be a scientist, you had to speak/write Latin fluently), these items would be even easier for people.

Essentially, even though Rasch offers an objective method of measurement and a way to quantify item difficulty, difficulty is still a relative concept and can change over time. Equating is just one part of the type of testing you must regularly do with your items to maintain a test or measure.

Thursday, April 4, 2019

D is for Dimensionality

We're now 4 posts into Blogging A to Z, and I haven't really talked much about the assumptions of Rasch. That is, like any statistical test, there are certain assumptions you must meet in order for results to be valid. The same is true of Rasch. One of the key assumptions of Rasch is that the items all measure the same thing - the same latent variable. Put another way, your measure should be unidimensional: only assessing one dimension.

This is because, in order for the resulting item difficulties and person abilities to be valid - and comparable to each other - they have to be assessing the same thing. It wouldn't make sense to compare one person's math test score to a reading test score. And it wouldn't make sense to combine math and reading items into the same test; what exactly does the exam measure?

But the assumption of unidimensionality goes even further than that. You also want to be careful that each individual item is only measuring one thing. This is harder than it sounds. While a certain reading ability is needed for just about any test question, and some items will need to use jargon, a math item written at an unnecessarily high reading level is actually measuring two dimensions: math ability and reading ability. The same is true for poorly written items that give clues as to the correct answer, or trick questions that trip people up even when they have the knowledge. Your test then measures not only ability on the test topic, but a second ability: test savviness. The only thing that impacts whether a person gets an item correct should be their ability level in that domain. That is an assumption we have to make for a measurement to be valid. That is, if a person with high math ability gets an item incorrect because of low reading ability, we've violated that assumption. And if a person with low ability gets an item correct because there was an "all of the above" option, we've violated that assumption.

How do we assess dimensionality? One way is with principal components analysis (PCA), which I've blogged about before. As a demonstration, I combined two separate measures (the Satisfaction with Life Scale and the Center for Epidemiologic Studies Depression measure) and ran them through a Rasch analysis as though they were a single measure. The results couldn't have been more perfect if I tried. The PCA results showed 2 clear factors - one made up of the 5 SWLS items and one made up of the 16 CESD items. Here's some of the output Winsteps gave me for the PCA. The first looks at explained variance and Eigenvalues:


There are two things to look at when running a PCA for a Rasch measure. 1. The variance explained by the measures should be at least 40%. 2. The Eigenvalue for the first contrast should be less than 2. As you can see, the Eigenvalue right next to the text "Unexplned variance in 1st contrast" is 7.76, while the values for the 2nd contrast (and on) are less than 2. That means we have two dimensions in the data - and those dimensions are separated out when we scroll down to look at the results of the first contrast. Here are those results in visual form - the letters in the plot refer to the items.


You're looking for two clusters of items, which we have in the upper left-hand corner and the lower right-hand corner. The next part identifies which items the letters refer to and gives the factor loadings:


The 5 items on the left side of the table are from the SWLS. The 16 on the right are all from the CESD. Like I said, perfect results.
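Winsteps runs this PCA of the standardized residuals for you, but here's a rough sketch of the same idea in R with the eRm package and simulated data - two independent traits glued together to mimic the SWLS + CESD demonstration (none of this is the Winsteps output above):

library(eRm)

set.seed(123)
trait1 <- sim.rasch(500, 10)    # 10 items driven by one latent trait
trait2 <- sim.rasch(500, 10)    # 10 items driven by a second, independent trait
resp <- cbind(trait1, trait2)   # pretend they're a single 20-item measure
colnames(resp) <- paste0("item", 1:ncol(resp))

pp <- person.parameter(RM(resp))
std_res <- itemfit(pp)$st.res   # standardized residuals (persons x items)

pca <- prcomp(std_res, scale. = TRUE)
head(pca$sdev^2)   # eigenvalues; a first contrast of 2 or more flags a second dimension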

But remember that PCA is entirely data-driven, so it could identify additional factors that don't have any meaning beyond that specific sample. So the PCA shouldn't be the only piece of evidence for unidimensionality, and there might be cases where you forgo that analysis altogether. If you've conducted a content validation study, that can be considered evidence that these items all belong on the same measure and assess the same thing. That's because content validation studies are often based on expert feedback, surveys of people who work in the topic area being assessed, job descriptions, and textbooks. Combining and cross-validating these different sources can be much stronger evidence than an analysis that is entirely data dependent.

But the other question is, how do we ensure all of our items are unidimensional? Have clear rules for how items should be written. Avoid "all of the above" or "none of the above" as answers - they're freebie points, because if they're an option, they're usually the right one. And if you just want to give away freebie points, why even have a test at all? Also make clear rules about what terms and jargon can be used in the test, and run readability analysis on your items - ignore those required terms and jargon and try to make the reading level of everything else as consistent across the exam as possible (and as low as you can get away with). When I was working on creating trait measures in past research, the guidelines from our IRB were usually that measures should be written at a 6th-8th grade reading level. And unless a measure is of reading level, that's probably a good guideline to try to follow.

Tomorrow, we'll talk about equating test items!

Tuesday, April 2, 2019

B is for Bank

As I alluded to yesterday, in Rasch, every item gets a difficulty and every person taking (some set of) those items gets an ability. They're both on the same scale, so you can compare the two, and determine which item is right for the person based on their ability and which ability is right for the person based on how they perform on the items. Rasch uses a form of maximum likelihood estimation to create these values, so it goes back and forth with different values until it gets a set of item difficulties and person abilities that fit the data.

Once you have a set of items that have been thoroughly tested and have item difficulties, you can begin building an item bank. A bank could be all the items on a given test (so the measure itself is the bank), but usually, when we refer to a bank, we're talking about a large pool of items from which to draw for a particular administration. This is how computer adaptive tests work. No person is going to see every single item in the bank.

Maintaining an item bank is a little like being a project manager. You want to be very organized and make sure you include all of the important information about the items. Item difficulty is one of the statistics you'll want to save about the items in your bank. When you administer a computer adaptive test, the difficulty of the item the person receives next is based on whether they got the previous item correct or not. If they got the item right, they get a harder item. If they got it wrong, they get an easier item. They keep going back and forth like this until we've administered enough items to be able to estimate that person's ability.

You want to maintain the text of the item: the stem, the response options, and which option is correct (the key). The bank should also include what topics each item covers. You'll have different items that cover different topics. On ability tests, those are the different areas people should know on that topic. An algebra test might have topics like linear equations, quadratic equations, word problems, and so on.

With adaptive tests, you'll want to note which items are enemies of each other. These are items that are so similar (they cover the same topic and may even have the same response options) that you'd never want to administer both in the same test. Not only do enemy items give you no new information on that person's ability to answer that type of question, they may even provide clues about each other that lead someone to the correct answer. This is bad, because the ability estimate based on this performance won't be valid - the ability you get becomes a measure of how savvy the test taker is, rather than their actual ability on the topic of the test.

On the flip side, there might be items that go together, such as a set of items that all deal with the same reading passage. So if an examinee randomly receives a certain passage of text, you'd want to make sure they then receive all items associated with that passage.

Your bank might also include items you're testing out. While some test developers will pretest items through a special event, where examinees are only receiving new items with no item difficulties, once a test has been developed, new items are usually mixed in with old items to gather item difficulty data, and to calibrate the difficulties of the new items based on difficulties of known items. (I'll talk more about this when I talk about equating.) The new items are pretest and considered unscored - you don't want performance on them to affect the person's score. The old items are operational and scored. Some tests put all pretest items in a single section; this is how the GRE does it. Others will mix them in throughout the test.

Over time, items are seen more and more. You might reach a point where an item has been seen so much that you have to assume it's been compromised - shared between examinees. You'll want to track how long an item has been in the bank and how many times it's been seen by examinees. In this case, you'll retire an item. You'll often keep retired items in the bank, just in case you want to reuse or revisit them later. Of course, if the item tests on something that is no longer relevant to practice, it may be removed from the bank completely.
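As a purely hypothetical sketch - the column names and values are made up, not from any real bank - a bank record might look something like this (in practice you'd also store the stem, response options, and references):

library(tibble)

item_bank <- tribble(
  ~item_id, ~topic,                ~b,    ~key, ~enemies, ~status,       ~times_seen,
  "ALG001", "linear equations",    -0.85, "C",  "ALG014", "operational", 1243,
  "ALG014", "linear equations",    -0.78, "A",  "ALG001", "operational", 1180,
  "ALG102", "quadratic equations",  0.40, "B",  NA,       "pretest",       87,
  "ALG230", "word problems",        1.62, "D",  NA,       "retired",     4051
)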

Basically, your bank holds all items that could or should appear on a particular test. If you've ever been a teacher or professor and received supplementary materials with a test book, you may have gotten items that go along with the text. These are item banks. As you look through them, you might notice similarly worded items (enemy items), which textbook topic each item covers, and so on. Sadly, these types of banks don't tend to have item difficulties attached to them. (Wouldn't that be nice, though? Even if you're not going to Rasch grade your students, you could make sure you have a mix of easy, medium, and hard items. But these banks don't tend to have been pretested, a fact I learned when I applied for a job to write many of these supplementary materials for textbooks.)

If you're in the licensing or credentialing world, you probably also need to track references for each item. And this will usually need to be a very specific reference, down to the page of a textbook or industry document. You may be called upon - possibly in a deposition or court - to prove that a particular item is part of practice, meaning you have to demonstrate where this item comes from and how it is required knowledge for the field. Believe me, people do challenge particular items. In the comments section at the end of a computer adaptive test, people will often reference specific items. It's rare to have to defend an item to someone, but when you do, you want to make certain you have all the information you need in one place.

Most of the examples I've used in this post have been for ability tests, and that's traditionally where we've seen item banks. But there have been some movements in testing to use banking for trait and similar tests. If I'm administering a measure of mobility to a person with a spinal cord injury, I may want to select certain items based on where that person falls on mobility. Spinal cord injury can range in severity, so while one person with a spinal cord injury might be wheelchair-bound, another might be able to walk short distances with the help of a walker (especially if they have a partial injury, meaning some signals are still getting through). You could have an item bank so that these two patients get different items; the person who is entirely wheelchair-bound wouldn't get any items about their ability to walk, while the person with the partial injury would. The computer adaptive test would just need a starting item to figure out approximately where the person falls on the continuum of mobility, then direct them to the questions most relevant to their area of the continuum.

Tomorrow, we'll talk more about the rating scale model of Rasch, and how we look at category function!


Monday, April 1, 2019

A is for Ability

Welcome to April A to Z, where I'll go through the A to Z of Rasch! This is the measurement model I use most frequently at work, and is a great way to develop a variety of measures. It started off in educational measurement, and many Rasch departments are housed in educational psychology and statistics departments. But Rasch is slowly making its way into other disciplines as well, and I hope to see more people using this measurement model. It provides some very powerful and useful statistics on measures and isn't really that difficult to learn.

A few disclaimers:

  1. There are entire courses on Rasch measurement, and in fact, some programs divide Rasch up across several courses because there's much that can be learned on the topic. This blog series is really an introduction to the concepts, to help you get started and decide if Rasch is right for you. I won't get into the math behind it as much, but will try to use some data examples to demonstrate the concepts. 
  2. I try to be very careful about what data I present on the blog. My data examples are usually: data I personally own (that is, collected as part of my masters or doctoral study), publicly available data, or (most often) simulated data. None of the data I present will come from any of the exams I work on at my current position or past positions as a psychometrician. I don't own those data and can't share them. 
  3. Finally, these posts are part of a series in which I build on past posts to introduce new concepts. I won't always be able to say everything I want on a topic for a given post, because it ties into another post better. So keep reading - I'll try to make it clear when I'll get back to something later.

For my first post in the series, ability!

When we use measurement with people, we're trying to learn more about some underlying quality about them. We call this underlying quality we want to measure a "latent variable." It's something that can't be captured directly, but rather through proxies. In psychometrics, those proxies are the items. We go through multiple steps to design and test those items to ensure they get as close as we can to tapping into that latent variable. In science/research terms, the items are how we operationalize the latent variable: define it in a way that it can be measured.

In the Rasch approach to psychometrics, we tend to use the term "ability" rather than "latent variable." First, Rasch was originally designed for educational assessment - tests that had clear right and wrong answers - so it makes sense to frame performance on these tests as ability. Second, Rasch deals with probabilities that a person is able to answer a certain question correctly or endorse a certain answer on a test. So even for measures of traits, like attitudes and beliefs or personality inventories, people with more of a trait have a different ability to respond to different questions. Certain answers are easier for them to give because of that underlying trait.

In Rasch, we calibrate items in terms of difficulty (either how hard the question is or, for trait measures, how much of the trait is needed to respond in a certain way) and people in terms of ability. These two concepts are calibrated on the same scale, so once we have a person's ability, we can immediately determine how they're likely to respond to a given item. That scale is in a unit called a "logit," or a log odds ratio. This conversion gives us a distribution that is nearly linear. (Check out the graphs in the linked log odds ratio post to see what I mean.) Typically, when you look at your distribution of person ability measures, you'll see numbers ranging from positive to negative. And while, theoretically, your logits can range from negative infinity to positive infinity, more likely, you'll see values from -2 to +2 or something like that.

That ability metric is calculated based on the items the person responded to (and especially how they responded) - for exams with right and wrong answers, their ability is the difficulty of the item at which they have a 50% chance of responding correctly. The actual analysis involved in creating these ability estimates uses some form of maximum likelihood estimation (there are different types, like Joint Maximum Likelihood Estimation, JMLE, or Pairwise Maximum Likelihood Estimation, PMLE, to name a couple), so it goes back and forth with different values until it gets estimates that best fit the data. This is both a strength and a weakness of Rasch: it's an incredibly powerful analysis technique that makes full use of the data available and handles missing data beautifully, but it couldn't possibly be done without a computer and capable software. In fact, Rasch has been around almost as long as classical test theory approaches - it just couldn't really be adopted until technology caught up.

I'll come back to this concept later this month, but before I close on this post, I want to talk about one more thing: scores. When you administer a Rasch exam and compute person ability scores, you have logits that are very useful for psychometric purposes. But rarely would you ever show those logit scores to someone else, especially an examinee. A score of -2 on a licensing exam or the SAT isn't going to make a lot of sense to an examinee. So we convert those values to a scaled score, using some form of linear equation. That precise equation differs by exam. The SAT, for instance, ranges from 400 to 1600, the ACT from 1 to 36, and the CPA exam from 0 to 99. Each of these organizations has an equation that takes their ability scores and converts them to their scaled score metric. (And in case you're wondering if there's a way to back convert your scaled score on one of these exams to the logit, you'd need to know the slope and constant of the equation they use - not something they're willing to share, nor something you could determine from a single data point. They don't really want you looking under the hood on these types of things.)
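As a toy illustration - the slope, intercept, and score range here are made up, not any real exam's equation:

# Convert logit abilities to a scaled score and clamp to the reportable range
logit_to_scaled <- function(theta, slope = 150, intercept = 1000,
                            min_score = 400, max_score = 1600) {
  scaled <- round(slope * theta + intercept)
  pmin(pmax(scaled, min_score), max_score)
}

logit_to_scaled(c(-2.4, -0.3, 1.1, 3.5))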

Some questions you may have:

  1. The linear equation mentioned above takes into account the possible range of scaled scores (such as 400 to 1600 on the SAT), as well as the actual abilities of the pilot or pretest sample you used. What if you administer the test (with the set difficulties of the items from pretesting), and you get someone outside of the range of abilities of the sample you used to create the equation? Could you get someone with a score below 400 or above 1600? Yes, you could. This is part of why you want a broad range of abilities in your pilot sample, to make sure you're getting all possible outcomes and get the most accurate equation you can. This is also where simulated data could be used, which we do use in psychometrics. Once you have those item difficulties set with pilot testing, you could then create cases that get, for instance, all questions wrong or all questions right. This would give you the lowest possible logit and the highest possible logit. You can then include those values when setting the equation. As long as the difficulties are based on real performance, it's okay to use simulated cases. Finally, that range of possible scores sets the minimum and maximum for reporting. Even if an ability score means a person could have an SAT below 400, their score will be automatically set at the minimum.
  2. What about passing and failing for licensing and credentialing exams? How do ability scores figure into that? Shouldn't you just use questions people should know and forget about all this item difficulty and person ability nonsense? This is one way that Rasch and similar measurement models differ from other test approaches. The purpose of a Rasch test is to measure the person's ability, so you need a good range of item difficulties to determine what a person's true ability is. These values are not intended to tell you whether a person should pass or fail - that's a separate part of the psychometric process called standard setting, where a committee of subject matter experts determines how much ability is necessary to say a person has the required knowledge to be licensed or credentialed. This might be why Rasch makes more sense to people when you talk about it in the educational assessment sense: you care about what the person's ability is. But Rasch is just as useful in pass/fail exams, because of the wealth of information it gives you about your exam and items. You just need that extra step of setting a standard to give pass/fail information. And even in those educational assessments, there are generally standards, such as what a person's ability should be at their age or grade level. Rasch measurement gives you a range of abilities in your sample, which you can then use to make determinations about what those abilities should be.


Tomorrow, we'll dig more into item difficulties when we discuss item banks!

Sunday, March 24, 2019

Statistics Sunday: Blogging A to Z Theme Reveal

I'm excited to announce this year's theme for the Blogging A to Z challenge:


I'll be writing through the alphabet of psychometrics with the Rasch Measurement Model approach. I've written a bit about Rasch previously. You can find those posts here:

Looking forward to sharing these posts! First up is A for Ability on Monday, April 1!

Sunday, March 17, 2019

Statistics Sunday: Standardized Tests in Light of Public Scandal

No doubt, by now, you've heard about the large-scale investigation into college admissions scandals among the wealthy - a scandal that suggests SAT scores, among other things, can in essence be bought. Eliza Shapiro and Dana Goldstein of the NY Times ask if this scandal is "the last straw" for tests like the SAT.

To clarify in advance, I do not now, nor have I ever, worked for the Educational Testing Service or for any organization involved in admissions testing. But as a psychometrician, I have a vested interest in this industry. And I became a psychometrician because of my philosophy: that many things, including ability, achievement, and college preparedness, can be objectively measured if certain procedures and methods are followed. If the methods and procedures are not followed properly in a particular case, the measurement in that case is invalid. That is what happens when a student (or more likely, their parent) pays someone else to take the SAT for them, or bribes a proctor, or finds an "expert" willing to sign off on a disability the student does not have in order to get extra accommodations.

But the fact that a particular instance of measurement is invalid doesn't damn the entire field to invalidity. It just means we have to work harder. Better vetting of proctors, advances in testing like computerized adaptive testing and new item types... all of this helps counteract outside variables that threaten the validity of measurement. And expansions in the field of data forensics now include examining anomalous patterns in testing to identify whether some form of dishonesty has taken place - allowing scores to be rescinded or otherwise declared invalid after the fact.

This is a field I feel strongly about, and as I said, really sums up my philosophy in life for the value of measurement. Today, I'm on my way to the Association of Test Publishers Innovations in Testing 2019 meeting in Orlando. I'm certain this recent scandal will be a frequent topic at the conference, and a rallying cry for better protection of exam material and better methods for identifying suspicious testing behavior. Public trust in our field is on the line. It is our job to regain that trust.


Tuesday, March 12, 2019

Are Likert Scales Superior to Yes/No? Maybe

I stumbled upon this great post from the Personality Interest Group and Espresso (PIG-E) blog about which is better - Likert scales (such as those 5-point Agree to Disagree scales you often see) or Yes/No (see also True/False)? First, they polled people on Twitter. 66% of respondents thought that going from a 7-point to 2-point scale would decrease reliability on a Big Five personality measure; 71% thought that move would decrease validity. But then things got interesting:
Before I could dig into my data vault, M. Brent Donnellan (MBD) popped up on the twitter thread and forwarded amazingly ideal data for putting the scale option question to the test. He’d collected a lot of data varying the number of scale options from 7 points all the way down to 2 points using the BFI2. He also asked a few questions that could be used as interesting criterion-related validity tests including gender, self-esteem, life satisfaction and age. The sample consisted of folks from a Qualtrics panel with approximately 215 people per group.

Here are the average internal consistencies (i.e., coefficient alphas) for 2-point (Agree/Disagree), 3-point, 5-point, and 7-point scales:

And here's what they found in terms of validity evidence - the correlation between the BFI2 and another Big Five measure, the Mini-IPIP:


FYI, when I'm examining item independence in scales I'm creating or supporting, I often use 0.7 as a cut-off - that is, items that correlate at 0.7 or higher (meaning 49% shared variance) are essentially measuring the same thing and violate the assumption of independence. The fact that all of the traits but Agreeableness correlate at or above 0.7 is pretty strong evidence that the scales, regardless of number of response options, are measuring the same thing.

The post includes a discussion of these issues by personality researchers, and includes some interesting information not just on number of response options, but also on the Big Five personality traits.

Thursday, March 7, 2019

Time to Blog More

My blogging has been pretty much non-existent this year. Without getting too personal, I've been going through some pretty major life changes, and it's been difficult to focus on a variety of things, especially writing. As I work through this big transition, I'm thinking about what things I want to make time for and what things I should step away from.

Writing - especially about science, statistics, and psychometrics - remains very important to me. So I'm going to keep working to get back into some good blogging habits. Statistics Sunday posts may remain sporadic for a bit longer, but look for more statistics-themed posts very soon because...


That's right, it's time to sign up for the April A to Z blogging challenge! I'll officially announce my theme later this month, but for now I promise it will be stats-related.

Thursday, January 24, 2019

Long Time, No Write

Wow, it's been way too long since I've posted anything! Lots of life changes recently, including moving to a new place and fighting with Comcast to get internet there. I still need to set up my office, and I plan on doing lots of writing in my new dedicated space. I'm planning on more statistics posts and a few more surprises this year.

Work has also been busy. At the moment:

  • I'm working on three content validation studies, including analyzing data for two job analysis surveys and gearing up for a third
  • I've wrapped up the first phase of analysis on our salary and satisfaction survey, and have some cool analysis planned for phase 2
  • I finished a time study on our national and state exams
  • I'm awaiting feedback on the first draft of a chapter about standard setting I coauthored with some coworkers
  • I'm learning how to be a supervisor, now that I have someone working for me! That's right, I'm no longer a department of one
Once I get through some of the most pressing project work, I'm going to take some of my work time to teach myself data forensics as it applies to testing. In fact, this book has been on my to-read shelf since my annual employee evaluation back in November. Look for blog posts on that!

Sunday, October 14, 2018

Statistics Sunday: Some Psychometric Tricks in R

It's been a long time since I've posted a Statistics Sunday post! Now that I'm moved out of my apartment and into my house, I have a bit more time on my hands, but work has been quite busy. Today, I'm preparing for 2 upcoming standard-setting studies by drawing a sample of items from 2 of our exams. So I thought I'd share what I'm up to in order to pass on some of these new psychometric tricks I've learned to help me with this project.

Because I can't share data from our item banks, I'll generate a fake dataset to use in my demonstration. For the exams I'm using for my upcoming standard setting, I want to draw a large sample of items, stratified by both item difficulty (so that I have a range of items across the Rasch difficulties) and item domain (the topic from the exam outline that is assessed by that item). Let's pretend I have an exam with 3 domains, and a bank of 600 items. I can generate that data like this:

domain1 <- data.frame(domain = 1, b = sort(rnorm(200)))
domain2 <- data.frame(domain = 2, b = sort(rnorm(200)))
domain3 <- data.frame(domain = 3, b = sort(rnorm(200)))

The variable domain is the domain label, and b is the item difficulty. I decided to sort that variable within each dataset so I can easily see that it goes across a range of difficulties, both positive and negative.

head(domain1)
##   domain         b
## 1      1 -2.599194
## 2      1 -2.130286
## 3      1 -2.041127
## 4      1 -1.990036
## 5      1 -1.811251
## 6      1 -1.745899
tail(domain1)
##     domain        b
## 195      1 1.934733
## 196      1 1.953235
## 197      1 2.108284
## 198      1 2.357364
## 199      1 2.384353
## 200      1 2.699168

If I desire, I can easily combine these 3 datasets into 1:

item_difficulties <- rbind(domain1, domain2, domain3)

I can also easily visualize my item difficulties, by domain, as a group of histograms using ggplot2:

library(tidyverse)
item_difficulties %>%
  ggplot(aes(b)) +
  geom_histogram(show.legend = FALSE) +
  labs(x = "Item Difficulty", y = "Number of Items") +
  facet_wrap(~domain, ncol = 1, scales = "free") +
  theme_classic()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Now, let's say I want to draw 100 items from my item bank, and I want them to be stratified by difficulty and by domain. I'd like my sample to range across the potential item difficulties fairly equally, but I want my sample of items to be weighted by the percentages from the exam outline. That is, let's say I have an outline that says for each exam: 24% of items should come from domain 1, 48% from domain 2, and 28% from domain 3. So I want to draw 24 from domain1, 48 from domain2, and 28 from domain3. Drawing such a random sample is pretty easy, but I also want to make sure I get items that are very easy, very hard, and all the levels in between.

I'll be honest: I had trouble figuring out the best way to do this with a continuous variable. Instead, I decided to classify items by quartile, then drew an equal number of items from each quartile.

To categorize by quartile, I used the following code:

domain1 <- within(domain1, quartile <- as.integer(cut(b, quantile(b, probs = 0:4/4), include.lowest = TRUE)))

The code uses the quantile command, which you may remember from my post on quantile regression. The nice thing about using quantiles is that I can define them however I wish. So I didn't have to divide my items into quartiles (4 groups); I could have divided them up into more or fewer groups as I saw fit. To aid in drawing samples across domains of varying percentages, I'd probably want to pick a number of groups that divides evenly into each domain's sample size. In this case, I purposefully designed the outline so that each domain's count (24, 48, and 28) is a multiple of 4.

To draw my sample, I'll use the sampling library (which you'll want to install with install.packages("sampling") if you've never done so before), and the strata function.

library(sampling)
domain1_samp <- strata(domain1, "quartile", size = rep(6, 4), method = "srswor")

The resulting data frame has 4 variables - the quartile value (since that was used for stratification), the ID_unit (row number from the original dataset), probability of being selected (in this case equal, since I requested equally-sized strata), and stratum number. So I would want to merge my item difficulties into this dataset, as well as any identifiers I have so that I can pull the correct items. (For the time being, we'll just pretend row number is the identifier, though this is likely not the case for large item banks.)

domain1$ID_unit <- as.numeric(row.names(domain1))
domain1_samp <- domain1_samp %>%
  left_join(domain1, by = "ID_unit")
qplot(domain1_samp$b)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

For my upcoming study, my sampling technique is a bit more nuanced, but this gives a nice starting point and introduction to what I'm doing.

Sunday, September 9, 2018

Statistics Sunday: What is Standard Setting?

In a past post, I talked about content validation studies, a big part of my job. Today, I'm going to give a quick overview of standard setting, another big part of my job, and an important step in many testing applications.

In any kind of ability testing application, items will be written with identified correct and incorrect answers. This means you can generate overall scores for your examinees, whether the raw score is simply the number of correct answers or generated with some kind of item response theory/Rasch model. But what isn't necessarily obvious is how to use those scores to categorize candidates and, in credentialing and similar applications, who should pass and who should fail.

This is the purpose of standard setting: to identify cut scores for different categories, such as pass/fail, basic/proficient/advanced, and so on.

There are many different methods for conducting standard setting. Overall, approaches can be thought of as item-based or holistic/test-based.

For item-based methods, standard setting committee members go through each item and categorize it in some way (the precise way depends on which method is being used). For instance, they may categorize it as basic, proficient, or advanced, or they may generate the likelihood that a minimally qualified candidate (i.e., the person who should pass) would get it right.

For holistic/test-based methods, committee members make decisions about cut scores within the context of the whole test. Holistic/test-based methods still require review of the entire exam, but don't require individual judgments about each item. For instance, committee members may have a booklet containing all items in order of difficulty (based on pretest data), and place a bookmark at the item that reflects the transition from proficient to advanced or from fail to pass.
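The item-based approach described above is essentially the Angoff method, and a toy calculation - with made-up ratings from three committee members on a four-item test - might look like this:

# Each rating is the probability that a minimally qualified candidate answers
# the item correctly; the cut score is the sum of the item-level mean ratings
ratings <- data.frame(
  item  = rep(1:4, each = 3),
  sme   = rep(1:3, times = 4),
  p_mqc = c(.90, .85, .80,  .60, .65, .70,  .40, .50, .45,  .75, .70, .80)
)

item_means <- tapply(ratings$p_mqc, ratings$item, mean)
cut_score  <- sum(item_means)   # expected raw score for a borderline candidate
cut_score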

The importance of standard setting comes down to defensibility. In licensure, for instance, failing a test may mean being unable to work in one's field at all. For this reason, definitions of who should pass and who should fail (in terms of knowledge, skills, and abilities) should be very strong and clearly tied to exam scores. And licensure and credentialing organizations are frequently required to prove, in a court of law, that their standards are fair, rigorously derived, and meaningful.

For my friends and readers in academic settings, this step may seem unnecessary. After all, you can easily categorize students into A, B, C, D, and F with the percentage of items correct. But that is simply a standard (that is, a pass/fail cut score of 60%) that was set at some point in the past and has been applied throughout academia.

I'm currently working on a chapter on standard setting with my boss and a coworker. And for anyone wanting to learn more about standard setting, two great books are Cizek and Bunch's Standard Setting and Zieky, Perie, and Livingston's Cut Scores.