Thursday, February 22, 2018

Statistical Sins: Overselling Automation

Yesterday, I blogged about a talk I attended at the ATP 2018 meeting, the topic of which was whether psychometricians could be replaced by AI. The consensus seemed to be that automation, where possible, is good. It frees up time for people to focus their energies on more demanding tasks, while farming out rule-based, repetitive tasks to various forms of technology. And there are many instances where automation is the best, most consistent way to achieve a desired outcome. At my current job, I inherited a process: score and enter results from a professional development program. Though the process of getting final scores and pass/fail status into our database was automated, the process to get there involved lots of clicking around: re-scoring variables, manually deleting columns, and so forth.

Following the process would take a few hours. Instead, after going through it the first time, I decided to devote half a day or so to automating it. Yes, I spent more time writing and testing the code than I would have spent just following the process. And that is presumably why it was never automated before now; the process, after all, only occurs once a month. But I'd happily trade a once-a-month commitment of 3 hours for a one-time commitment of 4-5. The code has been written, fully tested, and updated. Today, I ran that process in about 15 minutes, squeezing it between two meetings.
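The kind of clean-up I automated can be sketched in a few lines. This is a toy example, not our actual code - the column names, the Y/N response coding, and the 70% cutoff are all invented for illustration:

```python
# A minimal sketch of automating a monthly scoring task: recode raw
# responses, drop bookkeeping columns, and flag pass/fail - instead of
# clicking around a spreadsheet by hand. All names and the cutoff are
# hypothetical.
import csv
import io

RAW = """name,item1,item2,item3,admin_notes
Alice,Y,Y,Y,ok
Bob,N,Y,N,late
"""

def score_rows(raw_csv, cutoff=0.7):
    rows = list(csv.DictReader(io.StringIO(raw_csv)))
    results = []
    for row in rows:
        items = [k for k in row if k.startswith("item")]
        # Recode Y/N responses to 1/0 instead of hand-editing cells
        score = sum(1 for k in items if row[k] == "Y") / len(items)
        # Keep only the fields the database import needs
        results.append({"name": row["name"],
                        "score": round(score, 2),
                        "status": "pass" if score >= cutoff else "fail"})
    return results

for r in score_rows(RAW):
    print(r)
```

Once a script like this exists, the monthly "process" collapses into running one command and spot-checking the output.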

And there are certainly other ways we've automated testing processes for the better. Multiple speakers at the conference discussed the benefits of computer adaptive testing. Adaptive testing means that the precise set of items presented to an examinee is determined by the examinee's performance. If the examinee gets an item correct, they get a harder item; if incorrect, they get an easier item. Many cognitive ability tests - the currently accepted term for what were once called "intelligence tests" - are adaptive, and the examiner selects a starting question based on assumed examinee ability, then moves forward (harder items) or backward (easier items) depending on the examinee's performance. This allows the examiner to pinpoint the examinee's ability in fewer items than a fixed-form exam requires.

While cognitive ability exams (like the Wechsler Adult Intelligence Scale) are still mostly presented as individually-administered adaptive exams, test developers discovered they could use these same adaptive techniques on multiple choice exams. But you wouldn't want an examiner to sit down with each examinee and adapt their multiple choice exam; you can just have a computer do it for you. As many presenters stated during the conference, you can obtain accurate estimates of a person's ability in about half the items when using a computer adaptive test (CAT).

But CAT isn't a great solution to every testing problem, and this was one thing I found frustrating: some presenters complained that CAT wasn't being utilized as much as it could be. They speculated this was due to discomfort with the technology, rather than a thoughtful, conscious decision not to use CAT. That's an important distinction, and I suspect that far more often, test developers choose paper-and-pencil over CAT because it's the better option in their situation.

Like I said, the way CAT works is that the next item administered is determined by examinee performance on the previous item. The computer will usually start with an item of moderate difficulty. If the examinee is correct, they get a slightly harder item; if incorrect, a slightly easier item. The score on the exam is determined by the difficulty of the items the examinee answered correctly. This means you need to have items across a wide range of abilities.
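The adaptive loop just described can be sketched as a toy simulation. To be clear, the item pool, step size, candidate window, and crude scoring rule below are all invented for illustration - a real CAT selects and scores items using item response theory, not a simple staircase:

```python
# Toy sketch of an adaptive loop: start at moderate difficulty, move up
# after a correct answer, move down after an incorrect one, and choose at
# random among nearby-difficulty items to limit item exposure.
import random

def next_item(pool, target, used, window=0.5):
    """Pick a random unused item whose difficulty is near the target."""
    candidates = [i for i in pool
                  if i["id"] not in used
                  and abs(i["difficulty"] - target) <= window]
    return random.choice(candidates) if candidates else None

def run_cat(pool, answers, n_items=5, start=0.0, step=0.5):
    """answers(item) -> True/False. Returns a crude ability estimate:
    the difficulty of the hardest item answered correctly."""
    target, used, correct = start, set(), []
    for _ in range(n_items):
        item = next_item(pool, target, used)
        if item is None:          # pool exhausted near this difficulty
            break
        used.add(item["id"])
        if answers(item):
            correct.append(item["difficulty"])
            target += step        # harder next
        else:
            target -= step        # easier next
    return max(correct) if correct else None

# Pool of 13 items with difficulties from -3.0 to 3.0
pool = [{"id": n, "difficulty": d}
        for n, d in enumerate(x * 0.5 for x in range(-6, 7))]

# Simulate an examinee who answers correctly whenever difficulty <= 1.0
est = run_cat(pool, lambda item: item["difficulty"] <= 1.0)
print(est)
```

Even this toy version shows why the item pool matters so much: the loop only works if there are unused items near every difficulty level it wanders into.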

"Okay," you might say, "that's not too hard."

You also need to make sure you have items covering all topics from the exam.

At a wide range of difficulties.

And drawn at random from a pool, since you don't want everyone of a certain ability level to get the exact same items; you want to limit how much individual items are exposed to help deter cheating.

This means your item pool has to be large - potentially thousands of items - and you'll want to roll items in and out as they become out-of-date or over-exposed. This isn't always possible, especially for smaller test development outfits or newer exams. At my current job, all of our exams are computer-administered, but only about half of them are CAT. While it's a goal to make all exams CAT, some of our item banks just aren't large enough yet, and it will take a long time and a lot of work to get there.

Of course, there's also the cost of setting up CAT - there are obvious equipment needs, and securing a CAT environment (i.e., preventing cheating) requires attention to different factors than securing a paper-and-pencil testing environment. All of that costs money, which on its own might be prohibitive for some organizations.

Automation is good and useful, but it can't always be done. Just because something works well - and better than the alternative - in theory doesn't mean it can always be applied in practice. Context matters.


  1. In CAT, as in other domains of automation, it seems to come down to a calculation of cost and yield. While I, too, would personally prefer 5 hours of coding to 2 hours of repetitive work, and have automated a great many everyday tasks, the cost of a thorough CAT seems very high for many testing situations.

    You not only have to devise thousands of questions of varying difficulty and area of expertise; you also have to pretest them and adjust and monitor the CAT environment to make sensible choices. The difficulty of each item has to be determined objectively in surveys (OK, AmTurk may help here) and continuously monitored by the system. As soon as an item gets overexposed through mass-media publication or viral dissemination, its difficulty - and that of similar items - drops.
    IMO, the high cost of setting up and monitoring the system is only justified for tests that are intensively used for months or years and would strongly benefit from shorter questionnaires (such as widely used tests of ability in children, or in persons who have trouble concentrating through a 20-page questionnaire).
    For other domains, I prefer a standardized test for all participants where everyone gets the same items in roughly the same order for the sake of comparability and reliability.

    1. Yes, Martin, you're absolutely right that items must be pretested. However, this is true regardless of how you're administering your exam or whether it is an adaptive or nonadaptive test. In the credentialing world where I work, or any high-stakes testing situation, this step is essential and must be done rigorously. We have to justify everything that we do, not only for auditors from accreditation agencies, but also internally to protect ourselves from lawsuits.

      There are different ways to approach this issue of pretesting, of course, but as with many methodologies, some approaches are more justifiable than others. At my organization, we include pretesting items with each exam administration. When our item writing committee convenes, they review the item statistics from pretest items and determine which ones should become operational (i.e., added to the item bank or fixed form exam).

      There are so many nuances in psychometrics - it's so difficult to fit all of it in, without blog posts approaching novel-length. Thanks so much for reading and sharing your thoughts!