## Sunday, December 3, 2017

### Statistics Sunday: Practice Effects and Modern Testing Approaches

In their ground-breaking book, Cook and Campbell introduced us to threats to validity. Remember that validity refers to truth and comes in four flavors: internal, external, construct, and statistical conclusion. You can learn more about validity at the link above, but as a brief refresher:

• Internal validity - The effects (dependent variable) observed in the study are caused by the independent variable. Maximizing internal validity means isolating these two variables, so that you can show a true causal relationship between them.
• External validity - The findings of the study can be generalized. The more control you have over the situation, the higher your internal validity; but this results in lower external validity, because it's difficult to generalize from a highly controlled environment to a less controlled environment.
• Construct validity - The variables measured in the study actually represent the underlying constructs. We can't hold a tape measure up to your brain to find out your cognitive ability; we have to give you a measure of cognitive ability, which may or may not truly measure the underlying construct.
• Statistical conclusion validity - The statistical analyses used to draw conclusions have been correctly applied and interpreted. If, for instance, you don't quite meet the assumptions of a test, you weaken your statistical conclusion validity. (That doesn't mean your findings aren't true, just that the probability that they're true is lower than if you fully met the assumptions.)
It would be impossible to design a study that maximizes all four types of validity. You could probably maximize a couple of them at once, but internal and external validity, for instance, involve trade-offs. And any methodological approach you take is going to impact one or more types of validity. These aspects that decrease a type of validity are called threats to validity.

Usually, what we want from a study is to establish a causal relationship between two things. At least, those of us from fields that focus on experimentation (where independent variables can be manipulated) are interested in establishing causal relationships. We have different methods for trying to establish cause. One is to take a sample of people, randomly assign them to some level of our independent variable (IV), and measure the effect of the IV on the dependent variable.
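As a rough illustration of random assignment, here is a minimal Python sketch. The function name `randomly_assign`, the participant IDs, and the condition labels are all made up for this example; the idea is just to shuffle the sample and deal people into groups so that each condition ends up with about the same number of participants.

```python
import random

def randomly_assign(participants, conditions, seed=None):
    """Shuffle participants, then deal them into conditions round-robin."""
    rng = random.Random(seed)  # seed is optional; it only makes the demo repeatable
    shuffled = list(participants)
    rng.shuffle(shuffled)
    groups = {condition: [] for condition in conditions}
    for index, person in enumerate(shuffled):
        groups[conditions[index % len(conditions)]].append(person)
    return groups

# Hypothetical sample of 12 participants and two levels of the IV.
participants = [f"P{n:02d}" for n in range(1, 13)]
groups = randomly_assign(participants, ["control", "treatment"], seed=42)
for condition, members in groups.items():
    print(condition, members)
```

Dealing round-robin after the shuffle keeps group sizes balanced, which is one common choice; assigning each person by an independent coin flip is also random assignment, but can leave the groups unequal in size.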

The problem with this approach, of course, is that the people in the two or more experimental groups are different from each other. We do many things to try to equalize groups, but we can never truly know our groups are equivalent.

So another way to handle this is to have one group of people, deliver the different levels of the IV in a randomized order, and measure the dependent variable after each level. The problem with this approach is that we could have carryover effects: we don't know whether the score we observe on the dependent variable is due to the intervention participants just received or the one they received before that. We can't wipe a person's memory between each segment.
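To make the randomized-order idea concrete, here is a small Python sketch. The names (`randomized_orders`, the participant IDs, the three level labels) are invented for illustration. Each person receives every level of the IV, but in an independently shuffled sequence.

```python
import random

def randomized_orders(participants, levels, seed=None):
    """Give every participant all levels of the IV in their own random order."""
    rng = random.Random(seed)  # seed only makes the demo repeatable
    orders = {}
    for person in participants:
        order = list(levels)   # copy so the original list is untouched
        rng.shuffle(order)
        orders[person] = order
    return orders

# Hypothetical within-subjects design with three levels of the IV.
orders = randomized_orders(["P01", "P02", "P03", "P04"],
                           ["low", "medium", "high"], seed=7)
for person, order in orders.items():
    print(person, order)
```

Randomizing the order spreads carryover across the levels rather than removing it, which is why it remains a concern with this design.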

In fact, any time you expose a person to the same measure more than once, you're going to see differences in scores due to practice. That's a problem if your intervention is meant to improve performance, because simply being exposed to a measure tends to produce improvements regardless of whether the person received any intervention.

However, there could be a way around this using modern testing approaches, specifically computer adaptive testing (CAT). I'm planning on writing a longer post describing how CATs work, but the short answer is that CATs determine the next item based on your response to the previous item. If you get the item correct, you get a more difficult item. If you get the item incorrect, you get an easier item.
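The "harder after a correct answer, easier after an incorrect one" rule can be sketched in a few lines of Python. This is a deliberately simplified staircase, not how any particular CAT is implemented: the 1 to 10 difficulty scale, the starting level of 5, the one-step moves, and the function names are all assumptions made for illustration.

```python
def next_difficulty(current, was_correct, lowest=1, highest=10):
    """Move one step harder after a correct answer, one step easier otherwise."""
    step = 1 if was_correct else -1
    # Clamp so we never leave the (assumed) 1-10 difficulty scale.
    return max(lowest, min(highest, current + step))

def run_session(responses, start=5):
    """Return the difficulty level administered at each step of a session."""
    administered = []
    level = start
    for was_correct in responses:
        administered.append(level)
        level = next_difficulty(level, was_correct)
    return administered

# Hypothetical test-taker: correct, correct, incorrect, correct, incorrect, incorrect.
print(run_session([True, True, False, True, False, False]))
# prints [5, 6, 7, 6, 7, 6]
```

Notice how the administered difficulty settles into a narrow band around the level where this test-taker gets about half the items right.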

CATs also use large item banks, so the easy item you receive might be totally different from the easy item I receive, even if they are at the same level of difficulty. What this means is that you're highly unlikely to see the same item twice. And if your ability goes up (or down) between administrations, you're even less likely to see the same item twice, because you'll be drawing from a different part of the bank. That's not to say you won't still observe practice effects, but CAT helps reduce them.
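Here is a minimal sketch of drawing from an item bank, assuming a made-up bank with 50 items at each of 10 difficulty levels. The `draw_item` function and the item naming scheme are hypothetical, and tracking each person's already-seen items is my own addition for the example; it goes one step further than a large bank alone, which only makes repeats unlikely.

```python
import random

def draw_item(bank, difficulty, seen, rng):
    """Pick an unseen item at the requested difficulty, or None if none remain."""
    candidates = [item for item in bank[difficulty] if item not in seen]
    if not candidates:
        return None
    item = rng.choice(candidates)
    seen.add(item)  # remember it so this person never gets it again
    return item

# Hypothetical bank: 50 items at each of 10 difficulty levels.
bank = {level: [f"item-{level}-{n}" for n in range(50)] for level in range(1, 11)}

rng = random.Random(0)  # seeded only so the demo is repeatable
seen_by_you, seen_by_me = set(), set()
print("You get:", draw_item(bank, 3, seen_by_you, rng))
print("I get:  ", draw_item(bank, 3, seen_by_me, rng))
```

With 50 candidates per level, two people at the same difficulty will usually get different items, and a person retested at a different ability level draws from a different part of the bank altogether.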

Obviously, CAT can't be used for everything, and using it has several requirements: access to computers (which may limit how many people you can test at once, depending on available resources), the ability to install programs on those computers to deliver CATs, and a large bank of items. But I imagine that, over time, as CAT becomes more and more common, we're liable to see more studies using it.

What has been your experience with computer adaptive testing? Or practice effects?