Wednesday, February 28, 2018

Statistical Sins: Sensitive Items

Part of my job includes developing and fielding surveys, which we use to gather data that informs our exam efforts and even exam content. Survey design was a big part of my graduate and postdoctoral training, and surveys are a frequently used methodology in many research institutions. That's why it is so disheartening to watch the slow implosion of the Census Bureau under the Trump administration.

Now, the Bureau is talking about adding an item about citizenship to the Census - that is, an item asking a person whether they are a legal citizen of the US - which the former director calls "a tremendous risk."

You can say that again.

The explanation makes it at least sound like it is being suggested with good intentions:
In December, the Department of Justice sent a letter to the Census Bureau asking that it reinstate a question on citizenship to the 2020 census. “This data is critical to the Department’s enforcement of Section 2 of the Voting Rights Act and its important protections against racial discrimination in voting,” the department said in a letter. “To fully enforce those requirements, the Department needs a reliable calculation of the citizen voting-age population in localities where voting rights violations are alleged or suspected.”
But regardless of the reasoning behind it, this item is a bad idea. In surveys, this item is what we'd call a sensitive item - an item that relates to a behavior that is illegal or taboo. Some other examples would include questions about behaviors like drug use, abortion, or masturbation. People are less likely to answer these questions honestly, because of fear of legal action or stigma.

Obviously, we have data on some of these sensitive issues. How do we get it? There are some important controls that help:
  • Ensure that data collected is anonymous - that is, the person collecting the data (and anyone accessing the data) doesn't know who it comes from
  • If complete anonymity isn't possible, confidentiality is the next best thing - responses can't be linked back to the respondent by anyone outside the study team, and personal data are stored separately from responses
  • If the topic relates to illegal activity, additional protections (a Certificate of Confidentiality) may be necessary to prevent the data collection team from being forced to divulge information by subpoena 
  • Data collected through forms rather than an interview with a person might also lead to more honest responding, because there's less embarrassment writing something than saying it out loud; but remember, overall response rate drops with paper or online forms
The Census is confidential, not anonymous. Data are collected in person by an interviewer, and personally identifiable information is collected, though it is stripped out when the data are processed. And yes, there are rules and regulations about who has access to that data. Even if those protections hold and people who share that they are not legal citizens have no need to fear legal action, the issue really has to do with perception, and how that perception will affect the validity of the data collected.

When people are asked to share sensitive details that they don't want to share, for whatever reason, they'll do one of two things: 1) refuse to answer the question at all or 2) lie. Either way, you end up with junk data.

I'll be honest - I don't think the stated good intentions are the real reason for this item. We may disagree on how to handle people who are in this country illegally, but I think the issue we need to focus on here is that, methodologically, this item doesn't make sense and is going to fail. But because of the source and the government seal, the data are going to be perceived as reliable, with the full weight of the federal government behind them. That's problematic. Census data influence policies, funding decisions, and the distribution of other resources. If we cannot guarantee the reliability and validity of that data, we should not be collecting it.

Monday, February 26, 2018

Statistics Sunday (Late Edition): Exogenous vs. Endogenous Variables

I had a busy (but great) Sunday, and completely spaced on writing my Statistics Sunday post! But then, I could just say I posted on Sunday within a certain margin of error. (You've probably heard the one about the three statisticians trying to hit a target. One hits to the left of center, the other to the right. The third yells, "We got it!")

I'm planning on writing more posts on one of my favorite statistical techniques (or set of techniques): structural equation modeling. For today, I'm going to write about some terminology frequently used in SEM - exogenous and endogenous variables.

(Note, these terms are used in other contexts as well. My goal is to discuss how they're used in SEM specifically, as a set-up for future SEM posts.)

Whenever you put together a structural equation model, you're hypothesizing paths between variables. A path means that one variable influences (causes) another. In a measurement model, where observed variables are used to reflect an underlying (latent) construct, the path from the construct to each of the variables signifies that the construct influences/causes the values of the observed variables.
Created with the semPlot package using a lavaan dataset - look for a future blog post on this topic!
In a path model, a path means the same thing - that one variable causes the other - but the variables connected by the paths are usually the same kind of variable (observed or latent).
Created with the semPlot package using a lavaan dataset - look for a future blog post on this topic!
For instance, in the figure immediately above, Ind (short for Industrialization) causes both D60 and D65 (measures of democratization of nations in 1960 and 1965). D60, in turn, also causes D65. All 3 are latent variables, with observed variables being used to measure them. Let's ignore those observed (square) variables for now and just look at the 3 latent variables in the circles. Exogenous is the term used to refer to variables that cause other variables (and are not caused by any other variables). Endogenous refers to variables caused by other variables. So in the model just above, Ind is the only exogenous variable: it is caused by no other variables (in the context of the model) and causes 2 variables. Both D60 and D65 are endogenous variables: D60 is caused by 1 variable and D65 is caused by 2.

You may be wondering what we would call a variable that is caused by 1 or more variables, and in turn, causes 1 or more variables. In this terminology, we would still call them endogenous, but we might also use another term: mediator.
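
To see where these terms show up in practice, here's a minimal sketch of how a model like the one above might be specified. I'm assuming it's the classic industrialization-and-democracy example that ships with lavaan (the PoliticalDemocracy dataset); I've left out the residual covariances from the full textbook example so the structural part - one exogenous variable, two endogenous variables, one of them a mediator - is easy to see.

library(lavaan)    # includes the PoliticalDemocracy example dataset
library(semPlot)   # for drawing path diagrams like the figures above

model <- '
  # measurement model: each latent variable is reflected by observed indicators
  ind60 =~ x1 + x2 + x3
  dem60 =~ y1 + y2 + y3 + y4
  dem65 =~ y5 + y6 + y7 + y8

  # structural model: ind60 is exogenous; dem60 and dem65 are endogenous
  dem60 ~ ind60            # dem60 is caused by ind60
  dem65 ~ ind60 + dem60    # dem65 is caused by both, so dem60 is also a mediator
'

fit <- sem(model, data = PoliticalDemocracy)
summary(fit, standardized = TRUE)
semPaths(fit)    # produces a diagram much like the ones above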

Stay tuned for more SEM posts, where we'll start digging into the figures above and showing how it all works! I'm also gearing up for Blogging A to Z; look for a theme reveal on March 19. Spoiler alert: It will be stats related.

Getting Excited for Blogging A to Z!

It's almost March, meaning it's almost time to start thinking about a theme and schedule for Blogging A to Z! This will be my third year participating: last year, I blogged through the alphabet of statistics and the year before that, the alphabet of social psychology. And I have some fun ideas for this year.

The A to Z Challenge Blog is already sharing important dates and a survey for anyone interested in participating. Sign-up opens March 5, and the theme reveal is set for March 19.

Thursday, February 22, 2018

Statistical Sins: Overselling Automation

Yesterday, I blogged about a talk I attended at the ATP 2018 meeting, the topic of which was whether psychometricians could be replaced by AI. The consensus seemed to be that automation, where possible, is good. It frees up time for people to focus their energies on more demanding tasks, while farming out rule-based, repetitive tasks to various forms of technology. And there are many instances where automation is the best, most consistent way to achieve a desired outcome. At my current job, I inherited a process: score and enter results from a professional development program. Though the process of getting final scores and pass/fail status into our database was automated, the process to get there involved lots of clicking around: re-scoring variables, manually deleting columns, and so forth.

Following the process would take a few hours. Instead, after going through it the first time, I decided to devote half a day or so to automating the process. Yes, I spent more time writing the code and testing it than I would have if I'd just gone through the process itself. And that is presumably why it was never automated before now; the process, after all, only occurs once a month. But I'd happily take a one-time commitment of 4-5 hours over a once-a-month commitment of 3. The code has been written, fully tested, and updated. Today, I ran that process in about 15 minutes, squeezing it between two meetings.
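
To give a flavor of what that kind of automation looks like, here's a sketch in R. The file names, column prefixes, and cut score are hypothetical placeholders, not the actual process, but the shape is the same: read the export, drop the columns that used to be deleted by hand, re-score, and write out something the database can ingest.

library(readr)

results <- read_csv("program_results.csv")                    # hypothetical export file
results <- results[, !grepl("^admin_", names(results))]       # drop the columns previously deleted by hand
item_cols <- grepl("^item_", names(results))
results$total_score <- rowSums(results[, item_cols])          # re-score the item columns
results$status <- ifelse(results$total_score >= 70, "Pass", "Fail")   # hypothetical cut score

write_csv(results, "scored_results.csv")                      # ready to load into the database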

And there are certainly other ways we've automated testing processes for the better. Multiple speakers at the conference discussed the benefits of computer adaptive testing. Adaptive testing means that the precise set of items presented to an examinee is determined by the examinee's performance. If the examinee gets an item correct, they get a harder item; if incorrect, they get an easier item. Many cognitive ability tests - the currently accepted term for what were once called "intelligence tests" - are adaptive: the examiner selects a starting question based on the examinee's assumed ability, then moves forward (harder items) or backward (easier items) depending on the examinee's performance. This allows the examiner to pinpoint the examinee's ability in fewer items than a fixed-form exam would require.

While cognitive ability exams (like the Wechsler Adult Intelligence Scale) are still mostly presented as individually administered adaptive exams, test developers discovered they could use these same adaptive techniques on multiple choice exams. But you wouldn't want an examiner to sit down with each examinee and adapt their multiple choice exam; you can just have a computer do it for you. As many presenters stated during the conference, you can obtain accurate estimates of a person's ability in about half the items when using a computer adaptive test (CAT).

But CAT isn't a great solution to every testing problem, and this was one thing I found frustrating: some presenters complained that CAT wasn't being utilized as much as it could be, and speculated this was due to discomfort with the technology rather than a thoughtful, conscious decision not to use CAT. That's a very important distinction, and I suspect that far more often, test developers choose paper-and-pencil over CAT because it's the better option in their situation.

Like I said, the way CAT works is that the next item administered is determined by examinee performance on the previous item. The computer will usually start with an item of moderate difficulty. If the examinee is correct, they get a slightly harder item; if incorrect, a slightly easier item. Score on the exam is determined by the difficulty of items the examinee answered correctly. This means you need to have items across a wide range of abilities.
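
To make that logic concrete, here's a toy sketch of the selection loop - not a real CAT engine (operational CATs estimate ability with an item response theory model rather than just stepping a difficulty target up and down, and they layer on content and exposure constraints), but it shows the adapt-on-each-response idea:

# Toy item bank: 20 items spread across a range of difficulties
item_bank <- data.frame(id = 1:20, difficulty = seq(-2, 2, length.out = 20))

administer_cat <- function(answer_item, n_items = 5) {
  target <- 0                  # start with an item of moderate difficulty
  administered <- integer(0)
  for (i in seq_len(n_items)) {
    available <- item_bank[!item_bank$id %in% administered, ]
    next_item <- available[which.min(abs(available$difficulty - target)), ]   # closest unused item
    administered <- c(administered, next_item$id)
    correct <- answer_item(next_item$difficulty)    # caller supplies the examinee's response
    target <- target + if (correct) 0.5 else -0.5   # harder after a correct answer, easier after an incorrect one
  }
  item_bank[item_bank$id %in% administered, ]
}

# Example: an examinee who answers correctly whenever the item is easier than their ability of 0.7
administer_cat(function(difficulty) difficulty < 0.7)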

"Okay," you might say, "that's not too hard."

You also need to make sure you have items covering all topics from the exam.

At a wide range of difficulties.

And drawn at random from a pool, since you don't want everyone of a certain ability level to get the exact same items; you want to limit how much individual items are exposed to help deter cheating.

This means your item pool has to be large - potentially thousands of items - and you'll want to roll items in and out as they get out-of-date or over-exposed. This isn't always possible, especially for smaller test development outfits or newer exams. At my current job, all of our exams are computer-administered, but only about half of them are CAT. While it's a goal to make all exams CAT, some of our item banks just aren't large enough yet, and it's going to take a long time and a lot of work to get there.

Of course, there's also the cost of setting up CAT - there are obvious equipment needs, and securing a CAT environment (i.e., preventing cheating) requires attention to different factors than securing a paper-and-pencil testing environment. All of that costs money, which on its own might be prohibitive for some organizations.

Automation is good and useful, but it can't always be done. Just because something works well - and better than the alternative - in theory doesn't mean it can always be applied in practice. Context matters.


Wednesday, February 21, 2018

From the Desk of a Psychometrician (Travel Edition): Can AI Replace Us?

I'm in San Antonio at the moment, attending the Association of Test Publishers 2018 Innovations in Testing Meeting. This morning, I attended a session on a topic I've blogged about tangentially before - can AI replace statisticians? The session today was about whether AI could replace psychometricians.

The session had four speakers: one who felt AI could (should?) replace psychometricians, one who argued why it should not, and two who discussed which applications might be good fits for AI and which would still need a human. In the session, the speakers differentiated between automation (which is rules-based) and machine learning (teaching a computer to make context-dependent decisions), or to demonstrate with XKCD:


One involves an automated process of matching the user's location to maps of national parks; the other involves a decision - one that is quite easy for a human - about whether the photo contains a bird. As technology has developed, psychometricians have been able to automate a variety of processes, such as using a computer to adapt a test based on examinee performance or creating parallel forms of a test - something that originally had to be done by a person, but now can be easily done by a computer with the right inputs.

The question of the session was whether psychometricians should move on to using machine learning - or be replaced by it. Many psychometricians will tell you there is an art and a science to what we do. We have guidelines we follow on, for instance, item analysis - expectations about how an item functions across various groups or patterns of responding, or the ability of the item to differentiate between low and high performers (which we call discrimination, but not in the negative, prejudicial sense of the word). But those guidelines are just one thing we use. We may choose to keep an item with lackluster discrimination if it covers an important topic for the exam. We may drop an item that's performing well because it's been in the exam pool for a while and exposure to candidates is high. And so on.
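
As a small example of the rules-based part, a classical discrimination statistic is easy to compute; it's the judgment about what to do with a borderline item that's hard to hand to a machine. Here's a sketch of a corrected item-total correlation, one common way to quantify discrimination, using simulated 0/1 responses:

# Simulated responses: 200 examinees by 10 dichotomously scored items
set.seed(2018)
responses <- matrix(rbinom(200 * 10, size = 1, prob = 0.6), nrow = 200, ncol = 10)

# Corrected item-total correlation: correlate each item with the total score
# computed from the *other* items, so the item isn't correlated with itself
discrimination <- sapply(seq_len(ncol(responses)), function(i) {
  cor(responses[, i], rowSums(responses[, -i]))
})
round(discrimination, 2)   # a common rule of thumb flags items below about 0.2 for review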

The take-home message of the session is that automation is good - but replacing us with machine learning is problematic, because it's difficult to quantify what psychometricians do. For machine learning to take over any process, a human needs to delineate the process and outcomes; otherwise, the computer has no outcome to target. So even if a computer could take over psychometricians' job duties, it needs information to predict toward.

As machine learning and computer algorithms improve, more of these problems can be handed off to machines, and perhaps using machine learning for more aspects of psychometrics would allow us to focus our energy on other advances. But to get there, these topics need to be understood by humans first.

Sunday, February 18, 2018

Statistics Sunday: What Are Residuals?

Recently, I wrote two posts about quantile regression - find them here and here. In the first post, I talked about an assumption of linear regression: homoscedasticity, "the variance of scores at one point on the line should be equal to the distribution of scores at other points along the line."


What I was talking about here has to do with the difference between the predicted value (which falls on the regression line) and the observed value (the points, which may or may not fall along the line). This difference is called a residual:

Residual = Observed - Predicted

There is one residual for each case used in your regression, and both the sum and mean of the residuals will be 0. In standard linear regression, which is known as ordinary least squares regression, the goal is to find a line that minimizes the residuals - some residuals will be positive and some will be negative. Remember, when you're dealing with deviations from some measure of central tendency, you need to square them, or else they add up to 0. Your regression line, then, is one that minimizes these squared residuals - least squares.
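
Here's a quick check with a built-in R dataset (mtcars, purely for illustration): the residuals are the observed values minus the fitted values, they sum - and average - to essentially zero, and the sum of their squares is the quantity ordinary least squares minimizes.

fit <- lm(mpg ~ wt, data = mtcars)            # mileage predicted from car weight

res <- mtcars$mpg - fitted(fit)               # residual = observed - predicted
all.equal(unname(res), unname(resid(fit)))    # identical to what resid() returns
sum(res)        # ~0, up to floating point error
mean(res)       # also ~0
sum(res^2)      # the sum of squared residuals that OLS minimizes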

But, you might ask, why does variance need to be consistent across the regression line? Sure, you want a line that minimizes your residuals, but obviously, your residuals are going to vary to some degree. 

When I first introduced linear regression to you, I gave this basic linear equation you've probably encountered before:

y = bx + a

where b is the slope (measure of the effect of x on y), and a is the constant (the predicted value of y when x=0). But I left out one more term you'd find in linear regression - the error term:

y = bx + a + e

For this error term to be valid, there has to be consistent error - you don't want the error to be wildly different for some points than others. Otherwise, you can't model it (not with a single term, anyway).
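
As a quick illustration, here's a tiny simulation of what a well-behaved (homoscedastic) error term looks like. The slope, intercept, sample size, and error SD are arbitrary; the point is that the noise has the same spread everywhere along x, so a single error term can model it:

set.seed(42)
x <- rnorm(200)
y <- 2 * x + 5 + rnorm(200, mean = 0, sd = 1)   # y = bx + a + e, with constant error variance
coef(lm(y ~ x))     # the estimates land close to a = 5 and b = 2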

When you conduct a linear regression, you should plot your residuals. This is easy to do in R. Let's use the same dataset I used for the quantile regression example, and we'll conduct a linear regression with these data (even though we're violating a key assumption - this is just to demonstrate the procedure). Specify the linear regression using lm, followed by the model outcome~predictor(s):

library(quantreg)      # provides the engel food expenditure dataset
library(ggplot2)       # for plotting
options(scipen = 999)  # turn off scientific notation in the output
data(engel)            # load the engel dataset
model <- lm(foodexp ~ income, data = engel)
summary(model)

That gives me the following output:


I can reference my residuals with resid(model_name):

# plot the residuals against observed food expenditure
ggplot(engel, aes(x = foodexp, y = resid(model))) + geom_point()


You'll notice a lot of the residuals cluster around 0, meaning the predicted values are close to the observed values. But for higher values of y, the residuals are also larger, so this plot once again demonstrates that ordinary least squares regression isn't suited for this dataset. Yet another reason why it's important to look at your residuals.

I'm out of town at the moment, attending the Association of Test Publishers Innovations in Testing meeting. Hopefully I'll have some things to share from the conference over the next couple days. At the very least, it's a lot warmer here than it is in Chicago.