Wednesday, February 7, 2018

Statistical Sins: Olympic Figure Skating and Biased Judges

The 2018 Winter Olympics are almost here! And, of course, everyone is already talking about the event that has me as mesmerized as gymnastics does in the Summer Olympics - figure skating.

Full confession: I love figure skating. (BTW, if you haven't yet seen I, Tonya, you really should. If for no other reason than Margot Robbie and Allison Janney.)

In fact, it seems everyone loves figure skating, so much that the sport is full of drama and scandals. And with the Winter Olympics almost here, people are already talking about the potential for biased judges.

We've long known that ratings from people are prone to biases. Some people are more lenient while others are more strict. We recognize that even with clear instructions on ratings, there is going to be bias. This is why in research we measure things like interrater reliability, and work to improve it when there are discrepancies between raters.

And if you've peeked at the current International Skating Union (ISU) Judging System, you'll note that the instructions are quite complex. They say the complexity is designed to prevent bias, but when one has to put so much cognitive effort into understanding something so complex, they have less cognitive energy to suppress things like bias. (That's right, this is a self-regulation and thought suppression issue - you only have so many cognitive resources to go around, and anything that monopolizes them will leave an opening for bias.)

Bias in terms of overall leniency and severity is not the real issue, though. If one judge tends to be harsher and another tends to be more lenient, those tendencies should wash out thanks to averaging. (In fact, the total score is a trimmed mean, meaning they throw out the highest and lowest scores. A single very lenient judge and a single very harsh judge will then have little or no impact on a person's score.) The problem is when the bias emerges for certain skaters but not others.
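To make the trimming idea concrete, here is a minimal Python sketch. The function name and the judge scores are made up for illustration, and this is only the "drop the highest and lowest" step, not the ISU's full scoring computation.

```python
def trimmed_mean(scores):
    """Average the scores after dropping one highest and one lowest value."""
    if len(scores) < 3:
        raise ValueError("need at least three scores to trim both ends")
    ordered = sorted(scores)
    kept = ordered[1:-1]  # discard the single lowest and single highest score
    return sum(kept) / len(kept)


# Made-up scores from a hypothetical nine-judge panel for one skater.
consistent_panel = [2, 2, 2, 2, 2, 2, 2, 2, 2]
# Same panel, but one judge is very harsh (-3) and one is very lenient (+3).
outlier_panel = [-3, 2, 2, 2, 2, 2, 2, 2, 3]

print(trimmed_mean(consistent_panel))
print(trimmed_mean(outlier_panel))
print(sum(outlier_panel) / len(outlier_panel))  # plain mean, for comparison
```

In this toy panel both trimmed means come out to 2.0, while the plain mean of the second panel is pulled down to about 1.56 by the harsh judge. Trimming handles one extreme judge at each end; it does nothing about a bias that shows up only for particular skaters.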

At the 2014 Winter Olympics, the favorite to win was Yuna Kim of South Korea, who won the gold at the 2010 Winter Olympics. She skated beautifully; you can watch here. But she didn't win the gold; she won the silver. The gold went to Adelina Sotnikova of Russia (watch her routine here). The controversy is that, after her routine, Sotnikova was greeted and hugged by the Russian judge. This was viewed by others as a clear sign of bias, and South Korea complained to the ISU. (The complaints were rejected, and the medals stood as awarded. After all, a single biased judge wouldn't have gotten Sotnikova such a high score; she had to have high scores across most, if not all, judges.) A researcher interviewed for NBC News conducted some statistical analysis of judge data and found an effect of judge country-of-origin.

As a psychometrician, I see judge ratings as a type of measurement, and I personally would approach this issue as a measurement problem. Rasch, the measurement model I use most regularly these days, posits that an individual's response to an item (or, in the figure skating world, a part of a routine) depends on both the difficulty of the item and the ability of the individual. If you read up on the ISU judging system (and I'll be honest - I don't completely understand it but I'm working on it: perhaps for a Statistics Sunday post!), they do address this issue of difficulty in terms of the elements of the program: the jumps, spins, steps, and sequences skaters execute in their routine.
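For readers who like to see the idea in code, here is a minimal sketch of the simplest (dichotomous) Rasch model, where a response is either a success or a failure. Ability and difficulty sit on the same logit scale, and the probability of success depends only on the difference between them. The function name and example values are my own illustration, and real judging data would need a rating-scale version of the model rather than this success/failure one.

```python
import math


def rasch_probability(ability, difficulty):
    """Probability of success under the dichotomous Rasch model.

    Both arguments are in logits; only their difference matters.
    """
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


# Illustrative values only: one skater (ability 1.0 logits) attempting
# an easier element and a harder element.
easy_element = -0.5
hard_element = 2.0

print(rasch_probability(1.0, easy_element))
print(rasch_probability(1.0, hard_element))
```

When ability equals difficulty the probability is exactly 0.5; it rises as ability exceeds difficulty and falls as the element gets harder.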

There are guidelines as to which/how many of the elements must be present in the routine and they are ranked in terms of difficulty, meaning that successfully executing a difficult element results in more points awarded than successfully executing an easy element (and failing to execute an easy element results in more points deducted than failing to execute a difficult element).

But a particular approach to Rasch allows the inclusion of other factors that might influence scores, such as judge. This model, which considers judge to be a "facet," can model judge bias, and thus allow it to be corrected when computing an individual's ability level. The bias at issue here is not just overall; it's related to the concordance between judge home country and skater home country. This effect can be easily modeled with a Rasch Facets model.
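As a rough sketch of what that could look like, the snippet below extends the dichotomous Rasch formula with a judge severity term and an extra term that applies only when the judge and skater share a home country. This is an assumption-laden illustration, not the Facets software or any package's API: the function name, parameter names, and values are all hypothetical, and an actual analysis would estimate these parameters from data using a rating-scale model.

```python
import math


def facets_probability(ability, difficulty, severity,
                       home_bias=0.0, same_country=False):
    """Success probability with a judge facet and a home-country bias term.

    logit = ability - difficulty - severity (+ home_bias if same country)
    All quantities are in logits; home_bias is a hypothetical interaction
    term, positive when judges favor skaters from their own country.
    """
    logit = ability - difficulty - severity
    if same_country:
        logit += home_bias
    return 1.0 / (1.0 + math.exp(-logit))


# Illustrative values only: same skater, same element, same judge severity.
p_neutral = facets_probability(1.0, 0.5, 0.2, home_bias=0.8, same_country=False)
p_compatriot = facets_probability(1.0, 0.5, 0.2, home_bias=0.8, same_country=True)

print(p_neutral)
print(p_compatriot)
```

Because severity and the home-country term are separate parameters, a judge who is simply harsh with everyone can be distinguished from one who is only generous with compatriots, and either effect can be taken out when estimating a skater's ability.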

Of course, part of me feels the controversy at the beginning of the NBC article and video above is a bit overblown. The video fixates on an element Sotnikova blew - a difficult combination element (triple flip-double toe-double loop) she didn't quite execute perfectly. (She did land it though; she didn't fall.)

But the video does not show the easier element, a triple Lutz, that Kim didn't perfectly execute. (Once again, she landed it.) Admittedly, I only watched the medal-winning performances, and didn't see any of the earlier performances that might have shown Kim's superior skill and/or Sotnikova's supposed immaturity, but I could see, based on the concept of element difficulty, why one might have awarded Sotnikova more points than Kim, or at least, have deducted fewer points for Sotnikova's mistake than Kim's mistake.

In a future post, I plan to demonstrate how to fit a Rasch model, and hopefully at some point a Facets model, maybe even using some figure skating judging data. The holdup is that I'd like to demonstrate it using R, since R is open source and accessible to any of my readers, as opposed to the proprietary software I use at my job (Winsteps for Rasch and Facets for Rasch Facets). I'd also like to do some QC between Winsteps/Facets and R packages, to check for potential inaccuracies in computing results, so that the package(s) I present have been validated first.
