Wednesday, October 18, 2017

Statistical Sins: Know Your Variables (A Confession)

We all have the potential to be a statistical sinner; I definitely have been on more than one occasion. This morning, I was thinking about a sin I committed about a year ago at Houghton Mifflin Harcourt. So this is a confessional post.

We were working on a large language survey, involving 8 tests, one of which was new. This is an individually administered battery of tests, meaning a trained individual gives the test one-on-one to the examinee. Questions are read aloud and the examinee responds either verbally or in writing. Each test has only one set of questions and is adaptive: the set of questions the examinee receives depends on their pattern of correct answers. If they get the first few questions right, they go on to harder questions, but if they get the first few wrong, they go back in the book to easier questions. The test ends when the examinee gets a certain number incorrect in a row or reaches the end of the book (whichever comes first).

When giving the test, the administrator won't always start at the beginning of the book. Those are the easiest questions, reserved for the youngest/lowest ability test-takers. Each test has recommended starting places, usually based on age, but the administrator is encouraged to use his or her knowledge of the examinee (these tests are often administered by school psychologists, who may have some idea of the examinee's ability) to determine a starting point.

We had one brand new test and needed to generate starting points, since we couldn't use starting points from a previous revision of the battery. Because this new test was strongly related to another test in the battery, we decided to generate recommended starting points based on the examinee's raw score on that other test. We knew we would need a regression-based technique, but otherwise, I was given complete control over this set of analyses.

After generating some scatterplots, I found the data followed a pretty standard growth curve, specifically a logistic growth curve:


So standard linear regression would not work, because of the curve. The usual fix is polynomial regression: adding higher-order terms (squared, cubed, and so on) to model the curvature.
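To make that concrete, here's a small sketch in R with simulated data (not our actual test data), fitting a cubic polynomial:

    # Simulated data for illustration - not our actual test data
    set.seed(42)
    other_raw <- sample(10:80, 500, replace = TRUE)
    # A logistic growth curve relating the two tests, plus noise
    new_raw <- 60 / (1 + exp(-0.15 * (other_raw - 45))) + rnorm(500, sd = 4)

    # Polynomial regression: squared and cubed terms to capture the curve
    poly_fit <- lm(new_raw ~ poly(other_raw, 3))
    summary(poly_fit)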

But the data violated another assumption of regression, one that polynomial terms can't fix: the variance was not equal (or even approximately equal) across the curve. There was substantially more variation in some parts of the curve than others. In statistical terms, we call this heteroscedasticity. I did some research and found a solution: quantile regression. It's a really cool technique that is pretty easy to pick up if you understand regression. Essentially, quantile regression fits separate intercepts (constants) and slopes for different quantiles of the outcome, and you can set those quantiles at whatever values you'd like. Best of all, quantile regression makes no assumption of equal variance. I read some articles, learned how to do the analysis in R (using the quantreg package), and away I went.
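Here's a minimal sketch of what that looks like with quantreg, continuing the simulated data from above; the quantiles shown are arbitrary choices for illustration, not the ones we used:

    library(quantreg)  # install.packages("quantreg") if needed

    # Separate fits at the 25th, 50th, and 75th percentiles; each
    # quantile gets its own intercept (constant) and slopes
    qr_fit <- rq(new_raw ~ poly(other_raw, 3), tau = c(0.25, 0.50, 0.75))
    summary(qr_fit)

    # Predicted new-test score at each quantile for one raw score
    predict(qr_fit, newdata = data.frame(other_raw = 40))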

I was so proud of myself.

We decided to use raw score instead of scale score for the starting points. These tests were developed with the Rasch measurement model, but the test administrator would only get an approximate scale score from the tables in the book. Final scores, which are conversions of Rasch logits, are generated by a scoring program used after administering all tests. Since the administrator is obtaining raw scores as he or she goes (you have to know right away whether a person responded correctly to determine which question to ask next), raw score would be readily available and the most logical metric for administrators. I had my Winsteps output, which gave person ID, raw score, Rasch ability, and some other indicators (like standard error of measurement) for each person in our pilot sample. So I imported those outputs from the two tests, matched on ID, and ran my analysis.
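The matching step itself was straightforward; here's a sketch of what it might look like in R (the file and column names are hypothetical):

    # Hypothetical file names - Winsteps person output saved as CSV
    test_a <- read.csv("winsteps_person_testA.csv")
    test_b <- read.csv("winsteps_person_testB.csv")

    # Match examinees across the two tests by person ID; suffixes keep
    # each test's raw score and Rasch ability columns distinct
    matched <- merge(test_a, test_b, by = "ID", suffixes = c("_a", "_b"))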

I stress once again: I used the Winsteps person output to obtain my raw scores.

My data were a mess. There seemed to be no relationship between scores on the two tests. I went back a step, generating frequencies and correlations. I presented the results to the team and we talked about how this could have happened. Was there something wrong with the test? With the sample? Were we working with the wrong data?

I don't know who figured it out first, but it was not me. Someone asked, "Where did the raw scores come from?" And it hit me.

Winsteps generates raw scores from the number of items a person actually answered correctly: only the questions administered, and no others. But on an adaptive test, we don't administer all of the questions; we administer only the set needed to determine a person's ability. We skip the easy questions because they don't tell us much; we already know the person will get most, if not all, of them correct. So when the administrator computes a raw score, he or she adds in points for the easy questions that weren't administered. Winsteps doesn't do that. It simply counts and adds.
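A hypothetical example of the discrepancy (the item counts and scoring numbers here are made up for illustration, not our actual scoring rules):

    # Suppose the test has items 1-20, the administrator started at
    # item 11, and the examinee answered items 11-20
    administered <- 11:20
    responses <- c(1, 1, 1, 0, 1, 1, 0, 1, 0, 0)  # 1 = correct

    # What Winsteps reports: a simple count of correct responses
    winsteps_raw <- sum(responses)            # 6

    # What the administrator records: full credit for the easier,
    # unadministered items below the starting point
    basal_credit <- min(administered) - 1     # items 1-10 assumed correct
    administrator_raw <- winsteps_raw + basal_credit  # 16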

There was no relationship between the two variables because I wasn't using the correct raw score variable. I had a column called raw score and just went on autopilot.

So I had a couple days of feeling super proud of myself for figuring out quantile regression... and at least that long feeling like an idiot for running the analysis without really digging into my data. The lack of relationship between the two tests should have been a dead giveaway that there was something wrong with my data. And problems with data are often caused by human error.


Monday, October 16, 2017

Preparing for NaNoWriMo

I'm preparing once again to write 50,000 words in November, as part of National Novel Writing Month (or NaNoWriMo). October is affectionately known as "Preptober" - it's the month where NaNoWriMo participants go through whatever planning they need to do to win (i.e., write 50,000 words).

As I've blogged before, I'm a plantser: I like having some freedom to see where the story takes me, but I need a roadmap or I get overwhelmed and don't finish. This makes me a sort of hybrid of the plot-driven writers, like Orson Scott Card, and the character-driven writers, like Stephen King.

Speaking of Stephen King, as part of Preptober I've been reading advice on writing, and yesterday I finished Stephen King's book on the topic, On Writing:


It was nice to learn more about his approach, because last year, I really didn't think his approach was for me. I'd tried just sitting down and writing, seeing where I ended up, and that has worked reasonably well for me on short stories, but for something as long as a novel, I get blocked.

Stephen King starts with a situation, which may also include where he thinks things will end up when he's finished writing. Then he comes up with the characters and develops them as they encounter the situation. And that's when he lets things just... unfold. The characters tell him where things go.

This may sound crazy to a non-writer: the characters tell him... They're creations of the author, so how could they tell the author what to write? It's a very strange thing when you're writing, and you create characters that take on a life of their own. I've had this experience several times when I was writing Killing Mr. Johnson, the book I worked on for last year's NaNoWriMo (a book I still need to finish). In fact, I was thinking about the story the other day, and trying to understand a character's motivation for doing something in the story. She reacted in a strange way to the chain of events, and as I was thinking about her, I realized why - or rather, she told me why. And it all made sense. I also have a couple of characters who are demanding more attention, so I plan on writing a few more scenes for them.

For this year's NaNoWriMo, I'll be working on a new idea. And I'm going to try taking the Stephen King approach, at least in part. I already know approximately how things are going to end up, and I've been working on developing the characters. In November, I'm going to try just sitting down and writing. We'll see how it goes.

Sunday, October 15, 2017

Statistics Sunday: These Are a Few of My Favorite Posts

You know how sitcoms would occasionally have a flashback episode? There would be some sort of situation where the main characters are stuck somewhere (like an elevator) and, as they wait to get out, they reminisce about different things that happened over the last season. You got to review the highlights, and the writers (and actors) got a break.

That's what today is: here are some of my favorite posts from the course of my statistics writing - posts people seemed to enjoy or that I had a lot of fun writing.

Statistical Sins: Handling and Understanding Criticism - I'm really enjoying blogging about the Fisher, Pearson, and Neyman feuds. In fact, in line with the new Edison vs. Westinghouse movie, The Current War, I'm thinking of writing my own dramatic account of these feuds. I mean, if they can make alternating current versus direct current seem exciting, just wait until you watch the scene where Neyman escapes Fisher's criticism because Fisher can't speak a word of French. I just need to figure out who Benedict Cumberbatch should play.

Statistics Sunday: Why Is It Called Data Science? - This post generated so much discussion. It's exactly what I've wanted to see from my blog posts: thoughtful discussion in response. Even vehement disagreement is awesome. My professor's heart grew three sizes that day.

Statistical Sins: Three Things We Love - And the greatest of these is still bacon.

Statistics Sunday: What Are Degrees of Freedom? (Part 2) - My favorite type of post to write; one where I learn something by going through the act of explaining it to others.

Statistical Sins: Women in Tech (Here It Goes Again) - In which I rip apart the Google memo, written by a guy who clearly doesn't remember (know anything about) the long history of women in programming and mathematics. Seriously, didn't he at least watch Hidden Figures?

Statistics Sunday: Everyone Loves a Log (Odds Ratio) - Which helped set the stage for a post about Rasch.

Statistics Sunday: No Really, What's Bayes' Got to Do With It? - When I first encountered Bayes' Theorem, I had some trouble wrapping my head around it. So I did the same thing as I did for degrees of freedom: I made myself sit down and write about it. And I finally understand it. Tversky and Kahneman would be so proud.

Statistics Sunday: Null and Alternative Hypotheses - Philosophy of science is one of my favorite topics to pontificate about. It's even more fun for me than debating semantics... and I love debating semantics.

Great Minds in Statistics: F.N. David versus the Patriarchy - Ooo, another movie idea. I very nearly called this post F.N. David versus the Mother F***ing Patriarchy, but decided against it.

That's all for today! This afternoon, I'll be performing Carmina Burana with the Chicago Philharmonic and my choir, the Apollo Chorus of Chicago. And heading out of town tomorrow.

Also, I'm horribly unoriginal: I did this once before. And of course, you can dig through my April 2017 Blogging A to Z Challenge, in which I wrote about the A to Z of statistics.

I'm working on some new statistics Sunday posts. What topics would you like to see here?

Friday, October 13, 2017

Statistical Sins: Hidden Trends and the Downfall of Sears

Without going into too much detail, it's been a rough week, so I haven't really been blogging much. But today, I made myself sit down and start thinking about what statistical sin to talk about this week. There were many potential topics. In fact, the recent events in Las Vegas have resulted in a great deal of writing about various analyses, trying to determine whether or not shootings are preventable - specifically by assessing what impact various gun laws have had on the occurrence of these events. Obviously this is a difficult thing to study statistically because these types of shootings are still, in the relative sense, rare, and a classical statistics approach is unlikely to uncover many potential predictors with so little variance to partition. (A Bayesian approach would probably be better.) I may write more on this in the future, but I'll admit I don't have the bandwidth at the moment to deal with such a heavy subject.

So instead, I'll write about a topic my friend over at The Daily Parker also covers pretty frequently: the slow death of Sears. You see, Sears made some really great moves by examining the statistics, but also made some really bad moves by failing to look at the statistics and use the data-driven approaches that allowed its competitors to thrive.

The recession following World War I, combined with an increase in chain stores, threatened Sears's mail order business. It was General Robert Wood, incredibly knowledgeable about the U.S. Census and Statistical Abstract, who put Sears back on track by urging the company to open brick-and-mortar stores. By the mid-20th century, Sears revenue accounted for 1 percent of U.S. GDP.

But then the market shifted again in the 1970s and 80s, and the decisions Sears made at this time paved the way for its downfall, at least according to Daniel Raff and Peter Temin. As Derek Thompson of CityLab summarizes their insightful essay:
Eager to become America’s largest brokerage, and perhaps even America’s largest community bank, Sears bought the real-estate company Coldwell Banker and the brokerage firm Dean Witter. It was a weird marriage. As the financial companies thrived nationally, their Sears locations suffered from the start. Buying car parts and then insuring them against future damage makes sense. But buying a four-speed washer-dryer and then celebrating with an in-store purchase of some junk bonds? No, that did not make sense.

But the problem with the Coldwell Banker and Dean Witter acquisitions wasn’t that they flopped. It was that their off-site locations didn’t flop—instead, their moderate success disguised the deterioration of Sears’s core business at a time when several competitors were starting to gain footholds in the market.
The largest competitor was Walmart, which not only offered cheap goods but also used data-driven approaches to ensure that shelves were stocked with the products most likely to sell and that inventory was determined entirely by those figures. Sears, instead, was asking local managers to report trends back to headquarters.

As I commented almost exactly 6 years ago (what are the odds of that?), using "big data" to understand the customer is becoming the norm. Businesses unwilling to do this are not going to last.

Sears, sadly, is not going to survive. But Derek Thompson has some advice for Amazon, which he considers today's counterpart to yesterday's Sears.
First, retail is in a state of perpetual metamorphosis. People are constantly seeking more convenient ways of buying stuff, and they are surprisingly willing to embrace new modes of shopping. As a result, companies can’t rely on a strong Lindy Effect in retail, where past success predicts future earnings. They have to be in a state of constant learning.

Second, even large technological advantages for retailers are fleeting. Sears was the logistics king of the middle of the 20th century. But by the 1980s and 1990s, it was woefully behind the IT systems that made Walmart cheaper and more efficient. Today, Amazon now finds itself in a race with Walmart and smaller online-first retailers. Amazon shows few signs of technological complacency, but the company is still only in its early 20s; Sears was overtaken after it had been around for about a century.

Third, there is no strategic replacement for being obsessed with people and their behavior. Walmart didn’t overtake Sears merely because its technology was more sophisticated; it beat Sears because its technology allowed the company to respond more quickly to shifting consumer demands, even at a store-by-store level. When General Robert Wood made the determination to add brick-and-mortar stores to Sears’s mail-order business, his decision wasn’t driven by the pursuit of grandeur, but rather by an obsession with statistics that showed Americans migrating into cities and suburbs inside the next big shopping technology—cars.

Finally, adding more businesses is not the same as building a better business. When Sears added general merchandise to watches, it thrived. When it added cars and even mobile homes to its famous catalogue, it thrived. When it sold auto insurance along with its car parts, it thrived. But then it chased after the 1980s Wall Street boom by absorbing real-estate and brokerage firms. These acquisitions weren’t flops. Far worse, they were ostensibly successful mergers that took management’s eye off the bigger issue: Walmart was crushing Sears in its core business. Amazon should be wary of letting its expansive ambitions distract from its core mission—to serve busy shoppers with unrivaled choice, price, and delivery speed.
 

Monday, October 9, 2017

Complex Models and Control Files: From the Desk of a Psychometrician

We're getting ready to send out a job analysis survey, part of our content validation study. In the meantime, I'm working on preparing control files to analyze the data when we get it back. I won't be running the analysis for a couple weeks, but the model I'll be using is complex enough (in part because I added in some nerdy research questions to help determine best practices for these types of surveys) that I decided to start thinking about it now.

I realize there's a lot of information to unpack in that first paragraph. Without going into too much detail, here's a bit of background. We analyze survey data using the Rasch model. This model assumes that an individual's response to an item is a function of his/her ability level and the difficulty level of the item itself. For this kind of measure, where we're asking people to rate items on a scale, we're not measuring ability; rather, we're measuring a trait - an individual's proclivity toward a job task. In this arrangement, items are not difficult/easy but more common/less common, or more important/less important, and so on. The analysis gives us probabilities that people at different ability (trait) levels will respond to an item in a certain way:


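For a dichotomous item, the underlying math is simple; here's a quick sketch in R of the basic Rasch probability, with ability and difficulty both on the logit scale:

    # Basic Rasch model: probability of a correct (or endorsed) response
    # as a function of person ability (theta) and item difficulty (b)
    rasch_prob <- function(theta, b) {
      exp(theta - b) / (1 + exp(theta - b))
    }

    rasch_prob(theta = 1, b = 0)   # higher-ability person: ~0.73
    rasch_prob(theta = -1, b = 0)  # lower-ability person: ~0.27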
It's common for job analysis surveys to use multiple rating scales on the same set of items, such as having respondents go through and rate items on how frequently they perform them, and then go through again and rate how important it is to complete a task correctly. For this kind of model, we use a Rasch Facets model. A facet is something that affects responses to an item. Technically, any Rasch model is a facets model; in a basic Rasch model, there are two facets: respondents (and their ability/trait level) and items. When you're using multiple rating scales, scale is a facet.

And because I'm a nerd, I decided to add another facet: rating scale order. The reason we have people rate with one scale and then go through and rate with the second (instead of seeing both at once) is so that people are less likely to anchor responses on one scale to responses on the other scale. That is, if I rate an item as very frequent, I might also view it as more important when viewing both scales than I would have had I used the scales in isolation. But I wonder if there still might be some anchoring effects. So I decided to counterbalance: half of respondents will get one scale first, and the other half will get the other scale first. I can analyze this facet to see if it affected responses.

This means we have 4 facets, right? Items, respondents, rating scale, and order. Well, here's the more complex part. We have two different versions of the frequency scale: one for tasks that happen a lot (which assesses daily frequency) and one for less common tasks (which assesses weekly/monthly frequency). All items use the same importance scale. The two frequency scales have the same number of categories, but because we may need to collapse categories during the analysis phase, it's possible that we'll end up with two very different scales. So I need to account for the fact that, on the frequency scale, half of the items share one common response structure and the other half share another, while on the importance scale, all items share a single response structure.

I'm working on figuring out how to express that in the control file, which is a text file used by Rasch software to describe all the aspects of the model and analysis. It's similar to any syntax file for statistics software: there's a specific format needed for the program to read the file and run the specified analysis. I've spent the morning digging through help files and articles, and I think I'm getting closer to having a complete control file that should run the analysis I need.
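For the curious, here's a rough, heavily simplified sketch of the kind of thing a Facets-style control file expresses. The facet order, item ranges, and category counts here are all hypothetical, and the real file is considerably more involved:

    ; Illustrative sketch only - item ranges and numbers are hypothetical
    Title = Job analysis survey
    Facets = 4   ; 1 = respondents, 2 = items, 3 = rating scale, 4 = order
    Models =
      ?, 1-40, 1, ?, R5    ; common tasks rated on the daily frequency scale
      ?, 41-80, 1, ?, R5   ; rarer tasks rated on the weekly/monthly scale
      ?, ?, 2, ?, R5       ; all items share the importance scale
    *

Each model statement gets its own rating scale structure, which is the point: the two frequency structures can be collapsed independently during analysis without touching the shared importance scale.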

The Voice of the CTA

Whenever I get into a speaking elevator, or follow the sound of "Track 8" to find my way to my Metra home, I wonder about the person behind the voice. Today, a friend shared a great video of Lee Crooks, the voice behind the Chicago Transit Authority:


Now I'll have a face to picture as I'm crowding onto a train to the tune of "Doors closing."

Sunday, October 8, 2017

Statistics Sunday: Why Is It Called 'Data Science'?

In one of the Facebook groups where I share my statistics posts, a user had an excellent question: "Why is it called data science? Isn't any science that uses empirical data 'data science'?"

I thought this was a really good point. Calling this one field data science implies that other scientific fields using data are not doing so scientifically or rigorously. And even data scientists recognize that there's a fair amount of 'art' involved in data science, because there isn't always a right way to do something - there are simply ways that are more justified than others. In fact, I just started working through this book on that very subject:

What I've learned digging into this field of data science, in the hopes of one day calling myself a data scientist, is that statistics is an integral part of the field. Further, data science is a team sport - it isn't necessary (and it may even be impossible) to be an expert in all the areas of data science: statistics, programming, and domain knowledge. As someone with expertise in statistics, I'm likely better off building additional knowledge in statistical analysis used in data science, like machine learning, and building up enough coding knowledge to be able to understand my data science collaborators with expertise in programming.

But that still doesn't answer our main question: why is it called data science? I think what it comes down to is that data science involves teaching (programming) computers to do things that once had to be done by a person. Statistics as a field has been around much longer than computers (and I mean the objects called computers, not the people who were once known as computers). In fact, statistics has been around even prior to mechanical calculators. Many statistical approaches didn't really need calculators or computers. It took a while, but you could still do it by hand. All that was needed was to know the math behind it. And that is how we teach computers - as long as we know the math behind it, we can teach a computer to do just about anything.

First, we were able to teach computers to do simple statistical analyses: descriptives and basic inferential statistics. A person can do this, of course; a computer can just do it faster. We kept building up new statistical approaches and teaching computers to do those analyses for us - complex linear models, structural equation models, psychometric approaches, and so on.

Then, we were able to teach computers to learn from relationships between words and phrases. Whereas before we needed a person to learn the "art" of naming things, we developed the math behind it and taught it to computers. Now we have approaches like machine learning, where you can feed in information to the computer (like names of paint shades or motivational slogans) and have the computer learn how to generate that material itself. Sure, the results of these undertakings are still hilarious and a long way away from replacing people, but as we continue to develop the math behind this approach, computers will get better.

Related to this concept (and big thanks to a reader for pointing this out) is the movement from working with structured data to unstructured data. Once again, we needed a person to enter/reformat data so we could work with it; that's not necessary anymore.

So we've moved from teaching computers to work with numbers to teaching them to work with words (really, any unstructured data). And now, we've also taught computers to work with images. Once again, you previously needed a person to go through pictures and tag them descriptively; today, a computer can do that. And as with machine learning, computers are only going to get better and more nuanced in their ability to work with images.

Once we know the math behind it, we can teach a computer to work with basically any kind of data. In fact, during the conference I attended, I learned about some places that are working with auditory data, to get computers to recognize (and even translate in real-time) human languages. These were all tasks that needed a human, because we didn't know how to teach the computers to do it for us. That's what data science is about. It still might not be a great name for the field, but I can understand where that name is coming from.

What are your thoughts on data science? Is there a better name we could use to describe it? And what do you think will be the next big achievement in data science?