Wednesday, August 16, 2017

Stats Note: The Third Variable Problem

Correlation does not imply causation. You've probably heard that many times - including from me. When we have a correlation between variable A and variable B, it could be that A caused B, B caused A, or another variable C causes both. A famous example is the correlation between ice cream sales and murder rates. Does ice cream make people commit murder? Does committing murder make people crave ice cream? Or could it be that warm weather causes both? (Hint: It's that last one.)
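To see how a third variable can produce a correlation all by itself, here is a minimal simulation sketch. All of the numbers are made up: "temperature" drives both simulated ice cream sales and simulated murder rates, and neither of those two has any effect on the other. The helper names (pearson, partial_corr) are my own for illustration, not from any particular library.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def partial_corr(x, y, z):
    """Correlation between x and y after controlling for z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

random.seed(42)  # fixed seed so the simulation is repeatable
n = 5000

# Variable C (temperature) causes both A and B; A and B never touch each other.
temperature = [random.gauss(0, 1) for _ in range(n)]
ice_cream = [t + random.gauss(0, 1) for t in temperature]
murders = [t + random.gauss(0, 1) for t in temperature]

r_raw = pearson(ice_cream, murders)
r_partial = partial_corr(ice_cream, murders, temperature)

print(f"Ice cream vs. murders: r = {r_raw:.2f}")
print(f"Same, controlling for temperature: r = {r_partial:.2f}")
```

In this setup the raw correlation should land around 0.5, while the partial correlation (which statistically removes temperature) should sit close to 0. Real data are never this tidy, of course, but the logic is the same.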

The problem is that when people see a correlation between two things, and get confused about causality, they may intervene to change one thing in the hopes of changing the other. But that's not how it works. For a comedic example, this Saturday Morning Breakfast Cereal comic:

The cartoon references the famous Stanford "Marshmallow Study," which examined whether children could delay gratification. If you'd like to learn even more, the principal investigator, Walter Mischel, wrote a book about it.

Statistical Sins: Reinventing the Wheel - Some Open Data Resources

For today's Statistical Sins post, I'm doing things a little differently. Rather than discussing a specific study or piece of media about a study, I'm going to talk about a general trend. There's all this great data out there that could be used to answer questions, but I still see study after study collecting primary data.

Secondary data is a great way to save resources, answer questions and test hypotheses with large samples (sometimes even random samples), and practice statistical analysis.

To quickly define terms, primary data is the term used to describe data you collect yourself, then analyze and write about. Secondary data is a general term for data collected by someone else (that is, you weren't involved in that data collection) that you can use for your own purposes. Secondary data could be anything from a correlation matrix in a published journal article to a huge dataset containing responses from a government survey. And just as primary data can be qualitative, quantitative, or a little of both, so can secondary data.

We really don't have a good idea of how much data is floating around out there that researchers could use. But here are some good resources that can get you started on exploring what open data (data that is readily accessible online or that can be obtained through an application form) are available:
  • Open Science Framework - I've blogged about this site before; it lets you store your own data (and control how open it is) and access other open data
  • - The federal government's open data site, which not only has federal data, but also links to state, city, and county sites that offer open data as well
  • Global Open Data Index - To find open data from governments around the world
  • Open Data Handbook - This site helps you understand the nature of open data and helps you to make any data you've collected open, but there's also a resources tab that offers some open data sources
  • Project Open Data - Filled with great resources to help you on your open data journey, including some tools to convert data from one form (e.g., JSON files) to an easier-to-use form (e.g., CSV)
  • Open Access Button - Enter in a journal article you're reading, and this site will help you find or request the data
  • GitHub Open Data - Another open data option for some fun datasets, such as this dataset of Scrabble tournament games
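Since a lot of open data comes as JSON, here is a rough sketch of the JSON-to-CSV conversion mentioned above, using only Python's standard library. This is not the Project Open Data tooling itself, and the records are hypothetical; it also assumes a flat list of records that all share the same keys, which real files often don't.

```python
import csv
import io
import json

# Hypothetical JSON export: a flat list of records with identical keys.
raw = (
    '[{"city": "Chicago", "year": 2016, "value": 12.5},'
    ' {"city": "Boston", "year": 2016, "value": 9.1}]'
)
records = json.loads(raw)

# Write to an in-memory buffer; swap in open("data.csv", "w", newline="")
# to write a real file instead.
buffer = io.StringIO()
writer = csv.DictWriter(
    buffer,
    fieldnames=list(records[0].keys()),  # column order taken from the first record
    lineterminator="\n",
)
writer.writeheader()
writer.writerows(records)

csv_text = buffer.getvalue()
print(csv_text)
```

Nested JSON (records inside records) would need to be flattened first, which is where purpose-built tools earn their keep.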
And there's also lots of great data out there on social media. Accessing that data often involves interacting with the social media platform's API (application programming interface). Here's more information about Twitter's API; Twitter, in general, is a great social media data resource, because most tweets are public. I highly recommend this book if you want to learn more about mining social media data:

Tuesday, August 15, 2017

Every Now and Then: Total Eclipse of the Sun

We're less than a week away from the total solar eclipse that will make its way across the United States from Oregon to South Carolina. It seems that everyone is getting in on the fun. For instance, the most recent XKCD:

Sky and Telescope provides this list of apps to use the day of the eclipse.

Unfortunately, some companies are taking advantage of the eclipse frenzy by selling counterfeit glasses - glasses that fail to comply with the proper standards. Amazon has been issuing refunds to people who purchased glasses that may not meet the proper standards. The American Astronomical Society published this list of reputable vendors.

I plan to watch the eclipse from St. Joseph, MO, which is close to where I grew up in Kansas City, KS. (I even applied for and almost accepted a job in St. Jo back in 2010, but opted to work for the VA instead.)

Monday, August 14, 2017

On Charlottesville and Trump

As you probably already know, a rally calling itself "Unite the Right" convened this weekend in Charlottesville, VA, to protest the removal of a monument to Robert E. Lee. The rally quickly turned violent when a car was driven into an anti-racism protest organized as a response to the Unite the Right rally; 19 were injured and 1 was killed. Two state police officers called to assist with maintaining order also died in a helicopter crash.

Many were calling for the President to respond to the rally.

When the President eventually did respond, he failed to distance himself from these individuals and the organizations they represent, and emphasized that there was violence and hatred on many sides:
We condemn in the strongest possible terms this egregious display of hatred, bigotry and violence, on many sides. On many sides. It's been going on for a long time in our country. Not Donald Trump, not Barack Obama. This has been going on for a long, long time.
As Julia Azari of FiveThirtyEight points out, though Presidential responses to racial violence have always been rather weak, Trump's are even weaker.

I walk by Trump Tower in Chicago every day on my way to work. Here's what I saw in front of the building today:

Sunday, August 13, 2017

Statistics Sunday: How Does the Consumer Price Index Work?

You may have heard news stories about how much consumer prices have risen (or fallen) in the last month, like this recent one. And maybe, like me, you've wondered, "But how do they know?" It's all thanks to the Consumer Price Index, released each month by the Bureau of Labor Statistics. The most recent CPI came out Friday.

The CPI is a great demonstration of sampling and statistical analysis, so for today's Statistics Sunday, we'll delve into the history and process of the CPI.

What is the Consumer Price Index?

The CPI is based on prices of a representative sample (or what the Bureau of Labor Statistics calls a "basket") of goods and services - the things that the typical American will buy. These prices, which are collected in 87 urban areas, from about 23,000 retail and service establishments and 50,000 landlords and tenants, are collected each month, then weighted by total expenditures (how much people typically spend on each) from the Consumer Expenditure Survey. What they get as a result is a measure of inflation: how much the price of this sample of goods and services has changed over time. The CPI can also be used to correct for inflation (when making historical comparisons) and to adjust income (for industries that have wages tied to the CPI through a collective bargaining agreement).
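To make the weighting idea concrete, here is a toy sketch of an expenditure-weighted price index, plus the inflation correction it makes possible. The three-item basket, the weights, and the prices are all invented, and the function names are my own; the Bureau's actual methodology is far more involved than this.

```python
# Hypothetical basket: each item has an expenditure share (weight)
# and a price in the base period and the current period.
basket = {
    "food":      {"weight": 0.25, "base_price": 100.0,  "current_price": 104.0},
    "housing":   {"weight": 0.50, "base_price": 1000.0, "current_price": 1020.0},
    "transport": {"weight": 0.25, "base_price": 50.0,   "current_price": 49.0},
}

def price_index(basket):
    """Weighted average of price relatives, scaled so the base period is 100."""
    total_weight = sum(item["weight"] for item in basket.values())
    weighted_relatives = sum(
        item["weight"] * item["current_price"] / item["base_price"]
        for item in basket.values()
    )
    return 100.0 * weighted_relatives / total_weight

def adjust_for_inflation(amount, index_then, index_now):
    """Express an amount from an earlier period in current-period dollars."""
    return amount * index_now / index_then

index_now = price_index(basket)
adjusted = adjust_for_inflation(200.0, index_then=100.0, index_now=index_now)

print(f"Index (base period = 100): {index_now:.1f}")
print(f"$200 in the base period is about ${adjusted:.2f} today")
```

Notice that housing moves the index the most even though its price rose less than food's in percentage terms, simply because people spend more of their budget on it. That is the job the expenditure weights are doing.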

What's In the Basket?

The basket is determined from the results of the Consumer Expenditure Survey - the most recent one was in 2013 and 2014. These data are collected through a combination of interviews (often computer-guided, where an interviewer contacts the interviewee in person or over the phone, and asks a series of questions) and diary studies (in which families track their exact expenditures over a two-week period). The interviews and diaries assess over 200 categories of goods and services, which are organized into 8 broad categories:
  • Food and beverages - things like cereal, meat, coffee, milk, and wine
  • Housing - rent, furniture, and water or sewage charges
  • Apparel - clothing and certain accessories, like jewelry
  • Transportation - cost of a new car, gasoline, tolls, and car insurance
  • Medical care - prescriptions, cost of seeing a physician, or glasses
  • Recreation - television, tickets to movies or concerts, and sports equipment
  • Education and communication - college tuition, phone plans, and postage
  • Other goods and services - a catch-all for things that don't fit elsewhere, like tobacco products or haircuts

How is this Information Collected?

Believe it or not, the people who collect data for the CPI either call or visit establishments to get the prices. The data are sent to commodity experts at the Bureau, who review the data for accuracy, and may make changes to items in the index through direct changes or statistical analysis. For instance, if an item on the list, like a dozen eggs, changes in some way, such as stores selling eggs in packs of 10 instead, the commodity experts have to determine if they should change the index or conduct analysis to correct for changing quantity. This is a pretty easy comparison to make (10 eggs versus 12 eggs), of course, but when the analysts start dealing with two products that may be very different in features (such as comparing two different computers or tuition from different colleges), the analysis to equalize them for the index can become very complex. So not only are items weighted to generate the full index, but statistical analysis can occur throughout data preparation for generating the index.
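The egg example works out like this in a quick sketch. The prices are hypothetical, and this simple per-unit comparison is only my illustration of the idea, not the Bureau's actual adjustment procedure.

```python
def unit_price(pack_price, pack_size):
    """Price per single item in a pack."""
    return pack_price / pack_size

# Hypothetical prices: a dozen eggs for $3.00, later a 10-pack for $2.70.
old_pack_price, old_pack_size = 3.00, 12
new_pack_price, new_pack_size = 2.70, 10

# Naive comparison: just look at the price on the shelf.
sticker_change = (new_pack_price / old_pack_price - 1) * 100

# Quantity-adjusted comparison: look at the price per egg.
old_unit = unit_price(old_pack_price, old_pack_size)
new_unit = unit_price(new_pack_price, new_pack_size)
unit_change = (new_unit / old_unit - 1) * 100

print(f"Shelf price change: {sticker_change:+.1f}%")
print(f"Price per egg change: {unit_change:+.1f}%")
```

With these made-up numbers, the shelf price goes down while the price per egg goes up, which is exactly why the quantity change can't be ignored. Comparing two different computers has no such tidy common unit, which is where the more complex analysis comes in.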

Data for the three largest metropolitan areas - LA, New York, and Chicago - are collected monthly. Data for other urban areas are collected every other month or twice a year.

History of the CPI

The history of the CPI can be traced back to the late 1800s. The Bureau of Labor, which later became the Bureau of Labor Statistics, did its first major study from 1888 to 1891. This study was ordered by Congress to assess tariffs they had introduced to help pay off the debt from the Civil War. They were interested in key industrial sectors: iron and steel, coal, textiles, and glass. This is one of the first examples of applying indexing techniques to economic data.

From then on, the Bureau often did small statistical studies to answer questions for Congress and the President. From 1901 to 1903, they broadened their scope by doing a study of family expenditures, as well as analysis of costs from retailers, and applied the indexing techniques they had developed for industry to retail and living expenses. They published the results in a report called Relative Retail Price of Food, Weighted According to the Average Family Consumption, 1890 to 1902 (base of 1890–1899). Despite seeming quite dull from the title and subject matter, this report was actually quite controversial, because it highlighted a gap between growth in wages and increase in cost of living - that is, wages had grown more than costs, resulting in increased purchasing power. But it was released during a banking crisis, when many people were laid off and wages were cut, so the Bureau was accused of being politically motivated in their research and conclusions.

As a result of the outcry, and budget concerns, research by the Bureau was halted in 1907, and was very limited in scope when it returned in 1911, assessing fewer items and using mail surveys from retailers rather than visits by Bureau staff.

New leadership in the Bureau and the beginning of World War I rekindled research efforts. They began publishing a retail price index twice a year in 1919. But the Bureau got a major overhaul thanks to the efforts of FDR's Secretary of Labor, Frances Perkins. She made efforts to modernize the organization and recruit experts in the fields of methodology and statistical analysis. Two major contributors were American economist and statistician Helen Wright and British statistician Margaret Hogg. In fact, Hogg conducted analysis that demonstrated the weights then used for the index were biased, overstating the importance of food and understating the importance of other goods and services. When they also made changes to the sample of prices to include, they had to hire more staff to go out and collect price data.

Other major changes in the history of the CPI included introducing an index specific to "lower-salaried workers in large cities" in the early 1940s, a gradual shift from a constant-goods (where the same basket is always used) to a constant-utility (where goods for the basket are determined by level of utility or satisfaction - that is, new useful goods can be added) framework from the 1940s to 1970s, and a partnership with the U.S. Census Bureau in the late 1970s. The first collective bargaining agreements - in which companies agreed to link workers' wages to the CPI to prevent strikes - occurred in the late 1940s and early 1950s.

Summing It All Up

Not only is the CPI an index of inflation - it represents cultural shifts in how we think about and consume goods and services. The shifting basket over time reflects changes in our day-to-day lives, the birth and/or death of different industries, and the changes in technology.

I'll admit, I wasn't really that interested in the CPI until I learned about the contributions of statisticians over the years. And it's an example of women making strong contributions to economic and statistical thought, so it's a shame that we don't hear more about it. In fact, statistician Dr. Janet Norwood, who joined the Bureau in 1963, and served as commissioner from 1979 to 1991, made some very important changes in her time there. For instance, a representative of the policy arm of the Department of Labor sat in on meetings about research results and press releases from the Bureau - until Dr. Norwood stopped this practice to make sure economic information was seen as accurate and nonpartisan.

If you're now as fascinated as I am, you can learn more about the CPI and its data here.

Saturday, August 12, 2017

Another Response to the Google Memo

On Wednesday, I wrote my own response to the "Google memo" in which I focused on the (pseudo)science used in the memo. I had such a great time writing that post and chatting with people afterward that I'm working on another writing project along those lines. Stay tuned.

But I'm thankful to Holly Brockwell, for focusing on the history of women in tech in her response. Because as she points out, women were there all along:
The viewpoint Damore is espousing is known as biological essentialism. It’s used by people who have been told all their lives that they’re special and brilliant, and in moments of insecurity or arrogance, seek to prove this with junk science. Junk science like “women are biologically unsuited to technical work”, which – despite all his thesaurus-bothering, pseudoscientific linguistic cladding (see, I can do it too) – is the reductive crux of his argument.

Damore clearly thinks he’s schooling the world on biology, but it’s actually history he should have been paying attention to. Because he either doesn’t know or has chosen to forget that women were the originators of programming, and dominated the software field until men rode in and claimed all the glory.
Ada Lovelace, author of the first computer algorithm
The fact is, programming was considered repetitive, unglamorous “women’s work”, like typing and punching cards, until it turned out to be a lucrative and prestigious field. Then, predictably, the achievements of women were wiped from the scoreboard and men like James Damore pretended they were never there.

Marie Hicks, author of Programmed Inequality – How Britain Discarded Women Technologists and Lost Its Edge in Computing, believes the subordination of women in computer science has limited progress for everyone.

“The history of computing shows that again and again women’s achievements were submerged and their potential squandered – at the expense of the industry as a whole,” she explains. “The many technical women who were good at their jobs had the opportunity to train their male replacements once computing began to rise in prestige – and were subsequently pushed out of the field.

“These women and men did the same work, yet the less experienced newcomers to the field were considered computer experts, while the women who trained them were merely expendable workers. This has everything to do with power and cultural expectation, and nothing to do with biological difference.”

It might be comforting for mediocre men to believe that they’re simply born superior. That’s what society’s been telling them all their lives, and no one questions a compliment. But when they try to dress up their insecurities as science, they’d better be ready for women to challenge them on the facts. Because really, sexism is just bad programming, and we’d be happy to teach you how to fix it.
In fact, some of the first women to contribute to statistics did so as human computers, who worked for many hours repeating calculations on mechanical calculators to fill in the tables of critical values and probabilities to accompany statistical tests.

Friday, August 11, 2017

Made for Math

Via NPR, research suggests that we're all born with math abilities, which we can hone as we grow:
As an undergraduate at the University of Arizona, Kristy vanMarle knew she wanted to go to grad school for psychology, but wasn't sure what lab to join. Then, she saw a flyer: Did you know that babies can count?

"I thought, No way. Babies probably can't count, and they certainly don't count the way that we do," she says. But the seed was planted, and vanMarle started down her path of study.

What's been the focus of your most recent research?

Being literate with numbers and math is becoming increasingly important in modern society — perhaps even more important than literacy, which was the focus of a lot of educational initiatives for so many years.

We know now that numeracy at the end of high school is a really strong and important predictor of an individual's economic and occupational success. We also know from many, many different studies — including those conducted by my MU colleague, David Geary — that kids who start school behind their peers in math tend to stay behind. And the gap widens over the course of their schooling.

Our project is trying to get at what early predictors we can uncover that will tell us who might be at risk for being behind their peers when they enter kindergarten. We're taking what we know and going back a couple steps to see if we can identify kids at risk in the hopes of creating some interventions that can catch them up before school entry and put them on a much more positive path.

Your research points out that parents aren't engaging their kids in number-learning nearly enough at home. What should parents be doing?

There are any number of opportunities (no pun intended) to point out numbers to your toddler. When you hand them two crackers, you can place them on the table, count them ("one, two!" "two cookies!") as they watch. That simple interaction reinforces two of the most important rules of counting — one-to-one correspondence (labeling each item exactly once, maybe pointing as you do) and cardinality (in this case, repeating the last number to signify it stands for the total number in the set). Parents can also engage children by asking them to judge the ordinality of numbers: "I have two crackers and you have three! Who has more, you or me?"

Cooking is another common activity where children can get exposed to amounts and the relationships between amounts.

I think everyday situations present parents with lots of opportunities to help children learn the meanings of numbers and the relationships between the numbers.