Saturday, August 19, 2017

Countdown to the Eclipse

We're just days away from the 2017 total solar eclipse, and I'm writing this from my parents' house in Kansas City. We'll be heading north on Monday to watch the eclipse, since we won't be able to see the totality from here, and we're already equipped with our ISO-compliant eclipse glasses.

Hopefully you, dear reader, have identified where you'll be able to watch the eclipse. And if you're curious about what the eclipse will look like in different locations, Time Magazine has put together this awesome animation: enter a zip code and you'll see an animation of what the eclipse will look like there. As an example, here's a GIF of what the eclipse will look like from Goreville, Illinois, which will see a full two and a half minutes of totality:


We've also purchased a solar filter for our camera, so we'll be able to get some pictures of the eclipse. Check in Monday for an update!

Thursday, August 17, 2017

Women in the Work Force

Via Bloomberg, the Bureau of Labor Statistics released data showing that the work force participation rate among women has increased by 0.3 percentage points since January, bringing the gap in participation rate between men and women to 13.2 percentage points.


This is the lowest that gap has been since 1948. However, overall participation in the U.S. is low at 62.9 percent. This is due in part to decreased participation rates among prime-age men:
The declining participation among prime-age male workers has become an area of focus for President Donald Trump’s administration. Trump campaigned on reviving traditionally male-dominated industries such as coal mining and manufacturing that have struggled against greater globalization. Amid record-high job openings, the president has emphasized that Americans need to be open about relocating for work.
You know, like how Trump has relocated for his job, and stopped spending so much time at his penthouse in New York or his resort at Mar-a-Lago.

The lower participation rate overall, and especially among men, has many potential causes:
Prohibitive childcare costs make parents’ decision to return to work more difficult, and prime-age Americans are feeling the increased burden of caring for an aging population. The opioid epidemic also helps explain why a portion of the workforce is deemed unemployable. And immigration limits imposed by the Trump administration could curb workforce growth in industries such as farming and construction that are dominated by the foreign-born.
The Bloomberg article also highlights some recent work by Thumbtack Inc., which has found increases in women-owned businesses in traditionally male-dominated professions:
Lucas Puente, chief economist at Thumbtack Inc., sees advances across the industries in which his company matches consumers and professional service workers. While men still make up about 60 percent of the 250,000 active small businesses listing their services on Thumbtack, women are gaining ground more quickly, even among traditionally male-dominated professions. Among the top 10 fastest-growing women-owned businesses on Thumbtack in the past year are plumbers, electricians, and carpenters, according to the company’s survey data.

Wednesday, August 16, 2017

Stats Note: The Third Variable Problem

Correlation does not imply causation. You've probably heard that many times - including from me. When we have a correlation between variable A and variable B, it could be that A caused B, B caused A, or another variable C causes both. A famous example is the correlation between ice cream sales and murder rates. Does ice cream make people commit murder? Does committing murder make people crave ice cream? Or could it be that warm weather causes both? (Hint: It's that last one.)
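To see the third variable problem in action, here's a small simulation sketch. Everything in it is made up for illustration: the variable names, the sample size, and the assumption that temperature drives both outcomes with equal strength. Ice cream sales and murders never influence each other in this simulated world, yet they come out correlated, and the correlation should shrink toward zero once temperature is statistically removed from both.

```python
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def residuals(y, x):
    """What is left of y after removing its linear relationship with x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return [b - (my + slope * (a - mx)) for a, b in zip(x, y)]

rng = random.Random(42)  # fixed seed so the run is repeatable
n = 5000

# Hypothetical standardized data: temperature (C) causes both A and B.
temperature = [rng.gauss(0, 1) for _ in range(n)]
ice_cream = [t + rng.gauss(0, 1) for t in temperature]
murders = [t + rng.gauss(0, 1) for t in temperature]

# Raw correlation between A and B, ignoring C.
raw_r = pearson(ice_cream, murders)

# Partial correlation: correlate what remains of A and B once C is removed.
partial_r = pearson(residuals(ice_cream, temperature),
                    residuals(murders, temperature))

print(f"ice cream vs. murders: r = {raw_r:.2f}")
print(f"controlling for temperature: r = {partial_r:.2f}")
```

With this setup the raw correlation should land somewhere near 0.5, while the partial correlation should sit close to 0. Banning ice cream in this simulated world would do nothing to the murder rate.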

The problem is that when people see a correlation between two things, and get confused about causality, they may intervene to change one thing in the hopes of changing the other. But that's not how it works. For a comedic example, this Saturday Morning Breakfast Cereal comic:


The cartoon references the famous Stanford "Marshmallow Study," which examined whether children could delay gratification. If you'd like to learn even more, the principal investigator, Walter Mischel, wrote a book about it.

Statistical Sins: Reinventing the Wheel - Some Open Data Resources

For today's Statistical Sins post, I'm doing things a little differently. Rather than discussing a specific study or piece of media about a study, I'm going to talk about a general trend. There's all this great data out there that could be used to answer questions, but I still see study after study collecting primary data.

Secondary data is a great way to save resources, answer questions and test hypotheses with large samples (sometimes even random samples), and practice statistical analysis.


To quickly define terms, primary data is the term used to describe data you collect yourself, then analyze and write about. Secondary data is a general term for data collected by someone else (that is, you weren't involved in that data collection) that you can use for your own purposes. Secondary data could be anything from a correlation matrix in a published journal article to a huge dataset containing responses from a government survey. And just as primary data can be qualitative, quantitative, or a little of both, so can secondary data.

We really don't have a good idea of how much data is floating around out there that researchers could use. But here are some good resources that can get you started on exploring what open data (data that is readily accessible online or that can be obtained through an application form) are available:
  • Open Science Framework - I've blogged about this site before; it lets you store your own data (and control how open it is) and access other open data
  • Data.gov - The federal government's open data site, which not only has federal data, but also links to state, city, and county sites that offer open data as well
  • Global Open Data Index - To find open data from governments around the world
  • Open Data Handbook - This site helps you understand the nature of open data and helps you to make any data you've collected open, but there's also a resources tab that offers some open data sources
  • Project Open Data - Filled with great resources to help you on your open data journey, including some tools to convert data from one form (e.g., JSON files) to an easier-to-use form (e.g., CSV)
  • Open Access Button - Enter in a journal article you're reading, and this site will help you find or request the data
  • GitHub Open Data - Another open data option for some fun datasets, such as this dataset of Scrabble tournament games
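If you end up with a JSON file and would rather have a CSV, the conversion can be sketched in a few lines of Python using only the standard library. This is a minimal sketch, not one of the Project Open Data tools: the function name and the sample records are invented for illustration, and it assumes the JSON is a flat list of records with no nested fields.

```python
import csv
import io
import json

def json_records_to_csv(json_text):
    """Convert a JSON array of flat objects into CSV text."""
    records = json.loads(json_text)

    # Collect every key that appears, in first-seen order, as the header row.
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)

    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)  # keys missing from a record become empty cells
    return buffer.getvalue()

# Made-up sample records; a real open dataset will have its own fields.
sample = '[{"id": 1, "city": "Chicago"}, {"id": 2, "city": "Omaha", "state": "NE"}]'
csv_text = json_records_to_csv(sample)
print(csv_text)
```

Nested JSON (objects inside objects) would need flattening first, which is where dedicated conversion tools earn their keep.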
And there's also lots of great data out there on social media. Accessing that data often involves interacting with the social media platform's API (application programming interface). Here's more information about Twitter's API; Twitter, in general, is a great social media data resource, because most tweets are public. I highly recommend this book if you want to learn more about mining social media data:

Tuesday, August 15, 2017

Every Now and Then: Total Eclipse of the Sun

We're less than a week away from the total solar eclipse that will make its way across the United States from Oregon to South Carolina. It seems that everyone is getting in on the fun. For instance, the most recent XKCD:


Sky and Telescope provides this list of apps to use the day of the eclipse.

Unfortunately, some companies are taking advantage of the eclipse frenzy by selling counterfeit glasses - glasses that fail to comply with the proper standards. Amazon has been issuing refunds to people who purchased glasses that may not meet the proper standards. The American Astronomical Society published this list of reputable vendors.

I plan to watch the eclipse from St. Joseph, MO, which is close to where I grew up in Kansas City, KS. (I even applied for and almost accepted a job in St. Jo back in 2010, but opted to work for the VA instead.)

Monday, August 14, 2017

On Charlottesville and Trump

As you probably already know, a rally calling itself "Unite the Right" convened this weekend in Charlottesville, VA, to protest the removal of a monument to Robert E. Lee. The rally quickly turned violent when a car was driven into an anti-racism protest organized as a response to the Unite the Right rally; 19 were injured and 1 was killed. Two state police officers called to assist with maintaining order also died in a helicopter crash.

Many were calling for the President to respond to the rally.


When the President eventually did respond, he failed to distance himself from these individuals and the organizations they represent, and emphasized that there was violence and hatred on many sides:
We condemn in the strongest possible terms this egregious display of hatred, bigotry and violence, on many sides. On many sides. It's been going on for a long time in our country. Not Donald Trump, not Barack Obama. This has been going on for a long, long time.
As Julia Azari of FiveThirtyEight points out, though Presidential responses to racial violence have always been rather weak, Trump's are even weaker.

I walk by Trump Tower in Chicago every day on my way to work. Here's what I saw in front of the building today:


Sunday, August 13, 2017

Statistics Sunday: How Does the Consumer Price Index Work?

You may have heard news stories about how much consumer prices have risen (or fallen) in the last month, like this recent one. And maybe, like me, you've wondered, "But how do they know?" It's all thanks to the Consumer Price Index, released each month by the Bureau of Labor Statistics. The most recent CPI came out Friday.

The CPI is a great demonstration of sampling and statistical analysis, so for today's Statistics Sunday, we'll delve into the history and process of the CPI.

What is the Consumer Price Index?

The CPI is based on prices of a representative sample (or what the Bureau of Labor Statistics calls a "basket") of goods and services - the things that the typical American will buy. These prices are collected each month in 87 urban areas, from about 23,000 retail and service establishments and 50,000 landlords and tenants, then weighted by total expenditures (how much people typically spend on each) from the Consumer Expenditure Survey. What they get as a result is a measure of inflation: how much the price of this sample of goods and services has changed over time. The CPI can also be used to correct for inflation (when making historical comparisons) and to adjust income (for industries that have wages tied to the CPI through a collective bargaining agreement).
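To make the weighting idea concrete, here's a heavily simplified sketch of a fixed-weight price index, plus the inflation correction it allows. It is not the Bureau's actual methodology, which is far more involved. The three categories, the prices, the expenditure weights, and the function names are all invented for illustration.

```python
def price_index(base_prices, current_prices, weights):
    """Expenditure-weighted average of price relatives, scaled so base = 100."""
    total_weight = sum(weights.values())
    weighted_relatives = sum(
        weights[item] * current_prices[item] / base_prices[item]
        for item in weights
    )
    return 100 * weighted_relatives / total_weight

def adjust_for_inflation(amount, index_then, index_now):
    """Express an amount from the earlier period in the later period's dollars."""
    return amount * index_now / index_then

# Hypothetical basket: prices in a base period and in the current period.
base_prices = {"food": 100.0, "housing": 1000.0, "transportation": 50.0}
current_prices = {"food": 110.0, "housing": 1020.0, "transportation": 52.0}

# Hypothetical expenditure shares (how much of a typical budget each takes).
weights = {"food": 0.2, "housing": 0.5, "transportation": 0.3}

index = price_index(base_prices, current_prices, weights)
print(f"Index: {index:.1f}")  # base period = 100

salary_then = 40000
print(f"Equivalent salary now: {adjust_for_inflation(salary_then, 100, index):.0f}")
```

Food rose 10 percent in this example, but because it makes up only a fifth of the budget it moves the index less than the 2 percent rise in housing, which makes up half. That's the whole point of weighting by expenditures.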

What's In the Basket?

The basket is determined from the results of the Consumer Expenditure Survey - the most recent one was in 2013 and 2014. These data are collected through a combination of interviews (often computer-guided, where an interviewer contacts the interviewee in person or over the phone, and asks a series of questions) and diary studies (in which families track their exact expenditures over a two-week period). The interviews and diaries assess over 200 categories of goods and services, which they organize into 8 broad categories:
  • Food and beverages - things like cereal, meat, coffee, milk, and wine
  • Housing - rent, furniture, and water or sewage charges
  • Apparel - clothing and certain accessories, like jewelry
  • Transportation - cost of a new car, gasoline, tolls, and car insurance
  • Medical care - prescriptions, cost of seeing a physician, or glasses
  • Recreation - television, tickets to movies or concerts, and sports equipment
  • Education and communication - college tuition, phone plans, and postage
  • Other goods and services - a catch-all for things that don't fit elsewhere, like tobacco products or hair cuts

How is this Information Collected?

Believe it or not, the people who collect data for the CPI either call or visit establishments to get the prices. The data are sent to commodity experts at the Bureau, who review the data for accuracy, and may make changes to items in the index through direct changes or statistical analysis. For instance, if an item on the list, like a dozen eggs, changes in some way, such as stores selling eggs in packs of 10 instead, the commodity experts have to determine if they should change the index or conduct analysis to correct for changing quantity. This is a pretty easy comparison to make (10 eggs versus 12 eggs), of course, but when the analysts start dealing with two products that may be very different in features (such as comparing two different computers or tuition from different colleges), the analysis to equalize them for the index can become very complex. So not only are items weighted to generate the full index, but statistical analysis can occur throughout data preparation for generating the index.
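The egg example is simple enough to work through by hand, but here it is as a quick sketch. The prices and the function name are made up, and this shows only the easy case of a pure quantity change, not the feature-based adjustments the analysts use for things like computers.

```python
def quantity_adjusted_price(new_price, new_quantity, reference_quantity):
    """Rescale a price to the package size the index originally tracked."""
    price_per_unit = new_price / new_quantity
    return price_per_unit * reference_quantity

# Hypothetical prices: a dozen eggs used to cost $3.00;
# the store now sells packs of 10 for $2.75.
old_price_per_dozen = 3.00
new_price_per_ten = 2.75

adjusted = quantity_adjusted_price(new_price_per_ten, 10, 12)
change = adjusted / old_price_per_dozen - 1

print(f"Price of 12 eggs at the new rate: ${adjusted:.2f}")
print(f"Quantity-adjusted price change: {change:+.1%}")
```

The sticker price dropped by a quarter, but once you put both prices on a per-dozen basis, eggs actually got about 10 percent more expensive. Comparing the shelf prices directly would have pointed the index in the wrong direction.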

Data for the three largest metropolitan areas - LA, New York, and Chicago - are collected monthly. Data for other urban areas are collected every other month, or twice a year.

History of the CPI

The history of the CPI can be traced back to the late 1800s. The Bureau of Labor, which later became the Bureau of Labor Statistics, did its first major study from 1888 to 1891. This study was ordered by Congress to assess tariffs they had introduced to help pay off the debt from the Civil War. They were interested in key industrial sectors: iron and steel, coal, textiles, and glass. This is one of the first examples of applying indexing techniques to economic data.

From then on, the Bureau often did small statistical studies to answer questions for Congress and the President. From 1901 to 1903, they broadened their scope by doing a study of family expenditures, as well as analysis of costs from retailers, and applied the indexing techniques they had developed for industry to retail and living expenses. They published the results in a report called Relative Retail Price of Food, Weighted According to the Average Family Consumption, 1890 to 1902 (base of 1890–1899). Despite seeming quite dull from the title and subject matter, this report was actually quite controversial, because it highlighted a gap between growth in wages and increase in cost of living - that is, wages had grown more than costs, resulting in increased purchasing power. But it was released during a banking crisis, when many people were laid off and wages were cut, so the Bureau was accused of being politically motivated in their research and conclusions.

As a result of the outcry, and budget concerns, research by the Bureau was halted in 1907, and was very limited in scope when it returned in 1911, assessing fewer items and using mail surveys from retailers rather than visits by Bureau staff.

New leadership in the Bureau and the beginning of World War I rekindled research efforts. They began publishing a retail price index twice a year in 1919. But the Bureau got a major overhaul thanks to the efforts of FDR's Secretary of Labor, Frances Perkins. She made efforts to modernize the organization and recruit experts in the fields of methodology and statistical analysis. Two major contributors were American economist and statistician Helen Wright and British statistician Margaret Hogg. In fact, Hogg conducted analysis that demonstrated the weights then used for the index were biased, overstating the importance of food and understating the importance of other goods and services. When they also made changes to the sample of prices to include, they had to hire more staff to go out and collect price data.

Other major changes in the history of the CPI included introducing an index specific to "lower-salaried workers in large cities" in the early 1940s, a gradual shift from a constant-goods (where the same basket is always used) to a constant-utility (where goods for the basket are determined by level of utility or satisfaction - that is, new useful goods can be added) framework from the 1940s to 1970s, and a partnership with the U.S. Census Bureau in the late 1970s. The first collective bargaining agreements - in which companies agreed to link workers' wages to the CPI to prevent strikes - occurred in the late 1940s and early 1950s.

Summing It All Up

Not only is the CPI an index of inflation - it represents cultural shifts in how we think about and consume goods and services. The shifting basket over time reflects changes in our day-to-day lives, the birth and/or death of different industries, and the changes in technology.

I'll admit, I wasn't really that interested in the CPI until I learned about the contributions of statisticians over the years. And it's an example of women making strong contributions to economic and statistical thought, so it's a shame that we don't hear more about it. In fact, statistician Dr. Janet Norwood, who joined the Bureau in 1963, and served as commissioner from 1979 to 1991, made some very important changes in her time there. For instance, a representative of the policy arm of the Department of Labor sat in on meetings about research results and press releases from the Bureau - until Dr. Norwood stopped this practice to make sure economic information was seen as accurate and nonpartisan.

If you're now as fascinated as I am, you can learn more about the CPI and its data here.