Saturday, June 24, 2017

Historical Children's Literature (And Why I'll Never Run Out of Reading Material)

Via a writer's group I belong to, I learned about the Baldwin Library of Historical Children's Literature, a digital collection maintained by the University of Florida. A past post from Open Culture provides some details:
Their digitized collection currently holds over 6,000 books free to read online from cover to cover, allowing you to get a sense of what adults in Britain and the U.S. wanted children to know and believe. Several genres flourished at the time: religious instruction, naturally, but also language and spelling books, fairy tales, codes of conduct, and, especially, adventure stories—pre-Hardy Boys and Nancy Drew examples of what we would call young adult fiction, these published principally for boys. Adventure stories offered a (very colonialist) view of the wide world; in series like the Boston-published Zig Zag and English books like Afloat with Nelson, both from the 1890s, fact mingled with fiction, natural history and science with battle and travel accounts.
The post highly recommends checking out the Book of Elfin Rhymes, one of many works of fantasy from the turn of the century - similar to a childhood favorite of mine, the Oz book series by L. Frank Baum, a world I continue to visit in my adult life through antique book collecting and occasional rereading. The illustrations of Elfin Rhymes are similar to the detailed ones you would find in a first edition (or reprinted vintage edition) of an Oz book:

And if you're looking for more classics (and beyond) to read for free, Open Culture shares a list of 800 free ebooks here. This is a good find considering I'm spending my afternoon cleaning out my bookshelf, putting books I've read (and am unlikely to reread soon) into storage to make room for new ones. My reading list continues to grow...

Friday, June 23, 2017

Map From the Past

I'm finally home from Colorado. On my flight yesterday (my 8th flight in the last month), I listened to a podcast from Stuff You Should Know on How Maps Work.

On this podcast, I learned about an international incident from 7 years ago that I missed at the time - Google Maps almost started a war:
The frenzy began after a Costa Rican newspaper asked Edén Pastora, a former Sandinista commander now in charge of dredging the river that divides the two countries, why 50 Nicaraguan soldiers had crossed the international frontier and taken up positions on a Costa Rican island. The ex-guerrilla invoked the Google Maps defense: pointing out that anyone Googling the border could see that the island in the river delta was clearly on Nicaragua’s side.
This dispute was one incident in a long line of border disputes between Costa Rica and Nicaragua, dating back to the 1820s. The Cañas–Jerez Treaty was enacted in 1858 to alleviate these tensions, and it seemed to work for a while. The International Court of Justice ruled on this small island in 2015, reaffirming that the disputed piece of land belongs to Costa Rica.

You can read an overview of this dispute here.

Tuesday, June 20, 2017

He's No Frank Underwood

Two special elections are happening today: one in the 6th Congressional district of Georgia - the race receiving the most attention - and one in the 5th Congressional district of South Carolina, which happens to be the home district of fictional politician Frank Underwood of Netflix's House of Cards. And Democrat Archie Parnell seems to be having a great time highlighting this connection. Check out this campaign ad:

Harry Enten of FiveThirtyEight explains why this special election matters, despite receiving less attention:
Voters in the South Carolina 5th are choosing between Republican Ralph Norman, a former state representative, and Democrat Archie Parnell, a former Goldman Sachs managing director who has been using ads parodying Underwood to draw attention to his campaign.

[T]his is not the type of district where Democrats tend to be competitive. It’s not even the type of district where they need to be competitive to win the House next year. Democrats need a net gain of only 24 seats from the Republicans to do that. And there are 111 districts won by Republican House candidates in 2016 that leaned more Democratic than the South Carolina 5th.

There hasn’t been a lot of polling of the South Carolina race, but what we do have shows that Parnell is outperforming the district’s default partisan lean, just not by nearly enough.

Even if Norman wins, as expected, we will still learn something about the state of U.S. politics. As I’ve written before, when one party consistently outperforms expectations in special elections in the runup to a midterm election, that party tends to do well in those midterms.

So keep an eye on how much Parnell loses by (assuming he loses). The closer Norman comes to beating Parnell by 19 points (or more) — the default partisan lean of the district — the better for the Republican Party. A Parnell loss in the low double digits, by contrast, would be consistent with a national shift big enough for Democrats to win the House.

Monday, June 19, 2017

Alexa, Buy Whole Foods

Back in May, I shared a story from the Guardian that Whole Foods' sales were declining and the company would be downsizing. The explanation was a combination of high prices (it's called Whole Paycheck for a reason) and the increased availability of organic and specialty products at other grocery stores.

Friday, it was announced that Amazon would be buying Whole Foods:
Wall Street is betting Amazon (AMZN, Tech30) could be as disruptive to the $800 billion grocery industry as it has already proved to be for brick-and-mortar retail businesses.

Amazon already had a relatively small grocery business of its own, Amazon Fresh, but its acquisition of Whole Foods is a much more ominous sign for competitors.

Traditional grocers are already struggling with fierce competition and falling prices. Amazon's war chest and online strength, coupled with Whole Foods' brand power, could force grocers to cut costs and spend heavily on e-commerce.

"For other grocers, the deal is potentially terrifying," Neil Saunders, managing director of GlobalData Retail, said in a report on Friday. "Amazon has moved squarely onto the turf of traditional supermarkets and poses a much more significant threat."
And of course, Twitter users had a lot to say about the deal:
Stock prices for other grocers fell Friday, erasing about $22 billion in market value. Obviously this isn't trivial, but after finishing Nassim Taleb's Fooled by Randomness recently, in which he specifically discusses randomness in the market, I'd be more interested in seeing what happens long-term (I'm expecting some regression to the mean soon).

And there's the big question - what will happen to Whole Foods? You can already buy groceries through Amazon, including more "mainstream" products you don't see in Whole Foods. Will Whole Foods become just another grocery store?

Sunday, June 18, 2017

Statistics Sunday: Past Post Round-Up

For today's post, I thought I'd share what I consider my favorite posts on statistics - in this case, favorite means either a topic I really love or a post I really enjoyed writing (and for certain posts, those two are the same thing). Here are my favorite statistics posts:

  • Alpha, one of the most important concepts in statistics, in which I also give a short introduction to probability
  • Error, which builds on probability information from previous posts, and starts to introduce the idea of explained and unexplained variance
  • N-1, a concept many of my students struggled to understand in introductory statistics - this post helped me solidify my thoughts on the topic, and I think I understand it much better for having written about it
  • What's Normal Anyway, my first Statistics Sunday post, which had the added bonus of proving to myself there is a way to explain skewness and kurtosis in a way people understand, and that these don't need to be considered advanced topics
  • Analysis of Variance, which used the movie theatre example I first came up with when I taught statistics for the first time - I remember overhearing my students during their final exam study sessions saying to each other, "Remember the movie theatre..."
I plan on getting back to writing regular posts soon, and have a list of statistics topics to sit down and write about. Stay tuned.

Friday, June 16, 2017


I haven't blogged in the last few days. Why? I'm back in Colorado again. (Sing that last line to the tune of Aerosmith's Back in the Saddle if you could.) A family health issue called me back and I'm writing this post from a dingy motel room with a large no smoking sign that I find hilarious because the room reeks of smoke - but it was the only place with a room available not too far from the hospital. But hey, I'm in Colorado, so here's what I'm doing for fun:
  • Trying all the Colorado beer - I'm currently having New Belgium Voodoo Ranger IPA in my dingy motel room; but I've recently had: Breckenridge Mango Mosaic Pale Ale; a flight at Ute Pass Brewing Company that included their Avery IPA, High Point Amber, Sir Williams English Ale, and Kickback Irish Red plus a tap guest of Boulder Chocolate Shake Porter; and an Oskar Blues Blue Dream IPA
  • Listening to all the podcasts, including an excellent one about how beer works from Stuff You Should Know, as well as some of my favorite regular podcasts from Part-Time Genius, WaPo's Can He Do That?, FiveThirtyEight Politics, StarTalk, Overdue, and Linear Digressions
  • Enjoying three new albums: Spoon's Hot Thoughts, Lorde's Melodrama, and Michelle Branch's (She's still making music! My college self is thrilled!) Hopeless Romantic
  • Reading The Black Swan: The Impact of the Highly Improbable by Nassim Nicholas Taleb, which Daniel Kahneman said "changed my view of how the world works"; Kahneman, BTW, is a social psychologist with a Nobel Prize in Economics
  • Also reading (because one can never have too many books) Sports Analytics and Data Science: Winning the Game with Methods and Models by Thomas W. Miller - because I've been trying to beef up my data science skills and thought doing it with data I really enjoy (i.e., sports data) would help motivate me
  • Acquiring new skills such as hitching a fifth wheel (sadly I didn't discover or watch this video until long after hitching the fifth wheel), driving about 70 miles with said fifth wheel, and storing said fifth wheel - I'm considering adding these skills to my résumé
Tomorrow, I'm planning to spend a few hours checking out the Colorado Renaissance Festival. For now, here's a picture from the Garden of the Gods today:

Tuesday, June 13, 2017

What Democrats and Republicans Can Agree On

Yesterday, I listened to the FiveThirtyEight podcast in which they discussed "the base" - both Democratic and Republican - and they spent some time trying to operationally define what would be considered the base of these parties.

This is actually surprisingly difficult. As is said in the podcast, ideology (a continuum from liberal to conservative) and party affiliation (e.g., Democrat, Republican) are two different things, and although they do go together sometimes, they can also diverge. Determining whether a person is part of the Democratic or Republican base has to be more than simply determining if they're liberal or conservative. They also have to align with party activities and causes, and have a voting track record aligning with the party.

I highly recommend giving the podcast a listen.

In the podcast, they also talk about the parties more generally and even highlight some of the things Republicans and Democrats can agree on - specifically that the President should stay off of Twitter. So U.S. Representative Mike Quigley's COVFEFE (Communications Over Various Feeds Electronically for Engagement) Act is well-timed:
This bill codifies vital guidance from the National Archives by amending the Presidential Records Act to include the term “social media” as a documentary material, ensuring additional preservation of presidential communication and statements while promoting government accountability and transparency.

“In order to maintain public trust in government, elected officials must answer for what they do and say; this includes 140-character tweets,” said Rep. Quigley. “President Trump’s frequent, unfiltered use of his personal Twitter account as a means of official communication is unprecedented. If the President is going to take to social media to make sudden public policy proclamations, we must ensure that these statements are documented and preserved for future reference. Tweets are powerful, and the President must be held accountable for every post.”

In 2014, the National Archives released guidance stating its belief that social media merits historical recording. President Trump’s unprecedented use of Twitter calls particular attention to this concern. When referencing the use of social media, White House Press Secretary Sean Spicer has said, “The president is president of the United States so they are considered official statements by the president of the United States.”

Sunday, June 11, 2017

Statistics Sunday: Parametric versus Nonparametric Tests

In my posts about statistics, I've tried to pay some attention to the assumptions of different statistical tests. One of the key assumptions of many tests is that data are normally distributed. I should add that this is a key assumption for many of what we call 'parametric' tests.

Remember that in statistics lingo, parameter is the term we use to describe values that apply to populations, whereas statistics are values created with samples. When we try to generalize back to the population, we want our sample data to follow a similar distribution as the population - this distribution is often normal but not always. In any case, any time we make assumptions about the distribution of data, we use parametric tests that include these assumptions. The t-test is considered a parametric test, because it includes assumptions about the sample (and hence, the population) distribution.

But if your data are not normally distributed, there are still many tests you can use, specifically ones that are known as distribution-free or 'non-parametric' tests. During April A to Z, I talked about Frank Wilcoxon. Wilcoxon contributed two tests that are analogues to the t-test, but have no assumptions about distribution.

To be considered a parametric test, it isn't necessary to have an assumption that data are normally distributed, because there are many types of distributions data can follow; an assumption of normality is a sufficient but not necessary condition. What is necessary to be a parametric test is to have some assumption of what the data should look like. If test assumptions make no mention about data distribution, it would be considered a non-parametric test. One well-known non-parametric test is the chi-square, which I'll blog about in the near future.
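To make the contrast concrete, here's a minimal sketch in Python using SciPy - the data are simulated, and the specific numbers are invented purely for illustration: a t-test and one of Wilcoxon's distribution-free analogues (the Mann-Whitney U, equivalent to the rank-sum test) run on the same two groups.

```python
# A sketch comparing a parametric test (the t-test) with a
# nonparametric analogue (the Mann-Whitney U, equivalent to
# Wilcoxon's rank-sum test). All data are simulated.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(loc=50, scale=10, size=30)  # roughly normal scores
group_b = rng.normal(loc=55, scale=10, size=30)

# Parametric: assumes the scores come from normal distributions
t_stat, t_p = stats.ttest_ind(group_a, group_b)

# Nonparametric: compares ranks, with no distributional assumption
u_stat, u_p = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")

print(f"t-test p = {t_p:.3f}; Mann-Whitney U p = {u_p:.3f}")
```

When the data really are normal, the two tests usually agree; when they aren't, the rank-based test is the safer choice.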

Saturday, June 10, 2017

Alan Smith on Why You Should Love Statistics

I happened upon this Ted Talk from earlier in the year, in which Alan Smith explains why he loves (and why you should love) statistics - his reason is very similar to mine:

Friday, June 9, 2017

Catching Up on Reading

I've been on vacation (currently in Denver) and haven't made time to blog, although I'm sure I'll be blogging regularly again when we return to Chicago. I've still been keeping up with my favorite blogs, but today will be spent squeezing in our last bit of Denver sightseeing before flying up to Montana to visit family for the weekend. So here's my reading list for when I get a little downtime at the airport:

Monday, June 5, 2017

Greetings from Colorado

I'm writing this post from my cabin in Woodland Park, CO, about 30 minutes from Colorado Springs. We flew in yesterday afternoon and despite a forecast of rain for our full visit, the weather is sunny and clear. Here are some photo highlights, with more to come:

We'll have to pick up some of this excellently named jerky when we go back to the airport.

The castle rock in the aptly named Castle Rock, CO.

As we got closer to Woodland Park, we drove through these gorgeous tree-populated hills...

and red rocks. I'll get better pictures when we head back to Colorado Springs later today for lunch at a brewery.

Our home for the next couple days in Woodland Park, CO.

Our cute cabin...

and my parents' cute dog, Teddy, who came to greet us shortly after our arrival.

We had a nice view of Pikes Peak at dinner last night. We'll have an even better view when we take the tram up the mountain tomorrow.

And because it's Colorado:

The ashtray right outside our cabin is clearly marked "Cigarettes only." Hmm, what else would people be smoking in Colorado? ;)

Sunday, June 4, 2017

Statistics Sunday: Linear Regression

Back in Statistics in Action, I blogged about correlation, which measures the numerical strength of a linear relationship between two variables. Today, I'd like to talk about a similar statistic that differs mainly in how you apply and interpret it: linear regression.

Recall that correlation ranges from -1 to +1, with 0 indicating no relationship and the sign indicating the direction: positive when the two variables move up together, negative when one goes up as the other goes down. That's because correlation is standardized: to compute a correlation, you have to convert values to Z-scores. Regression is essentially correlation, with a few key differences.

First of all, here's the equation for linear regression, which I'm sure you've seen some version of before:

y = bx + a

You may have seen it instead as y = mx + b or y = ax + b. It's a linear equation:

A linear equation is used to describe a line, using two variables: x and y. That's all regression is. The difference is that the line is used as an approximation of the relationship between x and y. We recognize that not every case falls perfectly on the line. The equation is computed so that it gets as close to the original data as possible, minimizing the (squared) deviations between the actual score and the predicted score. (BTW, this approach is called least squares, because it minimizes the squared deviations - as usual, we square the deviations so they don't add up to 0 and cancel each other out.)
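Here's a minimal sketch of that least-squares idea in Python with NumPy; the data points are invented for illustration:

```python
# Fit a line y = bx + a by least squares, minimizing the
# squared deviations between actual and predicted scores.
# The data are made up for illustration.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

b, a = np.polyfit(x, y, deg=1)  # slope b and constant a
predicted = b * x + a

# The quantity least squares minimizes:
squared_deviations = np.sum((y - predicted) ** 2)
print(f"y = {b:.2f}x + {a:.2f}")  # prints "y = 1.96x + 0.14"
```

No other line through these points produces a smaller sum of squared deviations - that's the sense in which the fitted line gets "as close to the original data as possible."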

As with so many statistics, regression uses averages (means). To dissect this equation (using the first version I gave above), b is the slope, or the average amount y changes for each 1-unit change in x. a is the constant, or the average value of y when x is equal to 0. Because we have one value for slope, we assume there is a linear relationship between y and x - that is, the relationship is the same across all possible values. So regardless of which values we choose for x and y (within our possible ranges), we expect the relationship to be the same. There are other regression approaches we use if and when we think the relationship is non-linear, which I'll blog about later on.

Because our slope is the amount of change we expect to see in y and our constant is the average value of y for x=0, these two values are in the same units as our y variable. So if we were predicting how tall a person is going to grow in inches, y, the slope (b), and the constant (a) would all be in inches. If we use standardized values, which is an option in most statistical programs, our b would be equal to the correlation between x and y.
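That last claim is easy to check numerically. A quick sketch (NumPy; the height numbers are invented): after z-scoring both variables, the fitted slope matches the Pearson correlation.

```python
# With standardized (z-scored) variables, the regression slope
# equals the correlation between x and y. Invented height data.
import numpy as np

x = np.array([61.0, 64.0, 67.0, 70.0, 73.0])  # e.g., parent height (inches)
y = np.array([63.0, 65.0, 66.0, 69.0, 71.0])  # e.g., child height (inches)

zx = (x - x.mean()) / x.std()  # convert to Z-scores
zy = (y - y.mean()) / y.std()

slope_standardized = np.polyfit(zx, zy, deg=1)[0]
r = np.corrcoef(x, y)[0, 1]  # Pearson correlation

print(round(slope_standardized, 4), round(r, 4))  # the two values match
```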

But what if we want to use more than one x (or predictor) variable? We can do that, using a technique called multiple linear regression. We would just add more b's and x's to the equation above, giving each a subscript number (1, 2, ...). There are many cases where more than one variable would predict our outcome.

For instance, it's rumored that many graduate schools have a prediction (regression) equation they use to predict grad school GPA of applicants, using some combination of test scores, undergraduate GPA, and strength of recommendation letters, to name a few. They're not sharing what that equation is, but we're all very sure they use them. The problem when we use multiple predictors is that they are probably also related to each other. That is, they share variance and may predict some of the same variance in our outcome. (Using the grad school example, it's highly likely that someone with a good undergraduate GPA will also have, say, good test scores, making these two predictors correlated with each other.)

So when you conduct multiple linear regression, you're not only taking into account the relationship between each predictor and the outcome; you're also correcting for the fact that the predictors are correlated with each other. So when you're conducting multiple regression, you want to check the relationship between your predictors. If two variables are highly related to each other, to the point that one could be used as a proxy for the other, your variables are collinear, meaning that they predict the same variance in your outcome. Weird things happen when you have collinear variables. If the shared variance is very high (almost full overlap in a Venn diagram), you might end up having a variable that should have a positive relationship with the outcome showing a negative slope. This is because one variable is correcting for overprediction; if this happens, we call it suppression. The only way to deal with it is to drop one of the collinear variables.
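A small sketch of multiple regression in Python (NumPy's least-squares solver; the grad-school numbers are entirely made up), including a quick check of how correlated the two predictors are with each other:

```python
# Multiple linear regression: y = a + b1*x1 + b2*x2, solved by
# least squares. All data below are simulated/invented.
import numpy as np

rng = np.random.default_rng(1)
ugrad_gpa = rng.uniform(2.5, 4.0, size=50)           # predictor 1
test_score = 40 * ugrad_gpa + rng.normal(0, 10, 50)  # predictor 2, correlated with 1
grad_gpa = 0.5 * ugrad_gpa + 0.005 * test_score + rng.normal(0, 0.1, 50)

# Design matrix: a column of 1s for the constant, then the predictors
X = np.column_stack([np.ones_like(ugrad_gpa), ugrad_gpa, test_score])
coefs, *_ = np.linalg.lstsq(X, grad_gpa, rcond=None)
a, b1, b2 = coefs

# Before trusting b1 and b2, check how related the predictors are
r_predictors = np.corrcoef(ugrad_gpa, test_score)[0, 1]
print(f"predictor correlation = {r_predictors:.2f}")
```

When that predictor correlation gets very high, the individual slopes become unstable - which is exactly the collinearity problem described above.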

Obviously, it's unlikely that your regression equation will perfectly describe the relationship between/among variables. The equation will always be an approximation. So we measure how good our regression equation is at predicting outcomes using various metrics, including the proportion of variance in the outcome variable (y) predicted by the x('s), as well as how far the predicted y's (using the equation) are from the actual y's - we call this metric residuals.
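Those two metrics - residuals and the proportion of variance explained (usually reported as R-squared) - can be computed directly. A short Python sketch with invented data:

```python
# Residuals (actual minus predicted) and R-squared, the
# proportion of variance in y the regression explains.
# Data invented for illustration.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.2, 2.3, 2.9, 4.1, 5.2, 5.9])

b, a = np.polyfit(x, y, deg=1)
predicted = b * x + a

residuals = y - predicted             # how far off each prediction is
ss_res = np.sum(residuals ** 2)       # unexplained (residual) variance
ss_tot = np.sum((y - y.mean()) ** 2)  # total variance in y
r_squared = 1 - ss_res / ss_tot

print(f"R-squared = {r_squared:.3f}")
```

An R-squared near 1 means the line accounts for nearly all the variance in y; large residuals drag it toward 0.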

In a future post, I'll show you how to conduct a linear regression. It's actually really easy to do in R.

Saturday, June 3, 2017

In Good Taste

Several years ago, while I was still in grad school and teaching college classes regularly, I attended a workshop at the Association for Psychological Science Teaching Institute (which occurs right before the full APS conference). The workshop was a demonstration of different taste perception activities one could use in either an introductory psychology or sensation & perception course. One activity used paper that had been soaked in a bitter-tasting chemical (probably phenylthiocarbamide); you placed the paper on the tip of your tongue. This activity allows people to identify whether they're a "super-taster," meaning they have a lot of bitter tastebuds. My reaction to the bitter taste was immediate, meaning I'm a super-taster. I was also one of the youngest people in the room, and the person running the workshop went on to share that children have more bitter tastebuds than adults, which may explain why they don't tend to like bitter-tasting foods, like Brussels sprouts or broccoli, as much as adults.

Our tastes really do change over time, and there are also a lot of individual differences when it comes to taste, even among people from the same age group. This month's FiveThirtyEight Sparks podcast involves a discussion about differences in taste, as well as an interview with Bob Holmes, author of Flavor: The Science of Our Most Neglected Sense:

The group also does some flavor tripping. We did a little of that in the APS workshop, and a few years ago I attended a flavor tripping party with a few friends.

Friday, June 2, 2017

State Maps

You've probably seen the most recent state map making the rounds, which displays the most often misspelled word in all 50 states. XKCD had a brilliant response: