Deeply Trivial: November 2017

Thursday, November 30, 2017

Comments Supporting the Repeal of Net Neutrality Are Likely Fake

Trump loves talking about the "millions of illegal voters." Well, there are millions of pro-repeal net neutrality comments that are very likely fake:

NY Attorney General Schneiderman estimated that hundreds of thousands of Americans’ identities were stolen and used in spam campaigns that support repealing net neutrality. My research found at least 1.3 million fake pro-repeal comments, with suspicions about many more. In fact, the sum of fake pro-repeal comments in the proceeding may number in the millions. In this post, I will point out one particularly egregious spambot submission, make the case that there are likely many more pro-repeal spambots yet to be confirmed, and estimate the public position on net neutrality in the “organic” public submissions.

Below, you can see in highlight some of the phrases that popped up again and again with highly similar syntax, and just a synonym switched out every once in a while:

Also suspect is the fact that the pro-repeal comments are more duplicative than the anti-repeal comments, meaning if these 1.3 million similarly worded comments were coming from grassroots efforts, you should see lots of duplication in comments on both sides. People writing in favor of keeping net neutrality deviate much more from the form letter.

So what truth can be gleaned from these comments?

It turns out old-school statistics allows us to take a representative sample and get a pretty good approximation of the population proportion and a confidence interval. After taking a 1000 comment random sample of the 800,000 organic comments and scanning through them, I was only able to find three comments that were clearly pro-repeal. That results in an estimate of the population proportion at 99.7%. In fact, we are so near 100% pro net neutrality that the confidence interval goes outside of 100%. At the very minimum, we can conclude that the vast preponderance of individuals passionate enough about the issue to write up their own comment are for keeping net neutrality.
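The arithmetic behind that estimate can be checked with a short sketch. It uses the figures quoted above (997 of 1,000 sampled comments in favor of keeping net neutrality) and assumes a 95% confidence level, which the quote does not state; the function names are my own. The usual normal-approximation (Wald) interval is the one that spills past 100%, while the Wilson score interval, a common alternative for proportions close to 0 or 1, stays inside the valid range.

```python
import math

def wald_interval(successes, n, z=1.96):
    """Normal-approximation (Wald) interval for a proportion."""
    p = successes / n
    margin = z * math.sqrt(p * (1 - p) / n)
    return p - margin, p + margin

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval, which always stays within [0, 1]."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Figures from the quoted analysis: 3 pro-repeal comments in a sample of 1,000.
# z = 1.96 (95% confidence) is an assumption; the quote gives no level.
n = 1000
pro_neutrality = n - 3

wald = wald_interval(pro_neutrality, n)
wilson = wilson_interval(pro_neutrality, n)
print(f"Point estimate: {pro_neutrality / n:.1%}")
print(f"Wald 95% CI:   {wald[0]:.4f} to {wald[1]:.4f}")
print(f"Wilson 95% CI: {wilson[0]:.4f} to {wilson[1]:.4f}")
```

The Wald upper bound lands just above 1, which is the "outside of 100%" problem described in the quote. The Wilson bounds both stay below 1, so that interval is probably the safer one to report for a proportion this extreme.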

Uptown Rats

I have a confession: I'm a rat lover. I had a pet rat named Lily in college, and she was the best pet I'd ever had: personable, smart, absolutely adorable. Don't get me wrong, I wouldn't snuggle up to a sewer rat - Lily was as good as she was in part because she was the offspring of lab rats and also because she was highly socialized (exposed to people) from the very beginning of her life.

Still, I find this study really cool:

For the past two years, [Fordham University graduate student, Matthew] Combs and his colleagues have been trapping and sequencing the DNA of brown rats in Manhattan, producing the most comprehensive genetic portrait yet of the city’s most dominant rodent population.

As a whole, Manhattan’s rats are genetically most similar to those from Western Europe, especially Great Britain and France. They most likely came on ships in the mid-18th century, when New York was still a British colony.

When Combs looked closer, distinct rat subpopulations emerged. Manhattan has two genetically distinguishable groups of rats: the uptown rats and the downtown rats, separated by the geographic barrier that is midtown. It’s not that midtown is rat-free—such a notion is inconceivable—but the commercial district lacks the household trash (aka food) and backyards (aka shelter) that rats like. Since rats tend to move only a few blocks in their lifetimes, the uptown rats and downtown rats don’t mix much.

To collect this genetic information, Combs and colleagues trapped rats - using an enticing bait made from peanut butter, bacon, and oats - and collected tissue samples, mostly from the tail. He also collected information on rats' locations, by crowd-sourcing data:

Fortunately, it sounds like Combs has come to appreciate rats as well:

After two years of trapping rats, Combs has come to respect the enemy. At the end of our conversation, he launched into an appreciation of rats—their ability to thrive on nearly anything, their prodigious reproduction, and their complex social structure, in which female rats will give birth all at the same time and raise their offspring in one nest. “They are, quote-unquote, vermin, and definitely pests we need to get rid of,” he says, “but they are extraordinary in their own ways.”

Wednesday, November 29, 2017

Winner, Winner, Chicken Dinner!

That's right, kids, I did it:

This was my third NaNo and my second win. Phew!

America's Existential Dread, Expressed as Christmas Decorations

If Melania Trump was trying to express how terrified we all are over the state of the country in Christmas decorations, she has succeeded admirably:

From light to darkness, an American Horror Story. pic.twitter.com/YlCgAr7k5s

— BWD 🤢 (@IrisRimon) November 28, 2017

And if you'd like to see more hilarious reactions to This Nightmare Before Christmas (Part 2, I believe, since Part 1 was the day after Election Day 2016), Bored Panda was kind enough to compile this list.

Statistical Sins: When the Data are Too Perfect

Yesterday, Ars Technica published an article about an investigation into the research of Nicolas Guéguen, a psychologist who has received a great deal of media attention for his shocking findings in gender effects. His research includes findings that men prefer women in heels or wearing red, and that men are more likely to help a woman wearing her hair down instead of up. But, according to James Heathers and Nick Brown, Guéguen's data and high publication rate are suspect:

What they've found raises a litany of questions about statistical and ethical problems. In some cases, the data is too perfectly regular or full of oddities, making it difficult to understand how it could have been generated by the experiment described by Guéguen.

Social media is where it all kicked off, when Nick Brown saw a tweet about a paper claiming that men were less likely to help a woman who had her hair tied up in a ponytail or a bun. When they looked more closely at the paper, something odd jumped out at them: the numbers in the paper looked strangely regular.

When you’re dividing by three, the decimal points will always follow this pattern: either .000, .333, or .666. If you divide by 30, the pattern just moves up a decimal place: the second decimal will always be 3 or 6.

In this study, every average score was divided by 30, because each group (male-ponytail, male-loose, female-bun, and so on) had 30 people in it. But every average number was perfectly round: 1.80, 2.80, 1.60. That’s … unlikely. “The chance of all six means ending in zero this way is 0.0014,” write Heathers and Brown in their critique.
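The 0.0014 figure can be reproduced with a few lines. This is only a sketch of the reasoning as I understand it, not Heathers and Brown's own calculation: it assumes scores are whole numbers, that a group total is equally likely to leave a remainder of 0, 1, or 2 when divided by 3, and that the six means are independent. The range of totals is arbitrary.

```python
from fractions import Fraction

GROUP_SIZE = 30   # participants per group, as reported in the quote
N_MEANS = 6       # number of group means Heathers and Brown refer to

# Second and third decimal places (truncated) of total / 30 for a range of totals.
# The range 0..300 is an arbitrary illustration, not the study's actual scale.
endings = {(total * 1000 // GROUP_SIZE) % 100 for total in range(0, 301)}
print(sorted(endings))  # [0, 33, 66]

# A mean is "round" at two decimals only when the total is a multiple of 3.
# Assumption: the three remainders are equally likely and the means are independent.
p_one_round = Fraction(1, 3)
p_all_round = p_one_round ** N_MEANS
print(p_all_round, round(float(p_all_round), 4))  # 1/729 0.0014
```

So a mean of 30 whole-number scores can end in 00, 33, or 66 (67 when rounded), and six out of six ending in 00 works out to 1 in 729 under these assumptions.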

Many of Guéguen's studies involve elaborate situations using confederates - research assistants who pretend to be a participant or random person on the street. But Guéguen publishes many single-author papers without acknowledgements. When Heathers and Brown reached out to Guéguen for more information on how he could publish so many elaborate studies on his own, he explained that he supervises many student projects. But why aren't the students at least thanked in the papers? Or listed as coauthors, as they should be if they're doing a great deal of the work?

Heathers and Brown have repeatedly reached out to Guéguen for some documentation to substantiate that these studies occurred as described, but email correspondence, ethics review committee reports, and original datasets have not been shared in many cases.

Brown will be publishing the results of his and Heathers's examination of Guéguen's work on his blog. The first critique can be found here.

As has happened before, this particular instance of alleged academic dishonesty is liable to lead to a discussion about the problems of the "publish or perish" mentality in academia and research. But, as in previous cases, it's unlikely that such a discussion will result in any real improvements to dissuade such dishonesty. When the benefit of publishing a great deal is high and the probability of being caught is low, these things will continue to happen. Completely fabricated research is rare and likely to remain so, but tiny slips into academic dishonesty - massaging numbers or dropping cases - will happen, even among the most honest of researchers.

Tuesday, November 28, 2017

So Close!

After falling dangerously behind, I've been able to play catchup on NaNoWriMo. Here are my current stats:

I only need to write 3600ish more words between now and midnight Thursday. I can do this!

It's a-Me!

Fellow writer Katie Roman invited members of the Chicago NaNoWriMo to answer some interview questions. Today, my "Author Spotlight" is up! I got to talk about my love of Ray Bradbury and Margaret Atwood, why self-doubt is public enemy number 1 to my writing, and why bars are better than coffee shops for people-watching and writing inspiration. Plus, we were asked to share a meme that describes our writing. Here's mine:

Monday, November 27, 2017

Statistics Sunday: Data Discovery

For today's (late) Statistics Sunday post, I was going to dig into FiveThirtyEight's Thanksgiving data, to find the real reason people in the West eat so much salad at Thanksgiving. As I was inspecting the data and readme file, I clicked back in the directory and found that FiveThirtyEight has shared a ton of data on GitHub. So instead of analyzing Thanksgiving data, I clicked through readme files of other data they had available.

Yes, I became distracted by new data.

Some favorites among the list:

Saturday, November 25, 2017

Perfect Fit to the Data

That's one way to do it:

Also this:

Friday, November 24, 2017

Statistical Sins: Thanksgiving Edition

Hopefully you had a wonderful Thanksgiving. I ended up traveling yesterday, and that combined with some illnesses in my family means I still haven't gotten the "official" Thanksgiving meal (turkey, sweet potatoes, etc.). That should happen tomorrow.

On the subject of Thanksgiving, FiveThirtyEight recently re-released the results of a survey from a couple years ago, showing the most disproportionately consumed Thanksgiving side dish:

Here’s the most disproportionately consumed Thanksgiving side dish in each region: https://t.co/WDpqVXnoSY pic.twitter.com/j8GuNSxmjx

— FiveThirtyEight (@FiveThirtyEight) November 22, 2017

And yes, I can say Kansans love their casseroles, especially green bean casserole. I've been to family Thanksgivings where there have been, I kid you not, 3 different versions of green bean casserole.

But what people are really reacting to is that the West coast just loves their salad:

I don't say this sentence often or lightly, so it should be savored by those of you inclined:

The rest of us could learn something from the South https://t.co/31FY6cF72Q
— Doug (@moonwalkmcfly) November 22, 2017

But some savvy Tweeters are suggesting that this could be an issue with the survey itself or analysis of the resulting data:

So I’m guessing that the pollsters got a range of answers about various types of cold side dish, and egregiously lumped them all together in the category of “salad,” leading to all of your justifiable anger.
— Emily Dagger (@AbbottRabbit) November 22, 2017

Want to try to answer this question yourself? FiveThirtyEight was kind enough to share the data on GitHub. Once I get some writing done, I might have to dig into this dataset.

Tuesday, November 21, 2017

If You Want to Terrify Your Family...

... you could make this roasted face hugger for Thanksgiving:

This dish is created with a whole chicken, snow crab legs, and chicken sausage. And it's only slightly more repulsive than the unholy mashup that is turducken.

Life in an Iron Lung

Those who forget the past are doomed to repeat it. And one of the situations where that aphorism is absolutely true is when it comes to vaccines. I feel very strongly about this topic, so strongly that I will feed the trolls even when I shouldn't and have unfriended people on Facebook who refuse to vaccinate their children.

No one dies from autism (or whatever anti-vaxxers are currently concerned about). But people die all the time from diseases that vaccines can prevent. Tens of thousands of people die from seasonal influenza each year. During the Spanish flu epidemic in 1918, 50-100 million people died from a strain of H1N1 - the same subtype that reemerged in 2009 (the "Swine flu"). And other conditions we are vaccinated for, like the measles or diphtheria, are far deadlier.

One of the most debilitating conditions for which we can vaccinate is poliomyelitis, or polio. People who survived polio went on to have a variety of complications, ranging from mobility limitations and muscle weakness to paralysis and severe difficulty breathing. Yesterday, Jennings Brown for Gizmodo published a story about 3 polio survivors who use an apparatus called an iron lung that helps them breathe. Two of them use the iron lung while they sleep, so they don't stop breathing in the night. One of them uses the iron lung nearly all the time.

The stories these 3 told were heart-breaking:

When [Martha] Lillard was a child, polio was every parent’s worst nightmare. The worst polio outbreak year in US history took place in 1952, a year before Lillard was infected. There were about 58,000 reported cases. Out of all the cases, 21,269 were paralyzed and 3,145 died.

Children under the age of five are especially susceptible. In the 1940s and 1950s, hospitals across the country were filled with rows of iron lungs that kept victims alive. Lillard recalls being in rooms packed with metal tubes—especially when there were storms and all the men, women, adults, and children would be moved to the same room so nurses could manually operate the iron lungs if the power went out. “The period of time that it took the nurse to get out of the chair, it seemed like forever because you weren’t breathing,” Lillard said. “You just laid there and you could feel your heart beating and it was just terrifying. The only noise that you can make when you can’t breathe is clicking your tongue. And that whole dark room just sounded like a big room full of chickens just cluck-cluck-clucking. All the nurses were saying, ‘Just a second, you’ll be breathing in just a second.’”

The polio vaccine nearly eradicated cases in the United States. There are still cases in parts of the world where vaccines are not readily available. The article comments that if one infected person were to visit Orange County, California, where many parents are opting out of vaccination, we could have a polio epidemic in the US for the first time in decades.

New iron lungs haven't been manufactured in many years, and new parts aren't available any longer either. In the article, the author observes that these 3 polio survivors are fortunate to have mechanically oriented people in their lives who have fixed and maintained their iron lungs. Iron lung users who were unable to find people with these skills died as a result.

The most important message in this article, spoken by experts as well as the 3 survivors, is to vaccinate:

But another thing they all had in common is a desire for the next generations to know about them so we’ll realize how fortunate we are to have vaccines. “When children inquire what happened to me, I tell them the nerve wires that tell my muscles what to do were damaged by a virus,” Mona [Randolph] said. “And ask them if they have had their vaccine to prevent this. No one has ever argued with me.”

[Paul] Alexander told me that if he had kids he would have made sure they were vaccinated. “Now, my worst thought is that polio’s come back,” he said. “If there’s so many people who’ve not been—children, especially—have not been vaccinated... I don’t even want to think about it.”

Lillard is heartbroken when she meets anti-vaccine activists. “Of course, I’m concerned about any place where there’s no vaccine,” she said. “I think it’s criminal that they don’t have it for other people and I would just do anything to prevent somebody from having to go through what I have. I mean, my mother, if she had the vaccine available, I would have had it in a heartbeat.”

Monday, November 20, 2017

Video RoundUp

Once I have some free time later this week, here are some videos on my "To Watch" list:

Sunday, November 19, 2017

Statistics Sunday: What Are Cognitive Interviews?

In a recent Statistical Sins post, I talked about writing surveys and briefly mentioned the concept of cognitive interviews. Today, I wanted to talk a little more about what they are and how they can be used to improve surveys and measures.

The purpose of a cognitive interview is to get into the mind of the individual, to understand their thought process. When conducting cognitive interviews for a survey or measure, your goal is to understand the respondents' thought process while completing the instrument.

There are two different ways you can conduct these interviews. The first is the think aloud technique. As the respondent completes the measure, you ask him or her to narrate the thought process. As I said in the statistical sins post, when people encounter a measure, they read an item and then determine their response. They then compare their internal response to the options given, and in essence translate their personal response to a supplied answer.

The problem I've encountered is that people can't always verbally narrate what they're thinking. Thought isn't always in words and sentences, and to communicate those thoughts, one must first translate nonverbal information into verbal information. Even when I want to do a think aloud cognitive interview, it will probably use elements of the second approach, direct probing.

For this approach, the researcher asks the respondent questions while he or she is looking at your instrument - specific questions that get at how they're approaching the instrument, and ways it could be improved. For instance, you might ask respondents what they think about when they hear a particular word or phrase that shows up in your instrument. This can help you identify potential misunderstandings or other words/phrases that would be more clear. You might ask if they like the response options and whether any options are missing. You could even ask if important items or questions are missing.

It is possible to do a mixture of the two. For example, I may have a person go through a measure while narrating their thoughts. I'll usually have a few general questions to help people keep narrating as they complete the measure - basically questions I use if they lapse into silence. Then, I'll have more targeted questions to help get specific responses to issues I care most about. Anytime I do interviews, I like to go from general to more specific topics, so this approach to cognitive interviews works well for me.

This approach should happen further along in the development of your survey or measure. If you're still trying to figure out what should go on your instrument, you'd be best to use different methods like literature reviews, convening an expert committee, or focus groups. But once you get farther along in your process, cognitive interviews are an important way to make sure you get good data from the instrument you've exerted a lot of effort to create.

Friday, November 17, 2017

Candy is Dandy But Liquor is Quicker

In yesterday's Chicago Tribune, Josh Noel longs for the day when beer tasted like beer:

After six hours wandering the aisles of the Festival of Wood and Barrel-Aged Beer last weekend, I have concluded that craft beer is betraying itself. It is forgetting what beer should taste like.

Though FOBAB, held this year at the University of Illinois at Chicago Forum on Friday and Saturday, remains Chicago’s most essential beer festival, corners of it have become a showcase for beer that tastes more like dessert than beer. “Pastry stouts,” the industry calls them.

Among the 376 beers poured at FOBAB this year, about 50 were pastry stouts, the largest share of the largest category at FOBAB. [These] beers are overrun with coffee, vanilla beans, coconut, cinnamon, chiles and cacao nibs.

So very many cacao nibs.

It’s a confounding moment in craft beer. The industry is still growing rapidly — 6,000 breweries operating and hundreds more in planning — and the race is on for differentiation. The problem is that the differentiation is seeming both too sweet and too repetitive.

I'll admit, I'm torn on this issue. I completely agree that many of these beers are far too sweet, not really tasting "like beer" but more like flavored syrup. But this isn't a new phenomenon. For instance, look at Reinheitsgebot, the German Beer Purity Laws, which were established centuries ago. I'm sure many of the beers Noel holds up as examples that "taste like beer" would fail to pass these purity laws.

If we go back to the very beginning of brewing - and that's a long way back, because beer is one of the first beverages humans produced; it even predates wine by about 2000 years - the first drinks we call beer are absolutely nothing like what we have today, brewed from barley and different kinds of grain. What makes something beer isn't so much about flavor or even the precise ingredients, but the process used to make it.

One of the reasons I love beer is the nuances of flavor. The many different varieties of beer have resulted from centuries of innovation. And I'm sure as each new style was invented, someone was longing for the days when beer tasted like beer. While I may not love the pastry stouts, and instead prefer beers that are more dry or bitter, I don't see any issue with this new trend. It highlights exactly what I love about beer - variety.

Besides, I'll try anything barrel-aged, even if I worry it will be too sweet for me.

Thursday, November 16, 2017

Finding Utopia

After an especially long workday yesterday, that started around 8:30 am and ended around 8:00 pm, I decided to grab a burger and a beer at one of my favorite beer gardens. Not only did I get to catch the end of the Hawks game, I was thrilled to see that they had Samuel Adams Utopias 2017 on their draft menu.

You might remember I posted about this beer a little over a week ago. This 2 ounce pour cost me $26, much cheaper than picking up my own bottle for $199 - not to mention, I'd need a bunch of friends to share that bottle with, since this beer comes in at 28% ABV. This is a beer one should sip slowly, and I did - my dinner companion finished an entire pint in the time it took me to finish my 2 ounces.

The beer was lovely: rich, malty, and slightly sweet. It's served at room temperature, as most cask-aged beers are, and is not carbonated, because of the high alcohol content. If you're a beer-lover and get the chance to try it, I highly recommend it.

Wednesday, November 15, 2017

Whoa, We're Halfway There

Reached (and surpassed) 25,000 words:

And because I just can't leave a song lyric unfinished:

NaNoWriMo Hump-Day: Some Resources for Day 15 (And Beyond)

We're reaching the midpoint of NaNoWriMo - and on a Wednesday, so today is like a Super Hump-Day. By the end of today, I plan to have at least 25,000 words written if it kills me. So if you too need help to get through the humpiest of all hump-days, here are some resources:

When your writing is just too "very," this list gives you replacements for "very + [adjective]"
Speaking of the middle of things, here's some advice on giving some love to the middle child of your novel, the middle act
Jeff Goins says the way to be a good writer is practice, practice
Daily Writing Tips pens a list: 40 shades of -ade
And if you're feeling self-doubt about how you could possibly write a novel [raises hand], know you're in good company

Tuesday, November 14, 2017

Where Serendipity Takes Me

I'm having one of those evenings where you feel like there is some strange force in the universe guiding you. Generally I don't believe in that kind of thing. But every once in a while, things like this happen in a chain that feels anything but random.

It's Tuesday, when I have belly dance class, which means I drive to work and head up to Evanston after the workday.

On my way to Evanston, one of my favorite songs ever comes on the radio: Head Over Heels by Tears for Fears. It was on a station I don't normally listen to but I had switched over due to an incredibly annoying ad on my usual station.

I realize I'm craving tacos so I stop in one of my favorite restaurants where I run into two good friends and have some delicious chicken tacos.

On my way out, I stop to pet a very sleepy 8-week-old golden retriever.

I round the corner to my class... and see the lights are dark and a closed sign is on the door. I think about returning to the taco place but instead decide to head to my favorite tap room.

So now I'm writing and listening to an Irish folk duo while sipping beer.

Life is good.

A Face Only a Mother Could Love

A couple of prehistoric sea creatures appeared recently, one off the coast of Portugal and the other in the harbors of Sydney, Australia. And they've got faces you just can't unsee:

[Shudder.]

In fact the frilled shark, which has 25 rows of nasty big pointy teeth, is a living fossil, because remains of this creature have been found dating back as much as 80 million years.

Our planet is both cool and terrifying.

Monday, November 13, 2017

Statistics Sunday: What is Bootstrapping?

Last week, I posted about the concept of randomness. This is a key concept throughout statistics. For instance, as I may have mentioned before, many statistical tests assume that the cases used to generate the statistics were randomly sampled from the population of interest. That, of course, rarely happens in practice, but this is a key concept in what we call parametric tests - tests that compare to an assumed population distribution.

The reason for this focus on random sampling goes back to the nature of probability. Every case in the population of interest has a chance of being selected - an equal chance in simple random sampling, and unequal but still predictable chances when more complex sampling methods are used, like stratified random sampling. It's true that you could, by chance, draw a bunch of really extreme cases. But there are usually fewer cases in the extremes.

If you look at the normal distribution, for instance, there are so many more cases in the middle that you have a much higher chance of drawing cases that fall close to the middle. This means that, while your random sample may not have as much variance as the population of interest, your measures of central tendency should be pretty close to the underlying population values.

So we have a population, and we draw a random sample from it, hoping that probability will work in our favor and give us a sample data distribution that resembles that population distribution.
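A quick simulation shows the idea. Everything here is made up for illustration: the population is 100,000 hypothetical scores drawn from a normal distribution with a mean of 100 and a standard deviation of 15, and the sample size of 200 is arbitrary.

```python
import random
import statistics

rng = random.Random(2017)  # fixed seed so the sketch is repeatable

# Hypothetical population: 100,000 scores, roughly normal (mean 100, SD 15).
population = [rng.gauss(100, 15) for _ in range(100_000)]
pop_mean = statistics.fmean(population)
pop_sd = statistics.pstdev(population)

# Simple random sampling: every case has an equal chance of selection.
sample = rng.sample(population, 200)
sample_mean = statistics.fmean(sample)

# Share of sampled cases that fall within one SD of the population mean.
share_middle = sum(abs(x - pop_mean) <= pop_sd for x in sample) / len(sample)

print(f"Population mean: {pop_mean:.2f}")
print(f"Sample mean:     {sample_mean:.2f}")
print(f"Share of sample within 1 SD of the mean: {share_middle:.0%}")
```

Most of the sampled cases should come from the crowded middle of the distribution, and the sample mean should land close to the population mean even though only 200 of the 100,000 cases were drawn.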

But what if we wanted to add one more step, to really give probability a chance (pun intended) to work for us? Just as cases that are typical of the population are more likely to end up in our sample, cases that are typical of our sample distribution are more likely to end up in a sample of the sample (which we'll call a subsample for brevity's sake). And if we repeatedly drew subsamples and plotted the results, we could generate a distribution that gets a little closer to the underlying population distribution. Of course, we're limited by the size of our sample, in that our subsamples can't exceed that size, but we can bypass that by random sampling with replacement. That means that after pulling out a case and making a note of it, we put it back into the mix. It could get drawn again. This gives us a theoretically limitless sample from which to draw.

That's how bootstrapping works. Bootstrapping is a method of generating unbiased (though it's more accurate to say less biased) estimates. Those estimates could be things like variance or other descriptive statistics, or they could be used in inferential statistical analyses. Bootstrapping means that you use random sampling with replacement to estimate values. Frequently, it means using your observed data as a sort of population, and repeatedly drawing large samples with replacement from that data. In our Facebook use study, we used bootstrapping to test our hypothesis.
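Here is a minimal sketch of that resampling loop, using only Python's standard library. The ten observations are invented, the helper name bootstrap_ci is my own, and the percentile method shown is just one common way to turn bootstrap estimates into a confidence interval.

```python
import random
import statistics

def bootstrap_ci(data, stat, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for any statistic."""
    rng = random.Random(seed)
    n = len(data)
    # Each resample is the same size as the data and drawn with replacement,
    # so a given case can appear several times or not at all.
    estimates = sorted(stat(rng.choices(data, k=n)) for _ in range(n_boot))
    lower = estimates[int(n_boot * alpha / 2)]
    upper = estimates[int(n_boot * (1 - alpha / 2)) - 1]
    return lower, upper

# Made-up observations, for illustration only.
observed = [2, 4, 4, 5, 7, 9, 10, 12, 15, 21]

mean_ci = bootstrap_ci(observed, statistics.mean)
var_ci = bootstrap_ci(observed, statistics.pvariance)
print(f"Observed mean: {statistics.mean(observed):.2f}")
print(f"Bootstrap 95% CI for the mean:     {mean_ci[0]:.2f} to {mean_ci[1]:.2f}")
print(f"Bootstrap 95% CI for the variance: {var_ci[0]:.2f} to {var_ci[1]:.2f}")
```

Because the statistic is passed in as a function, the same loop works for a mean, a variance, or anything else you can compute from a list of cases.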

To summarize, we measured Facebook use among college students, and also gave them measures of rumination (tendency to fixate on negative feelings), and subjective well-being (life satisfaction, depression, and physical symptoms of ill health). We hypothesized that rumination mediated the effect of Facebook use on well-being. Put in plain language, we believed using Facebook made you more likely to ruminate, which in turn resulted in lower well-being.

The competing hypothesis is that people who already tend to ruminate use Facebook as an outlet for rumination, resulting in lower well-being. In this alternative hypothesis, Facebook is the mediator, not rumination.

Testing mediation means testing for an indirect effect. That is, the independent variable (Facebook use) affects the dependent variable (well-being) indirectly through the mediator (rumination). We used bootstrapping to estimate these indirect effects; we took 5000 random samples of our data, drawn with replacement, to generate our estimates. Just as we're more likely to draw cases typical of our sample (which are hopefully typical of our population), we're more likely to draw samples that (hopefully) have the typical effect of our population. The resulting indirect effects we get from bootstrapping won't be the same as a simple analysis of our observed data. We're using probability to reduce bias in our estimates.
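To make the procedure concrete, here is a rough sketch of a bootstrapped indirect effect. It is not the analysis or software from our study. The data are synthetic, with x standing in for Facebook use, m for rumination, and y for well-being, and the effect sizes (0.8 and -0.6) are arbitrary values I chose. The indirect effect is taken as the product a*b, where a is the slope of m on x and b is the slope of y on m controlling for x.

```python
import random
import statistics

def indirect_effect(x, m, y):
    """a*b: a = slope of m on x; b = slope of y on m, controlling for x."""
    n = len(x)
    mx, mm, my = statistics.fmean(x), statistics.fmean(m), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x) / n
    smm = sum((mi - mm) ** 2 for mi in m) / n
    sxm = sum((xi - mx) * (mi - mm) for xi, mi in zip(x, m)) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / n
    smy = sum((mi - mm) * (yi - my) for mi, yi in zip(m, y)) / n
    a = sxm / sxx
    b = (smy * sxx - sxy * sxm) / (sxx * smm - sxm ** 2)
    return a * b

def bootstrap_indirect(x, m, y, n_boot=5000, seed=0):
    """95% percentile bootstrap interval for the indirect effect."""
    rng = random.Random(seed)
    n = len(x)
    estimates = []
    for _ in range(n_boot):
        # Resample whole cases with replacement so each person's x, m, y stay together.
        idx = [rng.randrange(n) for _ in range(n)]
        estimates.append(indirect_effect([x[i] for i in idx],
                                         [m[i] for i in idx],
                                         [y[i] for i in idx]))
    estimates.sort()
    return estimates[int(0.025 * n_boot)], estimates[int(0.975 * n_boot) - 1]

# Synthetic data (NOT the study's data): x raises m, and m lowers y.
data_rng = random.Random(42)
n_cases = 300
x = [data_rng.gauss(0, 1) for _ in range(n_cases)]
m = [0.8 * xi + data_rng.gauss(0, 1) for xi in x]
y = [-0.6 * mi + data_rng.gauss(0, 1) for mi in m]

point = indirect_effect(x, m, y)
ci_low, ci_high = bootstrap_indirect(x, m, y, n_boot=1000)  # 1000 here to keep it quick
print(f"Indirect effect: {point:.3f}, 95% bootstrap CI: {ci_low:.3f} to {ci_high:.3f}")
```

If the interval excludes zero, that is usually read as support for an indirect effect. The function defaults to the 5000 resamples mentioned above, but the example call uses 1000 so it finishes quickly in plain Python.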

And what did we find in our Facebook study? We found stronger support for our hypothesis than the alternative. That is, we had stronger evidence that Facebook use leads to rumination than the alternative that rumination leads to Facebook use. If you're interested in finding out more, you can read the article here.

Friday, November 10, 2017

Amazon Editors' Top 100 Books

It's November, everyone is already talking about winter holidays, and I've already been forced to listen to hours of Christmas music. But one thing that I like about this time of year is the various year-end lists. As I mentioned before, Goodreads is having its members vote for the Best Books of 2017. I'll be sharing their final list when voting ends. In the meantime, here's the top 100 books of the year according to Amazon's editors:

Sadly, I've only read one of the books on the list: Norse Mythology, by Neil Gaiman. This makes me slightly sad, not only because I've clearly missed some great books published this year, but also because some of the wonderful books I read this year didn't make the cut. Where's Lesley Nneka Arimah's What It Means When a Man Falls From the Sky or Kate Moore's The Radium Girls?

Tuesday, November 7, 2017

Is DeVos On Her Way Out?

I've said a lot about my thoughts on Betsy DeVos before, so I won't repeat all of that. But officials believe she's going to resign:

Thomas Toch, director of independent education think tank FutureEd, told Politico that DeVos was ignorant of the job's constraints when she accepted it and insiders are already preparing for her to vacate the position.

DeVos was roundly criticized for her lack of basic knowledge about education policy during the confirmation hearing process. She blames President Donald Trump's transition team, claiming she was "undercoached."

Do you believe in miracles? Because I might now.

Seriously, though: anyone else find it ironic that an administration that claims it values meritocracy and hates "entitled snowflakes" is full of people who were handed money and power, and constantly blame other people for their own shortcomings and shitty performance?

Link Roundup

As I continue working on our content validation study, I have a bunch of links open that I'll read as a reward for finishing my next big task:

So good it really is illegal: Apparently Samuel Adams released a beer that costs $199 and is 28% ABV, making it illegal in 12 states. The beer, called Utopias (hmmm, wonder why?), is a mixture of various batches, some of which have been aged 24 years. The aging process is done in a variety of wooden barrels, including barrels for Bourbon, White Carcavelos, Ruby Port, Aquavit, and Moscat. The recommended serving size is 1 ounce.
Janelle Shane over at AIWeirdness (who gave us neural network paint names) is celebrating NaNoWriMo by using a neural network to generate some first lines for a potential novel. The results ranged from bizarre nonsense to strange poetry. Also, she's asking readers to share first lines, including their own first lines from novels they've written/are writing. Contribute using this form. I hit a bit of a wall in my own novel, and barely wrote this weekend. So on my train ride this morning, I started working on the outline I said I wasn't going to make. While I'd love to follow Stephen King's writing advice exactly, I'm just too much of a plantser.
Today's Google Doodle honors Pad Thai. And now I'm craving noodles.

Sunday, November 5, 2017

Statistics Sunday: Random versus Pseudo-Random

One of the key concepts behind statistics is the idea of "random" - random variables, random selection, random assignment (when we start getting into experimentation and the analyses that go along with that), even random effects. But as with likelihood, this is a term that gets thrown around a lot but not really discussed semantically.

Exacerbating the problem is that random is often used colloquially to mean something very different from its meaning in scientific and statistical applications. In layman's terms, we often use the word random to mean something was unexpected. Even I'm guilty of using the word "random" in this way - in fact, one of my favorite jokes to make is that something was "random, emphasis on the dom (dumb)."

But in statistics, we use random in very specific ways. When we refer to random variables, we mean something that is "free to vary." When we start talking about things like random selection, we usually mean that each case has an equal chance of being chosen, but even then, we mean that the selection process is free to vary. There is no set pattern, such as picking every third case. In either of these instances, the resulting random thing is not unexpected. We can quantify the probability of the different outcomes. But we're allowing things to vary as they will.
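To make the contrast concrete, here is a minimal sketch in R. It assumes a hypothetical list of 30 case IDs (the names cases, random_pick, and systematic_pick are invented for illustration) and compares random selection with a fixed every-third-case pattern:

```r
# Hypothetical sampling frame of 30 case IDs (illustrative only)
cases <- 1:30

# Random selection: every case has the same chance of being drawn,
# and the result is free to vary from run to run
random_pick <- sample(cases, size = 10)

# A set pattern for contrast: every third case, identical on every run
systematic_pick <- cases[seq(3, length(cases), by = 3)]

random_pick
systematic_pick
```

Running this more than once should change random_pick but never systematic_pick.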

There are a variety of instances of random processes, what we sometimes call stochastic processes. You may recall a previous blog post about random walks and martingales. Things like white noise and the behavior of the stock market are examples of stochastic processes. Even the simple act of flipping a coin multiple times is a sort of stochastic process. We can very easily quantify an outcome, or even a set of outcomes. But we allow each outcome to vary naturally.
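The coin-flip case can be sketched in a few lines of base R. This is only an illustration, assuming a fair coin and 20 flips (both numbers are arbitrary choices, not from the post): the probability of an outcome can be worked out in advance, while the flips themselves vary.

```r
# Simulate 20 flips of a fair coin (1 = heads, 0 = tails)
flips <- rbinom(20, size = 1, prob = 0.5)

# Quantify an outcome ahead of time: P(exactly 10 heads in 20 flips)
p_ten_heads <- dbinom(10, size = 20, prob = 0.5)

# Running count of heads, the path this particular run happened to take
running_heads <- cumsum(flips)

p_ten_heads     # about 0.176, the same every time
running_heads   # differs from run to run
```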

Unsurprisingly, people often use computers to generate random numbers for them. Computers are great for generating large sets of random numbers. In fact, I often use R to generate random datasets for me, and can set constraints on what I want that dataset to look like. For instance, let's say I want to generate a random dataset with two groups, experimental and control, and I want to ensure they have different means but similar standard deviation. I did something much like this when I demonstrated the t-test:

# Two groups of 100 normally distributed scores (SD = 15) with different means
experimental<-data.frame(group=1, score=rnorm(100,75.0,15.0))
control<-data.frame(group=2, score=rnorm(100,50.0,15.0))
full<-rbind(experimental,control)
library(psych)
describe(full)

This code gives me a dataset with 200 observations, 100 for each group. The experimental group is drawn from a distribution with a mean of 75, and the control group from one with a mean of 50, so the sample means will be approximately 75 and 50; both groups share a standard deviation of 15. The rnorm command tells R that I want a random, normally distributed dataset.
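As a quick check on the "approximately," the per-group statistics can be computed with base R alone. This sketch regenerates the data rather than reusing the objects above, and the names group_means and group_sds are introduced here for illustration:

```r
# Regenerate the two groups: 100 scores each, SD = 15, means of 75 and 50
experimental <- data.frame(group = 1, score = rnorm(100, mean = 75, sd = 15))
control <- data.frame(group = 2, score = rnorm(100, mean = 50, sd = 15))
full <- rbind(experimental, control)

# Sample mean and SD per group; expect values near, not equal to, the targets
group_means <- tapply(full$score, full$group, mean)
group_sds <- tapply(full$score, full$group, sd)

group_means
group_sds
```

Because no seed is set, the exact values will differ on every run, but they should land close to the means and standard deviation passed to rnorm.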

Based on the describe command, the overall dataset has a mean score of 63.18, a standard deviation of 19.94, skewness of 0.12 and kurtosis of -0.69. A random dataset, right?

But...

That's right, computers don't actually give you random numbers. They give you pseudo-random numbers: numbers that have been generated to mimic stochastic processes. For all intents and purposes, you can call them random. But technically, they aren't. Instead, the program uses a deterministic algorithm that, given the same starting point, always produces the same sequence of random-looking numbers.

But there is an upside to this. You can recreate any string of random numbers anytime you need to. You do this by setting your seed - telling the program to use a specific set of random numbers. This means that I can generate some random numbers and then later, recreate that same set. Let's test this out, shall we? First, I need to tell R to use a specific random number seed.

set.seed(35)

experimental<-data.frame(group=1, score=rnorm(100,75.0,15.0))
control<-data.frame(group=2, score=rnorm(100,50.0,15.0))
full<-rbind(experimental,control)
library(psych)
describe(full)

This generates a dataset with the following descriptive results: mean = 66.02, SD = 19.06, median = 66.43, min = 22.54, max = 125.07, skewness = 0.09, and kurtosis = -0.22.

Now, let's copy that exact code and run it again, only this time, we'll change the names, so we generate all new datasets.

set.seed(35)

experimental2<-data.frame(group=1, score=rnorm(100,75.0,15.0))
control2<-data.frame(group=2, score=rnorm(100,50.0,15.0))
full2<-rbind(experimental2,control2)
library(psych)
describe(full2)

And I get a dataset with the following descriptive results: mean = 66.02, SD = 19.06, median = 66.43, min = 22.54, max = 125.07, skewness = 0.09, and kurtosis = -0.22.

Everything is exactly the same. This is because I told it at the beginning to use the same seed as before, and by calling the set.seed command a second time, I also asked the program to go back to the beginning of that seed. As a result I can recreate my "randomly" generated dataset perfectly. But because I can recreate it every time, the numbers aren't actually free to vary, so they are not truly random.
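Rather than comparing descriptive statistics by eye, the same point can be checked directly with identical(). A minimal sketch, using the seed from above and three small vectors whose names (first, second, replay) are made up for this example:

```r
set.seed(35)
first <- rnorm(5)
second <- rnorm(5)   # continues the same stream, so it differs from first

set.seed(35)         # go back to the beginning of the stream
replay <- rnorm(5)

identical(first, replay)   # TRUE: same seed, same numbers
identical(first, second)   # FALSE: later draws from the same stream
```

The second comparison shows why the seed has to be set again before each replay: without it, R simply keeps drawing further along the same sequence.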

Friday, November 3, 2017

Give This Person an Award - Please

A Twitter employee celebrated the last day on the job... by deleting Donald Trump's Twitter account:

Donald Trump’s account disappeared for 11 glorious minutes yesterday, thanks to a Twitter customer service employee on their last day on the job.

Initially, Twitter claimed it was caused by “human error.”

But a couple hours later, the company clarified that it wasn’t a mistake at all.

“Through our investigation we have learned that this was done by a Twitter customer support employee who did this on the employee’s last day,” Twitter explained. “We are conducting a full internal review.”

Can someone find this mysterious former Twitter employee and give them a medal or something?

Thursday, November 2, 2017

Books, Books, and More Books

November is a great month for books. First of all, it's NaNoWriMo. (I may have mentioned that before.) I had a great first day, writing over 5000 words! And I've already met my goal for today (which I set much lower because I knew I'd have less free time), over my lunch break. I'm back up to a higher word count goal tomorrow, since I'll be traveling and will have some downtime at the airport and on the plane. You can track my progress through my word count widget, which is at the bottom of the right column (just under the archive listing).

November is also the month you can vote on the Best Books of 2017 on Goodreads! I was thrilled to see many of the books I read this year on the lists, which are broken down by genre. The first round of voting is going on now. There will be a semifinal round starting next Tuesday, and a final round the following Tuesday. So get in there and vote!

Finally, some awesome books are coming out this month, including a parody Donald Trump autobiography, authored by Alec Baldwin and Kurt Andersen:

Other books I absolutely will read are a forthcoming biography of Stevie Nicks (Gold Dust Woman by Stephen Davis), a book about jellyfish (Spineless: The Science of Jellyfish and the Art of Growing a Backbone by Juli Berwald), and a novel about a group in Vermont who attempt to secede from the United States (Radio Free Vermont: A Fable of Resistance by Bill McKibben).

Statistical Sins: Hello from the 'Other' Side

I'm currently analyzing data from our job analysis survey (in fact, look for a post in the near future about why it's important for psychometricians to remember how to solve systems of equations), and saved the analysis of demographics for last. Why? Because I'm fighting a battle with responses in the 'Other' category for some of these questions. I think I'm winning. Maybe.

I've blogged about survey design before. But I've never really discussed this concept of having an 'Other' category in your items. You assume when you write a survey that people will find a selection that matches their situation and for those few cases where no options match, you have 'Other' to capture those responses. You then ask people to specify why they are 'Other' so you can create additional categories to capture them in tables.

Or, you know, you could have what people actually do and end up with a bunch of people whose situation perfectly fits one of the existing categories but who instead select 'Other,' then write an almost word-for-word version of an existing category in the specify box. For instance, in one item on our survey, we had 36 people select 'Other'. After I had looked at their responses and placed the ones that fit an existing category into that category, I had 7 'Other's left.

Actually, that's the second best outcome to hope for when allowing for 'Other.' More often, you get vaguely worded responses that could fit in any number of existing categories, if you only had the proper details. For example, in another item on our survey, I have 17 'Others.' I have no idea where to put two-thirds of them because they lack enough detail for me to choose between 2-3 existing options.

Fortunately, those two items are the standouts, and for the remaining questions with 'Other' options, only 4-5 people selected them. Even for those few 'Other' responses that are gibberish, I'm not losing a lot of cases by calling them unclassifiable. But still, going through 'Other' responses is time consuming and requires a lot of judgement calls.
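Part of that cleanup can be scripted. The sketch below is not the process used for this survey; it assumes invented data, a hypothetical keyword lookup, and a made-up helper called recode_other. It moves 'Other' write-ins that clearly name an existing category into that category and leaves the vague ones for a human to judge:

```r
# Hypothetical survey responses; categories and write-in text are invented
responses <- data.frame(
  setting = c("Hospital", "Other", "Other", "Clinic", "Other"),
  other_text = c(NA, "hospital", "Works in a clinic ", NA, "varies"),
  stringsAsFactors = FALSE
)

# Hypothetical keyword-to-category lookup
lookup <- c(hospital = "Hospital", clinic = "Clinic")

# Recode 'Other' rows whose write-in text contains a known keyword
recode_other <- function(setting, other_text, lookup) {
  cleaned <- tolower(trimws(other_text))
  for (key in names(lookup)) {
    hit <- setting == "Other" & !is.na(cleaned) &
      grepl(key, cleaned, fixed = TRUE)
    setting[hit] <- lookup[[key]]
  }
  setting
}

responses$setting_recoded <- recode_other(
  responses$setting, responses$other_text, lookup
)
table(responses$setting_recoded)
```

In this toy data, two of the three 'Other' responses are recoded and "varies" stays as 'Other'. Keyword matching only handles the near word-for-word cases; the ambiguous responses still need a judgement call.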

Obviously, what you want to do is minimize the number of 'Other' responses from the beginning. I know this is far easier said than done. But there are some tricks.

Get experts involved in the development of your survey. And I don't just mean experts in survey design (yes, those too, but...); I mean people with expertise in the topic being surveyed. Ask them what terms they use to describe these categories. And ask them what terms their coworkers and subordinates use to describe these categories. Find potential responses that are widely used and as unambiguous as possible. You'll still have a few stragglers who don't know the terms you're using, but you'll hopefully minimize your stragglers.

If possible, pilot test your survey with people who work in the field. And if your survey is very complex, consider doing cognitive interviews (look for a future blog post on that).

Find a balance in the number of options. What this really comes down to is balancing cognitive effort. You want to have enough to cover relevant situations, because that requires less cognitive effort from you when analyzing your data. You just run your descriptives and away you go.

But you also need to minimize response options to a number people can hold in memory at one time. The question above with the 17 other responses was also the question with the most response options. More isn't always better. Sometimes it's worse. I think for this item, we just got greedy about how much information and delineation we wanted in our responses. But if your response options become TL;DR, you'll get people skipping right to 'Other' because that requires less cognitive effort from them.

Balancing cognitive effort won't be 50/50. Someone is always going to pay with more cognitive effort than they'd like to exert and that someone should almost always be you. If you instead make your respondents pay with more effort than they'd like to exert, you end up with junk data or no data (because people stop completing the survey).

And of course, decide whether you care about other at all. If there are only a fixed number of situations and you think you have all of them addressed, you could try just dropping the other category altogether. Know that you'll probably have people skip the question as a result, if their situation isn't addressed. But if you only care to know if X, Y, or Z situations apply to respondents, that might be okay. This comes down to knowing your goal for every question you ask. If you're not really sure what the goal is of a question, maybe you don't need that question at all. As with number of response options, you also want to minimize the number of questions, by dropping any items that aren't essential. It's better to have 100 complete responses on fewer questions than 25 complete and 75 partial responses on more questions.