Sunday, January 7, 2018

Statistics Sunday: On Birthdays, and One-Tailed and Two-Tailed Tests

It's my birthday today! I spent last night celebrating with friends. Today I'll be relaxing, reading, and writing, plus dinner at one of my favorite restaurants. Obviously, birthdays have some important implications:


For today's Statistics Sunday post, I wanted to revisit a topic I touched on briefly in my post on null and alternative hypotheses, using our silly hypothesis above as an example. We could update this hypothesis to be something about having more birthday parties, but obviously, there would still be a third variable issue here.

Whenever you set out to conduct statistical analyses, you have hypotheses you want to test. While I would argue that you should focus on your research hypotheses rather than actively writing out null and alternative hypotheses, it's still worth being clear about exactly what you're testing and how you'll know you've found the expected effect. And that's one way your statistical hypotheses (null and alternative) can help guide you.

Those statistical hypotheses, among other things, allow you to derive your critical value - the cut-off between significant and non-significant - which is based on your selected alpha. When you're testing for a difference that could go in either direction, you split your alpha in two, so your cut-off falls in both tails of the normal distribution.

But if I wanted to test the hypothesis above - that people who have had more birthday parties live longer - it doesn't make sense to test for just any difference. It makes sense to test in a specific direction. In this case, you can maximize the chance that you'll find a significant effect by putting all of your alpha in one tail.

If you did a really simple study, where you recruited people who had never had a birthday party and people who celebrated their birthdays, then tracked how long they lived, you would set your cut-off value so that all of your alpha is in the upper tail of your distribution - and you would only reject if your birthday partiers were significantly older at their time of death than your birthday non-partiers. Your cut-off value will be lower as a result, making it easier to reject the null hypothesis, but you also only get to reject the null hypothesis if the difference is in the expected direction (that is, the partiers were significantly older rather than significantly younger).
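To make those cut-offs concrete, here's a quick sketch in R of the standard normal critical values for an alpha of .05 (the same logic applies to t-tests; you'd just use qt with your degrees of freedom):

qnorm(1 - .05/2)   # two-tailed cut-off: about 1.96, with alpha split between the two tails
qnorm(1 - .05)     # one-tailed cut-off: about 1.64, with all of alpha in the upper tail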

This concept is taught in statistics courses, and students are often tested on whether they understand one-tailed versus two-tailed tests. The thing is, I rarely see one-tailed tests actually used in journal articles. Not every statistical analysis allows for a one- versus two-tailed distinction, and it wouldn't make sense for more complex analyses. But it's surprising that when people are running simple t-tests, which I do see in journal articles, there doesn't seem to be any consideration of the directionality of their hypotheses.

What do you think? Have you encountered one- and two-tailed tests outside of statistics courses?

Friday, January 5, 2018

In Its Prime

What do you get when you take 2 to the power of 77,232,917 and subtract one?

The largest prime number ever discovered.
The number belongs to a rare group of so-called Mersenne prime numbers, named after the 17th century French monk Marin Mersenne. Like any prime number, a Mersenne prime is divisible only by itself and one, but is derived by multiplying twos together over and over before taking away one. The previous record-holding number was the 49th Mersenne prime ever found, making the new one the 50th.
And just like the Psychological Science Accelerator I blogged about earlier today, it's the product of teamwork:
The new prime number was originally found on Boxing Day by the Great Internet Mersenne Prime Search (Gimps) collaboration which harnesses the number-crunching power of volunteers’ computers all over the world. In the days after, four more computers sporting different hardware and software were set the task of verifying the discovery. Those computers confirmed the result, taking between 34 and 82 hours each.

To find M77232917 in the first place took six full days of nonstop computing on a PC owned by Jonathan Pace, a 51-year old electrical engineer from Germantown, Tennessee. It is the first prime that Pace’s computer has churned out in 14 years on the Gimps project. He is now eligible for a $3,000 award. 
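If you're wondering how anyone verifies that a number this size is prime, Mersenne candidates are checked with the Lucas-Lehmer test. Here's a toy sketch of it in R - it only works for small exponents, since ordinary R numbers can't come anywhere close to holding 2^77,232,917 (GIMPS relies on specialized arbitrary-precision software):

# Lucas-Lehmer: for an odd prime p, 2^p - 1 is prime iff s ends at 0 after p - 2 squarings
lucas_lehmer <- function(p) {
  m <- 2^p - 1
  s <- 4
  for (i in seq_len(p - 2)) s <- (s * s - 2) %% m
  s == 0
}
sapply(c(3, 5, 7, 11, 13), lucas_lehmer)   # 2^11 - 1 = 2047 = 23 * 89, so that one comes back FALSE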

Teamwork and the Reproducibility Problem

It has been known for some time that psychology has a reproducibility problem, though we may not always agree on how to handle or discuss these issues. I remember chatting with another researcher at a conference shortly after I finished my master's thesis on stereotype threat and its impact on math performance in women. I had failed to replicate stereotype threat effects in my study. She, on the other hand, said her effects were incredibly strong; she described a participant experiencing a panic attack when she was told she had to do math problems, and had even noticed her female participants' math performance was negatively affected when her research assistant had been knitting during the session. (I also remember a reviewer telling me I must have performed the study poorly, not because the reviewer found any flaws in my methods, but because I had failed to reproduce the stereotype threat effects in my research.)

Efforts to handle this crisis thus far have included making psychological research more transparent and conducting large-scale meta-analyses. And a new effort is already underway to harness the power of multiple research labs across the world: the Psychological Science Accelerator. Christie Aschwanden of FiveThirtyEight has more:
[Psychologist Christopher] Chartier, a researcher at Ashland University, doesn’t think massively scaled group projects should only be the domain of physicists. So he’s starting the “Psychological Science Accelerator,” which has a simple idea behind it: Psychological studies will take place simultaneously at multiple labs around the globe. Through these collaborations, the research will produce much bigger data sets with a far more diverse pool of study subjects than if it were done in just one place.

The accelerator approach eliminates two problems that can contribute to psychology’s much-discussed reproducibility problem, the finding that some studies aren’t replicated in subsequent studies. It removes both small sample sizes and the so-called weird samples problem, which is what happens when studies rely on a very particular population — like relatively wealthy college students from Western countries — that may not represent the world at large.

So far, the project has enlisted 183 labs on six continents. The idea is to create a standing network of researchers who are available to consider and potentially take part in study proposals, Chartier said. Not every lab has to participate in any given study, but having so many teams in the network ensures that approved studies will have multiple labs conducting their research.
According to the blog, the Psychological Science Accelerator is taking on its second study, this one on gendered social category representation. And if you're attending the Association for Psychological Science meeting in May, you can check out a symposium on "Large Scale Research Collaborations: Applications in Crowd-Sourcing and Undergraduate Research Experience, Replications, and Cross-Cultural Research." (Day and time TBD - APS is still finalizing the program, and is still accepting poster submissions through the end of this month.)

Thursday, January 4, 2018

David Bowie's Top 100 Books

I just learned today that David Bowie's son, Duncan Jones, is starting a book club so people can read through David Bowie's Top 100 books. Nerdist has more:
Now we’ve learned, via Consequence of Sound, that in admiration for the rock icon’s favorite books, his son, director Duncan Jones, has launched the “David Bowie Book Club.”
The book club will stick to Bowie's 100 favorite books from the previously mentioned list, and the first book to be discussed will be the 1985 novel Hawksmoor by Peter Ackroyd, an author who Jones says was one of his father's "true loves." If you want to follow along with the book club, the deadline to finish the 288 page novel is February 1, so you better get to your local library or hit up your local bookseller or Amazon soon.
In the meantime, I took a quick inventory of all the books I own that have languished on my to-read pile(s) and also just received a literal box o' books from my parents for my birthday (in addition to all the books I received for Christmas). I'm probably good for a while. But I may try to get to Hawksmoor before February 1. Thanks to my no-book-buying resolution, I'll have to check out my local library.

Wednesday, January 3, 2018

Statistical Sins: Junk Science

This isn't exactly a statistical sin, but it's probably one of the worst sins against science - buying into garbage that, even worse than being of no help, might actually kill you. It's a sign of how comfortable many in our society have become, being free from worry about life-threatening illnesses, that they begin to wonder if the things that are keeping us alive and healthy are of any use at all.

We've seen this happening for a while with vaccinations. And now, it's happening with water:
In San Francisco, "unfiltered, untreated, unsterilized spring water" from a company called Live Water is selling for up to $61 for a 2.5-gallon jug — and it's flying off the shelves, The New York Times reported.

Startups dedicated to untreated water are also gaining steam. Zero Mass Water, which allows people to collect water from the atmosphere near their homes, has already raised $24 million in venture capital, the report says.

However, food-safety experts say there is no evidence that untreated water is better for you. In fact, they say that drinking untreated water could be dangerous.

"Almost everything conceivable that can make you sick can be found in water," one such expert, Bill Marler, told Business Insider. That includes bacteria that can cause diseases or infections such as cholera, E. coli, hepatitis A, and giardia.
In a world where 884 million people do not have access to clean water, rich people in California (and elsewhere) are paying hundreds of dollars for water that could make them sick or even kill them. Perhaps the most telling quote from the article is this one, from Bill Marler:
"You can't stop consenting adults from being stupid," Marler said. "But we should at least try."
In fact, there are a variety of explanations for why people might buy into such junk science: the comfort of never having to worry about a cholera epidemic or see firsthand the complications of polio, the use of vague euphemisms like "raw water," and the fact that a high price tag can itself signal quality to buyers. I remember hearing a story (possibly apocryphal) about Häagen-Dazs: the ice cream originally sold for less under a more generic name, but when the company switched to the foreign-sounding Häagen-Dazs and raised the price, it started flying off the shelves. Obviously, getting a celebrity or someone of influence on board can also help something take off.

Still, it's fascinating to me how some of this junk science proliferates. The pattern of diffusion for innovations is well-known, and while we know that not every innovation will take off (like the Dvorak keyboard), innovations are by their nature things that make our lives better or easier, and the ones that take off are likely the ones with the best marketing. But junk science is absolutely not an innovation; in some cases, it makes our lives worse or harder. How, then, do we explain some of the nonsense that continues to influence people's behavior? What sorts of outcomes does it take before people see the error of their ways?

Tuesday, January 2, 2018

Data Analysis of My 2017 Reading

This year, I'd like to do more data analysis of things that interest me. While some of it might be publishable in actual journals, a lot of it will be fluffy stuff best suited for my blog. Today, some of that fluffy analysis using the books I read in 2017.

Thanks to Goodreads, I have a lot of data available on my reading habits - when I started and finished each book, how I rated each book (and how that compares to the average Goodreads rating), and what genres I tend to gravitate toward. I pulled my Goodreads data into an Excel spreadsheet, which included the following variables:
  • Book title, string
  • Author, string
  • Goodreads Rating (Average of every rating for this book), numeric from 1 to 5
  • My Rating, integer from 1 to 5
  • Start Date
  • Finish Date
I then calculated/created the following variables:
  • A rounded version of the Goodreads rating, since I can only rate 1 to 5
  • A leniency indicator, my rating minus the rounded rating
  • Number of pages (pulled from Amazon)
  • Read time, the number of days it took me to finish
  • Title length, the number of characters in the title
  • Wait, the number of days between finishing one book and starting another; NA if I was reading 2 books at once and was still actively reading another book when I finished one
  • An indicator of author gender
  • A fiction indicator
  • Indicators for the following genres:
    • Biography & Memoir
    • Children's Fiction
    • Comedy
    • Economics
    • Experimental Fiction
    • Fantasy
    • History
    • Horror
    • Literary Fiction
    • Math & Statistics
    • Mystery
    • Programming
    • Science
    • Science Fiction
    • Writing
    • YA Fiction
First, I did some simple counts of genres using a Pivot Table and created a bar chart:


As the chart shows, my favorite genre appears to be fantasy (I read 19 books in this genre) followed by math and statistics (11 books). I was surprised at the number of young adult fiction books, because I only picked up one book from that section in the bookstore; the rest I found in general fiction or on a fiction display table. I only found out later (i.e., when I was pulling together this dataset) that they were considered YA fiction. I should note that the counts in the graph above add up to more than 53, because books could belong to more than 1 genre. For instance, last year I read Storm Front, book 1 of the Dresden Files series by Jim Butcher, which qualified as both fantasy (the main character is a wizard) and mystery (he's also a private investigator who solves weird cases often involving the supernatural). On the nonfiction front, The River of Doubt: Theodore Roosevelt's Darkest Journey by Candice Millard qualified as both biography & memoir and history.
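(For what it's worth, once the data are in R - which I get to below - the same counts are a one-liner: just sum each 0/1 genre column. A quick sketch, using the indicator column names from my list above:)

genre_cols <- c("Fantasy", "Math_Stats", "YA_Fic")   # ...plus the rest of the genre indicator columns
sort(colSums(books[, genre_cols]), decreasing = TRUE)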

I also learned that I read more books by men than women; nearly 70% of the books I read were by men. I was more evenly split on fiction versus nonfiction; 56.6% were fiction. Additionally, my average rating was 4.09, and my average leniency (my rating minus the rounded-off Goodreads rating) was only 0.09 - on average, I rate books just slightly higher than the Goodreads consensus.

All interesting stuff, but let's do some higher level analysis. For that, I brought in a much better tool than Excel: R. I saved my Excel file into a CSV file and pulled it into R. 
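The import itself is a one-liner. (The file name below is just a placeholder, and Title, Start_Date, and Finish_Date stand in for whatever I actually named those columns.)

books <- read.csv("books_2017.csv", stringsAsFactors = FALSE)   # placeholder file name
# a couple of the derived variables are also easy to recompute in R (assuming ISO-formatted dates):
books$Title_Length <- nchar(books$Title)
books$Read_Time <- as.numeric(as.Date(books$Finish_Date) - as.Date(books$Start_Date))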

First, let's see what kinds of variables predict the rating I give a book. I only have 53 observations, so I need to be choosy about the variables I include in a linear model. It looks like I'm pretty much in line with the average Goodreads rating, so I'll leave that variable out; it doesn't tell me much. 

Here are some variables I think would impact my rating: number of pages, reading time, author gender, and whether the book was fiction. I'd like to include genre, but I don't want to include all 16; I'll include the top 3 genres, which would be fantasy, math & statistics, and YA fiction. But before I get started with my linear model, I want to examine correlations between some of these variables. For instance, number of pages and reading time are likely correlated with each other. I want to make sure they're not so highly correlated that one could be a proxy for another. 

cor(books$Pages,books$Read_Time)
0.4024597

The correlation between these two variables is high, but I could probably get away with including both in the model, since there's only about 16% shared variance between the two. However, I should note one issue with the reading time variable. Two books were read during November, when I was spending most of my time writing, so these two books have very high reading time values: 17 days (started in October and finished in November) and 21 days (started in November and finished in December). Also, I have one other book that took 16 days to finish, but this was a course textbook, and I read the book following the course schedule. The longest read time not impacted by course schedule or NaNoWriMo is 15 days. So I could try running a correlation between these two variables without those three books with weird values. I can do this pretty easily with a filter.

library(dplyr)
# drop the 3 books whose read times were inflated by NaNoWriMo or a course schedule
filtered <- filter(books, Read_Time < 16)

This gives me a separate dataset that drops those 3 values. If I run my correlation on this dataset, I get an even lower correlation, 0.31. I expected this correlation to go up, not down, without these 3 cases. So in either case, it seems like including both number of pages and reading time in the same linear model is not going to cause any issues.
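(That re-run is just the same call on the filtered data:)

cor(filtered$Pages, filtered$Read_Time)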

Just for fun, I also produced histograms for those two variables. Number of pages was approximately normal, but reading time was positively skewed. (I include code for one of the histograms below, to show how it's done; as you can see, the code is pretty easy.)

hist(books$Pages, main="Book Length", xlab="Number of Pages")



I also looked at relationships between two of my genres, fantasy and YA fiction, since I remembered that many books fell in both. Turns out, all of the YA fiction books were also fantasy.

table(books$Fantasy,books$YA_Fic)

     0  1
  0 34  0
  1 13  6

I could still include both, and see what additional impact YA fiction as a genre has on ratings. But I could also see some strange results. Let's run a linear model, using My Rating as the y-variable, and number of pages, reading time, author gender, and indicators for fiction and my top 3 genres (fantasy, math & statistics, YA fiction) as predictors. (And then I'll run it once more without YA fiction.)

myrating<-lm(My_Rating ~ Pages + Read_Time + Gender + Fiction + Fantasy + Math_Stats + YA_Fic, data=books)
summary(myrating)

And here's a screenshot of my results:



So number of pages was significant, even controlling for reading time; on average, longer books got higher ratings from me. Gender of the author had no impact on ratings. While fiction books didn't necessarily get higher ratings than non-fiction books, I tended to give fantasy books higher ratings, and young adult fiction lower ratings. When I ran the model again without YA fiction, the fantasy effect disappeared. It appears I judge YA fantasy more harshly than other fantasy. Perhaps this year, I'll try reading some YA fiction that isn't fantasy and see what happens.
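(For reference, that second model is just the first one with the YA fiction term dropped:)

myrating2 <- update(myrating, . ~ . - YA_Fic)
summary(myrating2)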

Some other fun factoids from this dataset: the book with the longest title, The Theory That Would Not Die: How Bayes' Rule Cracked the Enigma Code, Hunted Down Russian Submarines, and Emerged Triumphant from Two Centuries of Controversy by Sharon Bertsch McGrayne (161 characters), was read right before the book with the shortest title, Theatre by David Mamet (7 characters). I read 6 books by Neil Gaiman (and 1 co-written by him), the most of any author.

What do you think? Any other relationships I should explore?

Monday, January 1, 2018

2018 Goals

Happy New Year! I had a great New Year's Eve, attending a party with friends, and have been taking it easy today. But I've been thinking about my goals for the year, and wanted to share for accountability purposes.

First off, I want to add one quick update to my 2017 Year in Review - I finished book 53 before heading out to the party last night, making my page count 17,194. I'm sure there were years as a kid when I read more than that, but this is my highest count since I started tracking.

And now for my goals!
  1. Read at least 48 books - this is the same goal I set last year. I commute by train now, and I always have a book with me in case of unexpected downtime, so this seems to be an easy goal for me.
  2. Relatedly, I have a huge stack of to-read books, so I'm making a resolution that I can't buy any books this year. Instead, I need to read all of the books on my to-read shelves (yes, there are multiple). I am allowed to borrow books, from the library or friends, and I can receive books as gifts, but no purchases. I have a feeling this one is going to be very hard.
  3. Write at least 12 short stories - Ray Bradbury recommends writing one per week, but with a full-time job, multiple hobbies, and a social life, that's going to be difficult. I think one per month is a good goal, and I can always exceed it if I'm feeling particularly inspired.
  4. Make sure I always post my weekly statistics posts: Statistics Sunday and Statistical Sins. For that reason, I'm not going to make the goal of writing 1 post per day. I'm happy with 3-4 posts per week, and once again, I can exceed that if I'm feeling particularly inspired.
  5. Build up more of an online presence for Deeply Trivial, including Twitter and Facebook. I just need to finish something up first - stay tuned!
  6. Visit a new state - I've visited 35 of them, so I want to bring that total to 36 by the end of 2018! Most of the states I have left are on the East Coast or Northern Midwest, so I have some easy ones I can hit on a road trip. But who knows? Maybe I'll spoil myself with a trip to Hawaii this year instead.
  7. I always make a goal to eat healthier and get in better shape, but I plan on really putting some effort into it this year. (My weight has been creeping up and I'm not happy about it.) I already go to a dance class once a week, so I think I can add a goal of getting a workout in 1-2 more times per week.
  8. Finish my book! I've been working on the book I wrote for 2016 NaNoWriMo - I still have one subplot to wrap up, and a few more scenes to write.
You'll notice I'm calling them goals rather than resolutions (well, except #2). I've blogged previously about the problem with resolutions. I called them resolutions last year, but after putting more thought into it recently, I think "goals" is more true to my approach. But if you insist...