Sunday, June 25, 2017

Statistics Sunday: Chi-Square - ANOVA for Proportions

Back in May, I blogged about the Analysis of Variance (ANOVA). This test is used when you have 3 or more means and tells you if at least one is significantly different from the expected value, the overall (or grand) mean. But many of the tests I've blogged about so far are only used when your dependent variable is continuous. What if you have an outcome variable that is categorical or ordinal?

For example, your dependent variable might be a two-level outcome - such as pass or fail, survived or didn't survive, Coke or Pepsi. In my research methods course, I would bring in photocopies of old yearbooks, and we would do a smiling study. First, we had to create a good operational definition for smile, to make sure we were coding consistently. After all, people have very different personal definitions and continually disagree on what is and is not a smile.


We would then go through the yearbook pages and code whether each person was smiling. We'd generate hypotheses about whether men or women would be more likely to smile, or about differences in smiling by grade. But at the end of the coding, we had a bunch of binary data - frequencies and proportions. What statistical test can we use to test our hypotheses?

Remember that when we have continuous outcomes, the mean is our expected value. And when we conduct ANOVA, our expected value is the grand mean. But when we have binary outcomes (or even a multi-level outcome where the mean would be meaningless - pun fully intended), we have to use a different expected value. We use how we would expect our frequencies to fall in the various groups by chance alone - that is, if there is no relationship between the groups and the outcome.

Let's use our smiling study as an example, and we'll test the hypothesis that women are more likely to smile than men. This gives us a simple chi-square between two groups with a binary outcome. The table we get as a result is called a 2 x 2 contingency table. Say that we went through and coded all 1,000 students in a high school - freshmen through seniors - and found that overall, 700 of them were smiling and 300 were not, or as percentages, 70% are smiling and 30% are not. These are our expected values for our groups. If there is no relationship between gender and smiling, we would expect 70% of men and 70% of women to smile.
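
To make the idea of expected values concrete, here is a minimal sketch in R of how the expected counts could be worked out from the totals alone. It assumes the 1,000 students split evenly into 500 women and 500 men (the split used in the simulation later in this post); the object names are just illustrative.

```r
# Assumed margins for illustration: 500 women and 500 men,
# with 700 students smiling and 300 not smiling overall.
row_totals <- c(Female = 500, Male = 500)
col_totals <- c(not_smiling = 300, smiling = 700)
n <- sum(row_totals)

# Expected count for each cell = row total * column total / grand total
expected <- outer(row_totals, col_totals) / n
expected
```

Under these assumptions, each gender is expected to have 350 smiling and 150 non-smiling students, which is simply the overall 70/30 split applied to each group.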

We compare these expected values to our observed values. This is the part that looks very much like ANOVA: we subtract each expected value from its respective observed value, then square that difference, because the deviations will be both positive and negative and we don't want them to cancel out. Each squared deviation is divided by its expected value, and the results are added together. This gives you your chi-square. Like the F statistic in ANOVA, it can never be negative, and theoretically it has no upper limit. The test statistic has an associated p-value, which you would once again compare to alpha to determine whether the difference is large enough to conclude that there is a relationship between gender and smiling.
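
The arithmetic just described can be sketched by hand in R. The counts below are hypothetical, chosen so that 75% of 500 women and 65% of 500 men are smiling (70% overall); they are not the simulated data used later, and the object names are only for illustration.

```r
# Hypothetical observed counts (assumed for illustration):
# 500 women with 75% smiling, 500 men with 65% smiling.
# Columns: 0 = not smiling, 1 = smiling.
observed <- matrix(c(125, 375,
                     175, 325),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(gender = c("Female", "Male"),
                                   smile  = c("0", "1")))

# Expected counts if gender and smiling are unrelated
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Squared deviations, each divided by its expected value, then summed
chi_sq <- sum((observed - expected)^2 / expected)

# A 2 x 2 table has 1 degree of freedom
p_value <- pchisq(chi_sq, df = 1, lower.tail = FALSE)

chi_sq
p_value
```

With these made-up counts, every cell is off from its expected value by 25 students, and the chi-square works out to roughly 11.9.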

The chi-square test is also sometimes called a test of independence - that's because it is testing whether the group and outcome variables are independent of each other, meaning not related. As I said above, let's say in our study example 70% of people are smiling. Let's also say that we found that 75% of women and 65% of men were smiling. Are those values different enough from 70% to say there is a gender difference? Let's find out, using R! First, I generated data to match the specifications and turned that into a data frame I can analyze (the very first command suppresses scientific notation, so p-values are easier to read; this is one of the first lines of code I include in any R script I write):
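# Suppress scientific notation so p-values are easier to read
options(scipen=999999)

# Simulate 500 women (75% smiling) and 500 men (65% smiling); 1 = smiling, 0 = not.
# The two probabilities must sum to 1; otherwise sample() silently rescales them.
women<-sample(0:1, 500, replace=TRUE, prob=c(0.25,0.75))
men<-sample(0:1, 500, replace=TRUE, prob=c(0.35,0.65))

# Gender labels, one per simulated student
female<-rep("Female",each=500)
male<-rep("Male",each=500)

# Stack the two groups and combine them into one data frame
smile<-c(women,men)
gender<-c(female,male)

smile_data<-data.frame(gender=gender, smile=smile)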
Now, we'll create a table of our results - this is often called a cross-table (crosstabs or xtabs for short); the variable listed first will be displayed in rows and the variable listed second in columns. Because the data are sampled at random, your counts will differ somewhat from the ones shown here:
mytable<-xtabs(~gender+smile, data=smile_data)
mytable
##         smile
## gender     0   1
##   Female 117 383
##   Male   224 276
Finally, we'll run a chi-square, which is really easy to do in base R. We just request a summary of the table object we created:
summary(mytable)
## Call: xtabs(formula = ~gender + smile, data = smile_data)
## Number of cases in table: 1000 
## Number of factors: 2 
## Test for independence of all factors:
##  Chisq = 31.444, df = 1, p-value = 0.00000002053
As you can see, the p-value is very small, much smaller than 0.05. So we would conclude from these data that there is a gender difference: women are more likely to smile in yearbook photos than men.
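
If you would rather use a dedicated function, base R also has chisq.test(). One thing to watch for: summary() on a table reports the Pearson chi-square without a continuity correction, while chisq.test() applies Yates' continuity correction by default on a 2 x 2 table, so the two will not match unless you set correct = FALSE. Here is a minimal sketch using hypothetical counts (75% of 500 women and 65% of 500 men smiling), not the simulated data above:

```r
# Hypothetical counts (assumed for illustration):
# 500 women with 75% smiling, 500 men with 65% smiling.
# Columns: 0 = not smiling, 1 = smiling.
counts <- matrix(c(125, 375,
                   175, 325),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(gender = c("Female", "Male"),
                                 smile  = c("0", "1")))

# Pearson chi-square without continuity correction
uncorrected <- chisq.test(counts, correct = FALSE)

# Default for a 2 x 2 table: Yates' continuity correction is applied
corrected <- chisq.test(counts)

uncorrected$statistic
corrected$statistic
```

The corrected statistic is a little smaller, because the correction shrinks each deviation by 0.5 before squaring.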

That's all for now! In a future post, I plan to explain the concept of degrees of freedom - as you've seen, this is relevant in the different statistical tests we've covered thus far. And if there are any other statistics topics you'd like me to cover, let me know in the comments below!

*Edit: There was an error in my code, where I accidentally switched the 0 and 1 coding. This has been fixed - apologies!
