## Friday, April 21, 2017

### Bonus Post: Explained Variance and a Power Analysis in Action

In my beta post, I talked about power analysis and how I approach it when there are no previous studies to tell me what size of effect to expect. For instance, I referenced my study on Facebook use and health outcomes among college students. When I conducted the study (Fall 2011), there wasn't much published research on Facebook effects. Instead, I identified the smallest effect I was interested in seeing - that is, the smallest effect that would be meaningful.

I used an analysis technique called multiple linear regression, which produces an equation to predict a single dependent variable. Multiple refers to the number of predictor variables being used to predict the dependent variable. And linear means that I expected a consistent positive or negative relationship between each predictor and the outcome. You probably remember working with linear equations in math class:

y = ax + b

where y is the variable you're predicting, a is the slope (how much y changes for each 1 unit change in x), and b is the constant (the value of y when x is 0). (You might have instead learned it as y = mx + b, but same thing.) That's what a regression equation looks like. When there's more than one predictor, you add in more "a*x" terms: a1x1, a2x2, etc.
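In code, a two-predictor version of that equation is just a weighted sum plus a constant (the coefficients below are made-up numbers for illustration):

```python
def predict(x1, x2, a1=0.8, a2=-0.5, b=2.0):
    """y = a1*x1 + a2*x2 + b: each a is a slope, b is the constant."""
    return a1 * x1 + a2 * x2 + b

predict(0.0, 0.0)  # both predictors at zero, so y is just the constant: 2.0
predict(1.0, 1.0)  # 0.8 - 0.5 + 2.0 = 2.3
```

A regression analysis estimates the a's and b from data; once you have them, prediction is nothing more than this arithmetic.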

When you conduct a regression, one piece of information you get is R-squared. This month, I've talked about how statistics is about explaining variance. Your goal is to move as much of the variance from the error (unexplained) column into the systematic (explained) column. Since you know what the total variance is (because it's a descriptive statistic - something you can quantify), when you move some of the variance over to the explained column, you can figure out what proportion of the variance is explained. You just divide the amount of variance you could explain by the total variance. R-squared is that proportion - it is the amount of variance in your dependent variable that can be explained by where people were on the predictor variable(s).
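That explained-over-total arithmetic can be sketched in a few lines of Python (the data here are simulated purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical data: 100 people, two predictors, one outcome
x = rng.normal(size=(100, 2))
y = 0.5 * x[:, 0] - 0.3 * x[:, 1] + rng.normal(size=100)

# least-squares fit with an intercept column
X = np.column_stack([np.ones(len(y)), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta

ss_total = np.sum((y - y.mean()) ** 2)         # total variance (systematic + error)
ss_error = np.sum((y - y_hat) ** 2)            # unexplained (error) variance
r_squared = (ss_total - ss_error) / ss_total   # proportion explained
```

The last line is the whole idea: whatever variance the prediction equation moves out of the error column, divided by the total.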

By the way, R-squared is based on correlation. For a single predictor variable, R-squared is simply the squared correlation between x and y. For multiple predictor variables, R-squared is the squared correlation of all the x's with y, after the overlap between/among the predictors (the correlation among the x's) has been removed.
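You can verify the single-predictor case numerically - the R-squared from a one-predictor regression matches the squared Pearson correlation (again, simulated data just for the demonstration):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(size=200)

# Pearson correlation between x and y
r = np.corrcoef(x, y)[0, 1]

# R-squared from a one-predictor least-squares regression
X = np.column_stack([np.ones(len(y)), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
r_squared = 1 - resid @ resid / np.sum((y - y.mean()) ** 2)

print(np.isclose(r_squared, r ** 2))  # → True
```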

My main predictor variable was how people used Facebook (to fixate on negative events or to celebrate positive events - so there were actually two predictor variables). The outcomes were health measures. The other predictor variables were control variables - variables I thought would affect the outcomes beyond Facebook use, such as gender, race, ethnicity, and so on.

For my power analysis prior to conducting my Facebook study, I examined how many people I would need to detect an R-squared of 0.05 or greater (up to 0.50 - and I knew it was unlikely I'd find an R-squared that high). I also built the following assumptions into the power analysis: alpha of 0.05 (Type I error rate), power of at least 0.80 (so beta, the Type II error rate, of 0.20 or less), and control variables explaining about 0.25 of the variance. Using a program called PASS (Power Analysis and Sample Size), I generated a table and a graph of target sample sizes for each R-squared from 0.05 to 0.50.

For the smallest expected R-squared (0.05), I would have needed 139 people in my study to have adequate power - that is, for an R-squared that small to be significant (unlikely to have occurred by chance). The curve flattens out around an R-squared of 0.25, beyond which a larger R-squared doesn't change how many people you need by much.
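This kind of calculation can be reproduced in scipy without PASS. The sketch below is an approximation under stated assumptions: two tested predictors, six control variables (the actual count isn't given above), and Cohen's convention for the noncentrality parameter - so the result will land near 139, not necessarily exactly on it:

```python
from scipy import stats

def power_for_n(n, r2_tested, r2_controls, n_tested, n_controls, alpha=0.05):
    """Power of the F test that the tested predictors add r2_tested
    variance beyond controls that already explain r2_controls."""
    df1 = n_tested                        # numerator df: predictors being tested
    df2 = n - n_tested - n_controls - 1   # denominator df
    if df2 <= 0:
        return 0.0
    f2 = r2_tested / (1 - r2_tested - r2_controls)  # Cohen's f-squared
    lam = f2 * (df1 + df2 + 1)            # noncentrality, Cohen's convention
    f_crit = stats.f.ppf(1 - alpha, df1, df2)
    return stats.ncf.sf(f_crit, df1, df2, lam)

# smallest n reaching 80% power for the smallest meaningful R-squared (0.05)
n = 10
while power_for_n(n, 0.05, 0.25, n_tested=2, n_controls=6) < 0.80:
    n += 1
print(n)
```

Raising the target R-squared in the call shrinks the required n quickly at first and then barely at all - the flattening curve described above.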

So based on the power analysis, I knew I needed about 140 people. The survey was quite long, so we expected a lot of people to stop before they were finished; as a result, I adjusted this number up so that even if I had to drop a bunch of data because people didn't finish, I would still have at least 140 usable cases. Surprisingly, this wasn't an issue - we ended up with complete data for 257 participants, 251 of whom were Facebook users.