Monday, April 16, 2018

N is for N (Sample Size) Estimation: Power Analysis in R

N Estimation We're pushing forward in the Blogging A to Z Challenge! Today, I'll talk about conducting power analyses in R with the pwr package. It's amazing the number of studies designed and conducted without the aid of power analysis. Once I learned how to do them in grad school, I couldn't see any other way to design a study - how else do you know how many people you need? Conducting a study is no guarantee you'll find a significant effect, but if there is one to be found, you want to make sure you're maximizing the chance that you'll find it. Otherwise, you might go to a lot of time and effort to do a study and then have nothing to show for it.

To estimate sample, you need a few pieces of information, one of which is the effect size you're hoping to detect. Later this month, I'll show how to conduct a meta-analysis in R, which is really useful to do on a small scale to generate this effect size and help inform a power analysis. I'll talk about that in those future posts, and I'm hoping to write an article for a methods journal going into detail about how to do a mini meta-analysis, using the one I did for my dissertation as a working example.

The other pieces of information you need, besides effect size, are:
  • The statistical analysis you plan to use.
  • Power - 1-Beta; convention is 0.8, or an 80% chance that, if there is a significant difference, you'll detect it, but 0.9 is sometimes used.
  • Significance level - what alpha you'll be using for your significance tests; 0.05 is the convention.
The pwr package allows you conduct power analysis for the following statistical analyses:
  • One-Sample Proportion: comparing a proportion to a population value
  • Two-Sample Proportion: comparing proportions from two samples with either equal or unequal n's
  • One-Sample T-Test: comparing a sample mean to a population mean
  • Paired-Sample T-Test: comparing means at two time points in one sample or from two matched samples
  • Two-Sample T-Test: comparing means from two samples with either equal or unequal n's
  • Chi-Square Test
  • One-Way ANOVA: comparing means from three or more samples
  • General Linear Model Test: including regressions and Factorial ANOVAs
  • Correlation
Each power analysis function requires 3 of the 4 pieces of information: effect size, sample size, significance level, and power. To estimate sample size, leave that piece of information out of the function, or set its value as NULL. But you could also use these functions to conduct a post-hoc power analysis - what was your power given the effect size, significance level, and sample size you used? To do that, you would just leave out power.

First, let's do a very simple power analysis. In last year's Blogging A to Z, I used an ongoing example of a (fictional) study on the impact of caffeine on test performance, and I demonstrated how to conduct a t-test using some data I generated. Descriptive statistics for the two groups showed:

Experimental group: M = 83.2, SD = 6.21
Control group: M = 79.3, SD = 6.40

I can compute an effect size - for two means, the effect size you'd use is Cohen's d.

caffeine<-read.delim("caffeine_study.txt", header=TRUE)
caffeine$group<-factor(caffeine$group, labels=c("Control","Experimental"))
## Warning: package 'effsize' was built under R version 3.4.4
cohen.d(caffeine$score, caffeine$group, pooled=TRUE, conf.level=0.95)
## Cohen's d
## d estimate: 0.62398 (medium)
## 95 percent confidence interval:
##        inf        sup 
## 0.09471117 1.15324889

My effect size is about 0.624. Let's say you want to replicate my study, but rather than just picking a random number of people to have in each group, you want to determine how many you need to detect an effect of that size. For that, we'll use the pwr package.

## Warning: package 'pwr' was built under R version 3.4.4
pwr.t.test(d=0.624, sig.level=0.05, power=0.8,
           type = "two.sample", alternative = "two.sided")
##      Two-sample t test power calculation 
##               n = 41.29776
##               d = 0.624
##       sig.level = 0.05
##           power = 0.8
##     alternative = two.sided
## NOTE: n is number in *each* group

To detect an effect of that size, you want about 41 people per group, if you're conducting a two-sided test. If you instead want one-sided, you'd specify whether you expect your comparison group to be "less" or "greater".

pwr.t.test(d=0.624, sig.level=0.05, power=0.8,
           type = "two.sample", alternative = "greater")
##      Two-sample t test power calculation 
##               n = 32.45438
##               d = 0.624
##       sig.level = 0.05
##           power = 0.8
##     alternative = greater
## NOTE: n is number in *each* group

So if you predict ahead of time that the experimental group will have a higher score, and only want to look for a significant difference in that direction, you need about 32 people per group. This tells me that the fake study I designed (and just randomly picked sample sizes for) was slightly underpowered, since I only had 30 people in each group. (Not to mention these are fake data I set up to have a very large effect. Effects as large as d = 0.624 aren't all that common in the wild.)

What if you wanted to do something a bit fancier, like estimate necessary sample size for a range of potential effect sizes? This next example is based on a program I recently wrote for a problem at work. (Values and some details have been changed.) A state board requires school programs to have a pass rate on a key exam of 80%. If a program misses that pass rate one year, they go on probation. If they miss it for 3 consecutive years, they are suspended and can no longer be considered a state board-approved school. The board has called up a statistician (hi) to see if a handful of school programs have a pass rate significantly lower than this 80% cutoff. The problem is, some of these school programs are small, and as a result, their pass rates may vary a lot from year to year, just by luck of the draw.

In addition to sharing a report of the study results, the statistician wants to include a power analysis, to show how many cases are necessary before she can reliably say a pass rate is too low. She wants to show necessary sample for a range of school program pass rates (say between 50% and 79%), and for two levels of power, 0.8 (convention) and 0.9. Here's how we could go about conducting that power analysis.

First, we need to generate our effect sizes. This is a one-sample proportion, where we're comparing a range of school pass rates (50-79%) to a population value (80%). In the pwr package, the effect size for this proportion is called h. You can access it with the pwr function, ES.h.

p1s <- seq(.5,.79,.01)
h <- ES.h(p1 = p1s, p2 = 0.80)
nh <- length(h)

Next, we'll create a list of our two power levels>

p <- c(.8,.9)
np <- length(p)

You'll notice I also requested the length of these lists. That becomes important now when I tell R to combine these two to start generating my target sample sizes, using a for loop.

sizes <- array(numeric(nh*np), dim=c(nh,np))
for (i in 1:np){
  for (j in 1:nh){
    pow_an <- pwr.p.test(n = NULL, h = h[j],
                         sig.level = .05, power = p[i],
                         alternative = "less")
    sizes[j,i] <- ceiling(pow_an$n)

Now I have data in two columns, target sample sizes for each value of h as my rows, with power = 0.8 in one column and power = 0.9 in the other. I'll rename my columns, so I can easily see which is which, and I'll add in the school pass rates I used, so that I can plot my results with that information on the x-axis.

samp <- data.frame(cbind(p1s,sizes))
colnames(samp)<-c("Pass_Rate","Power.8", "Power.9")
##    Pass_Rate         Power.8           Power.9        
##  Min.   :0.5000   Min.   :   15.0   Min.   :   21.00  
##  1st Qu.:0.5725   1st Qu.:   25.5   1st Qu.:   34.75  
##  Median :0.6450   Median :   51.0   Median :   71.00  
##  Mean   :0.6450   Mean   :  554.2   Mean   :  767.70  
##  3rd Qu.:0.7175   3rd Qu.:  167.2   3rd Qu.:  231.00  
##  Max.   :0.7900   Max.   :10075.0   Max.   :13956.00

For power of 0.8, necessary sample size ranges from 15 to over 10,000. For power of 0.9, the range is 21 to almost 14,000. Now I can start plotting my results from my samp data frame. If I only wanted to show results for one power level (0.8):

passrates<-ggplot(samp, aes(x=p1s, y = Power.8)) +
  geom_point() +
passrates + scale_y_continuous(breaks=seq(0,11000,1000)) +
  scale_x_continuous(labels = scales::percent) +
  labs(title = "Comparing Pass Rates Below 80%",
       x = "Pass Rate",
       y = "N Needed to Detect Significant Difference") +

If I wanted to plot both:

bothpowers<-ggplot(samp, aes(p1s)) +
  geom_line(aes(y=Power.8, colour="0.8")) +
  geom_line(aes(y=Power.9, colour="0.9")) +
  scale_y_continuous(breaks=seq(0,14000,1000)) +
  scale_x_continuous(labels = scales::percent) +
  labs(title = "Comparing Pass Rates Below 80%",
       x = "Pass Rate", y = "N Needed to Detect Significant Difference", colour="Power") +

It doesn't appear that bumping up to 0.9 power makes a huge difference in terms of sample size, except for the smallest comparison - 79% vs. 80%. But either way, the recommendation I would make to the state board is that pass rates should be 70% or less before we can say a program is reliably falling below the board's cutoff. This is especially true for smaller programs. Instead, for programs close to the cutoff, some form of remediation may be a better strategy. If they consistently fall below 70%, then drastic action could be taken.

You can learn more about the pwr package in its manual.

No comments:

Post a Comment