Thursday, April 19, 2018

Q is for qplot

You may have noticed that I frequently use the ggplot2 package and the ggplot function to produce graphics for my posts. ggplot2, which is part of the so-called tidyverse, takes the "gg" in its name from the "grammar of graphics": it uses standard functions and arguments to produce any number of graphics, and you change the appearance of those graphics by applying different settings. The nice thing about this type of syntax is that once you learn it for one type of graphic - say, a histogram - it's very easy to expand to other types - like scatterplots - without having to learn brand-new functions. ggplot is a great way to create high-quality, publication-ready graphics.

But sometimes you don't need high-quality, publication-ready. Sometimes you just need a quick look at the data and you don't care if you have axis labels or centered titles. You just need to make certain there isn't anything wonky about your data as you clean and/or analyze. Fortunately, ggplot2 has a great function for that - qplot (or quick plot).

As with ggplot, qplot has a standard function and set of arguments, so once you learn to do it for one type of graphic, you can easily expand to others. And qplot has some smart rules built in to default to two of the most frequently used charts (particularly for quick looks at the data): histograms and scatterplots. Why are these most frequently used, especially in cleaning and early stages of analysis? A histogram lets you see if your variable is approximately normal; this is important because many statistical tests (and most of them you would have learned in an Introductory Statistics course) are built on the assumption that data are normally distributed. A scatterplot lets you see if your variables are related to each other, and whether that relationship is linear or not; once again, many statistical tests are built on assumptions about linear relationships between variables. So it makes sense that, if you're taking a quick look, you'll probably be using one of these two graphics.

The default graphics are very easy to produce: if you give only an x variable, you'll get a histogram, and if you give both x and y, you'll get a scatterplot. I'll use the Facebook data once again to demonstrate. I also went ahead and scored the RRS and SBI (described below) here - you can find code for scoring all measures here.

Facebook<-read.delim(file="small_facebook_set.txt", header=TRUE)
# Total Ruminative Response Scale (RRS) score: sum of the 22 RRS items
Facebook$RRS<-rowSums(Facebook[,3:24])
# Reverse-score an item: y = (max + min) - x
reverse<-function(max,min,x) {
  y<-(max+min)-x
  return(y)
}
# Reverse-code the even-numbered SBI items (1-7 scale)
Facebook$Sav2R<-reverse(7,1,Facebook$Sav2)
Facebook$Sav4R<-reverse(7,1,Facebook$Sav4)
Facebook$Sav6R<-reverse(7,1,Facebook$Sav6)
Facebook$Sav8R<-reverse(7,1,Facebook$Sav8)
Facebook$Sav10R<-reverse(7,1,Facebook$Sav10)
Facebook$Sav12R<-reverse(7,1,Facebook$Sav12)
Facebook$Sav14R<-reverse(7,1,Facebook$Sav14)
Facebook$Sav16R<-reverse(7,1,Facebook$Sav16)
Facebook$Sav18R<-reverse(7,1,Facebook$Sav18)
Facebook$Sav20R<-reverse(7,1,Facebook$Sav20)
Facebook$Sav22R<-reverse(7,1,Facebook$Sav22)
Facebook$Sav24R<-reverse(7,1,Facebook$Sav24)
# Total Savoring Beliefs Inventory (SBI) score: reverse-coded items plus the remaining items
Facebook$SBI<-Facebook$Sav2R+Facebook$Sav4R+Facebook$Sav6R+
  Facebook$Sav8R+Facebook$Sav10R+Facebook$Sav12R+Facebook$Sav14R+
  Facebook$Sav16R+Facebook$Sav18R+Facebook$Sav20R+Facebook$Sav22R+
  Facebook$Sav24R+Facebook$Sav1+Facebook$Sav3+Facebook$Sav5+
  Facebook$Sav7+Facebook$Sav9+Facebook$Sav11+Facebook$Sav13+Facebook$Sav15+
  Facebook$Sav17+Facebook$Sav19+Facebook$Sav21+Facebook$Sav23
library(ggplot2)

I'll use a scale I haven't really used in this series - the Savoring Beliefs Inventory. This measure was created by Fred Bryant, who was my faculty sponsor for this research (since I was still a grad student at the time). Fred also taught me structural equation modeling. The measure assesses a concept Fred calls savoring - attending to positive events and feelings in order to hold onto that sense of joy and pleasure. I included this measure because, as I mentioned to Fred, I felt savoring was the opposite of rumination. (While he thought I'd made a good point, he told me he thought of savoring as the opposite of coping, which makes sense.)

Using the qplot function, we can quickly generate a histogram with total SBI score.

qplot(SBI, data=Facebook)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

This variable shows a negative skew: there is a long tail (fewer cases than we'd expect if this followed the normal distribution) at the low end, the highest part of the distribution is to the right of center, and there is much less of a tail at the high end (more cases than we'd expect if this followed the normal distribution). We're also getting a message about bins. Right now, the histogram is slicing up the values between the minimum and the maximum into 30 bars. We can reduce this number to smooth out the distribution.

qplot(SBI, data=Facebook, bins=15)

This helps make the shape of the distribution clearer - we have a definite negative skew.

Now, regardless of whether the psychological opposite of savoring is coping or rumination, Fred and I both agreed that savoring and rumination would be negatively correlated. We can quickly demonstrate this, thanks to qplot with two variables.

qplot(SBI, RRS, data=Facebook)
cor(Facebook$SBI, Facebook$RRS)
## [1] -0.3510101

I also requested the correlation coefficient for these two variables, which is -0.35. This is a moderate negative correlation.

There are many other things you can do with qplot. First, you could generate separate graphics for groups, using facets.

Facebook$gender<-factor(Facebook$gender, labels=c("Male","Female"))
qplot(SBI, RRS, data=Facebook, facets=~gender)

Alternatively, we could display the points for men and women in different colors on a single chart.

qplot(SBI, RRS, data=Facebook, colour=gender)

You can also change the type of graphic with the geom argument.

qplot(gender, data=Facebook, geom="bar")
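
For instance - a quick sketch using the same Facebook data frame - you could ask for a boxplot of rumination scores by gender, or a density curve instead of a histogram:

qplot(gender, RRS, data=Facebook, geom="boxplot")
qplot(RRS, data=Facebook, geom="density")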

There are other things you can do, such as manually setting the limits for the x- or y-axis (with xlim=c(min,max) or ylim=c(min,max)) or log-transforming one or both variables. To demonstrate the latter, I'll bring back my power analysis dataset - projected sample sizes for proportion comparisons with power of 0.8 or 0.9 - and log-transform the sample size variable.
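
Before that, here's a minimal sketch of setting the axis limits by hand; the limit values below are arbitrary and purely for illustration:

qplot(SBI, RRS, data=Facebook, xlim=c(24,168), ylim=c(0,88))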

library(pwr)
## Warning: package 'pwr' was built under R version 3.4.4
p1s <- seq(.5,.79,.01)
h <- ES.h(p1 = p1s, p2 = 0.80)
nh <- length(h)
p <- c(.8,.9)
np <- length(p)
sizes <- array(numeric(nh*np), dim=c(nh,np))
for (i in 1:np){
   for (j in 1:nh){
       pow_an <- pwr.p.test(n = NULL, h = h[j],
                            sig.level = .05, power = p[i],
                            alternative = "less")
       sizes[j,i] <- ceiling(pow_an$n)
   }
}
samp <- data.frame(cbind(p1s,sizes))
colnames(samp)<-c("Pass_Rate","Power.8", "Power.9")
qplot(Pass_Rate, Power.9, data=samp, log="y", ylab="Log-Transformed Sample Size")

As you can see, I can add custom labels - by default, qplot displays variable names, which is fine for a quick look. But I wanted to specifically call out that y was log-transformed so my little brain didn't get confused if I had to walk away and come back.

Tomorrow will be a code-free post giving a short history of R. And back to coding posts Saturday!

Wednesday, April 18, 2018

P is for Principal Components Analysis (PCA)

This month, I've talked about some approaches for examining the underlying latent variables (or factors) in a set of measurement data. Confirmatory factor analysis is one method of examining latent variables when you know ahead of time which observed variables are associated with which factor. But it isn't the only way to do this. Sometimes you need to test for the presence of latent variables without knowing the structure in advance. Exploratory factor analysis is one way, but if I'm doing exploratory latent variable analysis, I tend to prefer an approach called principal components analysis instead.

Both principal components analysis and exploratory factor analysis have the goal of reducing the amount of data you have to work with, by looking for how individual observed variables hang together. While exploratory factor analysis has the goal of explaining why observed variables relate to each other (because of the influence of the latent variable on the observed variables), principal components analysis has the goal of simply describing the observed variables that relate to each other and appear to measure the same thing (together they result in a score on the latent variable). So a factor analysis model would be drawn with arrows going from the factor to the observed variables, while a principal components analysis would be drawn with arrows going from the observed variables to the component.

More mathematically speaking, principal components analysis explains variance while factor analysis explains covariance (between items). In Winsteps, the Rasch program I use most frequently, the default dimension reduction technique is principal components analysis (PCA), and its goal there is mainly to ensure that you aren't violating a key assumption of Rasch: that each item measures only one underlying latent construct. We refer to this as the assumption of unidimensionality. If the PCA shows the data violate this assumption, its results can be used to divide the items into subscales, and Rasch analysis can then be performed on each subscale separately.

PCA output includes many important pieces, which I'll go through once we get to the analysis. But two I want to highlight first are the eigenvalues and the factor loadings. Eigenvalues are decompositions of variance that help uncover exactly how many factors are present in the data. You'll get multiple eigenvalues in your analysis - one for each potential component - based on splitting the items up into components according to their patterns of correlation. If the first eigenvalue is high and the rest are small (less than 1 or 2, depending on who you ask), that means there is only one component. You want the number of high eigenvalues to be less than the number of items in your measure; if the number of high eigenvalues equals the number of items, the items don't hang together and aren't measuring the same thing. Basically, this information tells you how many subscales appear to be in your data.

Next are the factor loadings, which can be positive or negative and range from -1 to 1; they tell you how strongly an item relates to a specific factor or component. Loadings with an absolute value greater than 0.3 or 0.4 (again, depending on who you ask) mean an item should be associated with that factor. For most measures, you only want an item to have a high loading on one factor. So you want each item to have a high loading (closer to 1 in absolute value is better) on one factor and loadings close to 0 on the rest of the factors.

You can easily conduct PCA using the psych package. I was originally going to write today's post on the psych package, but it's huge and there's too much to cover. (Yes, I did just include "huge" and "package" in the same sentence on a stats blog.) There's a great website, maintained by the package developer, William Revelle, where you can learn about the many capabilities offered by psych. I frequently use the psych package to run quick descriptive statistics on a dataset (with the describe function). And I also used psych to demonstrate Cronbach's alpha way back at the beginning of this month (which feels like it was years ago instead of just a couple weeks). psych is a great general purpose package for psychological researchers with a heavy focus on psychometrics, so I'd definitely add it to your R packages if you haven't already done so.

So for the sake of simplicity and brevity, I'll focus today on the principal function in the psych package. To demonstrate, I'll use the Facebook dataset, since it contains multiple scales I can use with the principal function. (This isn't how you'd use PCA in practice - since these are developed scales with a known factor structure, confirmatory factor analysis is probably the best approach. But let's pretend for now that these are simply items created toward new scales and that we're looking for ways they might hang together.)

We'll start by installing (if necessary)/loading the psych package, as well as reading in our Facebook dataset.

install.packages("psych")
## Installing package into '/R/win-library/3.4'
## (as 'lib' is unspecified)
library(psych)
## Warning: package 'psych' was built under R version 3.4.4
Facebook<-read.delim("small_facebook_set.txt", header=TRUE)

Now let's dig into the principal function. You can quickly view what arguments a function allows by requesting args:

args(principal)
## function (r, nfactors = 1, residuals = FALSE, rotate = "varimax", 
##     n.obs = NA, covar = FALSE, scores = TRUE, missing = FALSE, 
##     impute = "median", oblique.scores = TRUE, method = "regression", 
##     ...) 
## NULL

r refers to the object you want to analyze - in our case, it will be a data frame, but you could also use a correlation matrix, as long as you also specify n.obs. You can request that the program extract a certain number of factors by setting nfactors. Residuals simply asks if you want the program to report residuals in your output.
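
As a quick sketch of that second option - using the RRS items from the Facebook data we just read in (we'll pull these out properly in a moment) - you could hand principal a correlation matrix plus the number of observations:

# Same analysis, but starting from a correlation matrix instead of raw data
rrs_items<-Facebook[,3:24]
principal(cor(rrs_items), nfactors=1, n.obs=nrow(rrs_items))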

The next argument, rotate, is a bit more complicated. There are multiple rotation methods available: "none", "varimax", "quartimax", "promax", "oblimin", "simplimax", and "cluster". But you're probably asking - what exactly is "rotation"? Rotation adds another step into the analysis to help understand and clarify the pattern of loadings. Basically, you can make one of two assumptions about the factors/components in your analysis: they are correlated (oblique) or they are uncorrelated (orthogonal). There are different rotation methods to use depending on which of these assumptions you make. You can learn more about rotation here. If you assume your factors are orthogonal, most would recommend using varimax, which is the default rotation method in the principal function. If you assume your factors are correlated, promax is often recommended. Some discourage rotation completely, which is why none is an option. We'll try varimax, promax, and no rotation for demonstration purposes.

Setting scores to TRUE asks the program to generate scale/subscale scores on the factor(s) identified. The missing and impute arguments are used when you have missing values and want the program to score results; they tell the program to impute missing values, with either the "median" or "mean", when generating scores. method also relates to the component scores; with "regression", the default, scale/subscale scores are generated with a linear equation. Finally, oblique.scores refers to the matrix (structure versus pattern) used in conducting the PCA; this is a more advanced concept that we won't get into now, and you can leave the argument alone for the moment. But maybe I should dig into this topic more in a future post or two.
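
Here's a sketch of how those scoring arguments fit together - again using the RRS items - and how you'd pull the resulting component scores out of the object principal returns (missing and impute only matter if the data contain NAs; they're shown here for illustration):

pca_scored<-principal(Facebook[,3:24], nfactors=1, scores=TRUE,
                      missing=TRUE, impute="median")
head(pca_scored$scores)  # one component score per respondent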

We're now ready to try running a PCA with some of the Facebook data. Let's turn once again to our Ruminative Response Scale data. Remember that this measure can generate a total score, as well as scores on 3 subscales: Depression-Related Rumination, Brooding, and Reflecting. To make it easy to access just the RRS items, I'm going to create another data frame that copies over responses on these items. Then I'll run a PCA, using the default rotation method, on this data frame.

RRS<-Facebook[,3:24]   # just the 22 RRS items
RRSpca<-principal(RRS) # PCA with the default settings (1 component)

Now I have an object called RRSpca that holds the results of this analysis. First, let's take a look at our eigenvalues:

RRSpca$values
##  [1] 8.8723501 1.6920860 1.1413928 1.1209452 1.0072739 0.8236229 0.7535618
##  [8] 0.6838558 0.6530697 0.6427205 0.5703730 0.5616363 0.4937120 0.4860600
## [15] 0.4218528 0.3830052 0.3388684 0.3185341 0.2863681 0.2675727 0.2468134
## [22] 0.2343256
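
If you'd rather eyeball this than read the numbers, a quick scree-style plot of those eigenvalues (base R is fine for this kind of quick look) shows the sharp drop after the first component:

# Scree plot: eigenvalue for each successive component
plot(RRSpca$values, type="b", xlab="Component", ylab="Eigenvalue")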

Most of the variance is accounted for by the first eigenvalue, which corresponds to a single component containing all of the items. This supports reporting a total score for the RRS; it doesn't support the presence of the 3 subscales, though. Let's take a look at the rest of our results:

RRSpca
## Principal Components Analysis
## Call: principal(r = RRS)
## Standardized loadings (pattern matrix) based upon correlation matrix
##        PC1   h2   u2 com
## Rum1  0.62 0.38 0.62   1
## Rum2  0.53 0.28 0.72   1
## Rum3  0.50 0.25 0.75   1
## Rum4  0.57 0.32 0.68   1
## Rum5  0.57 0.32 0.68   1
## Rum6  0.64 0.41 0.59   1
## Rum7  0.64 0.41 0.59   1
## Rum8  0.64 0.41 0.59   1
## Rum9  0.62 0.38 0.62   1
## Rum10 0.66 0.44 0.56   1
## Rum11 0.61 0.37 0.63   1
## Rum12 0.35 0.12 0.88   1
## Rum13 0.55 0.30 0.70   1
## Rum14 0.70 0.49 0.51   1
## Rum15 0.69 0.47 0.53   1
## Rum16 0.76 0.58 0.42   1
## Rum17 0.74 0.54 0.46   1
## Rum18 0.71 0.50 0.50   1
## Rum19 0.71 0.50 0.50   1
## Rum20 0.75 0.56 0.44   1
## Rum21 0.57 0.33 0.67   1
## Rum22 0.71 0.51 0.49   1
## 
##                 PC1
## SS loadings    8.87
## Proportion Var 0.40
## 
## Mean item complexity =  1
## Test of the hypothesis that 1 component is sufficient.
## 
## The root mean square of the residuals (RMSR) is  0.08 
##  with the empirical chi square  725.56  with prob <  3.2e-58 
## 
## Fit based upon off diagonal values = 0.96

All of the factor loadings, shown in the column PC1, are greater than 0.3. The lowest loading is 0.35, for item 12. I tend to use a cut-off of 0.4, so I might raise an eyebrow at this item, and if I were developing a scale, I might consider dropping it. The h2 column gives each item's communality and u2 its uniqueness; you'll notice they add up to 1. Communality is the proportion of an item's variance explained by the extracted component(s), while uniqueness is the variance left unexplained - a mix of variance specific to that item and measurement error. The last column is complexity, which we'll ignore for now.
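
With only one component extracted, you can recover the h2 column yourself by squaring the loadings - a quick sanity check that communality really is just the variance explained by the component:

# For a one-component solution, communality = squared loading on that component
round(RRSpca$loadings[,1]^2, 2)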

Near the bottom of the output, you'll see the SS (sum of squares) loadings, which here equal the first eigenvalue, as well as Proportion Var - this tells us that 40% of the variance in responses can be explained by the first component. Finally, you'll see some familiar fit statistics - the root mean square of the residuals (RMSR), an empirical chi-square, and fit based upon off-diagonal values (a goodness-of-fit statistic, with values closer to 1 indicating better fit).

Overall, these values support moderate to good fit. Since there's only 1 factor/component, rotation method doesn't really matter; remember, rotation refers to the correlations between factors or components, and we only have one. But let's try running these same data again, using the varimax rotation, and forcing the program to extract 3 components.

RRSpca_3<-principal(RRS, nfactors = 3)
RRSpca_3$values
##  [1] 8.8723501 1.6920860 1.1413928 1.1209452 1.0072739 0.8236229 0.7535618
##  [8] 0.6838558 0.6530697 0.6427205 0.5703730 0.5616363 0.4937120 0.4860600
## [15] 0.4218528 0.3830052 0.3388684 0.3185341 0.2863681 0.2675727 0.2468134
## [22] 0.2343256
RRSpca_3
## Principal Components Analysis
## Call: principal(r = RRS, nfactors = 3)
## Standardized loadings (pattern matrix) based upon correlation matrix
##         RC1  RC3   RC2   h2   u2 com
## Rum1   0.59 0.19  0.24 0.45 0.55 1.5
## Rum2   0.05 0.65  0.23 0.48 0.52 1.3
## Rum3   0.43 0.43 -0.09 0.38 0.62 2.1
## Rum4   0.24 0.73 -0.04 0.58 0.42 1.2
## Rum5   0.60 0.21  0.10 0.42 0.58 1.3
## Rum6   0.37 0.63  0.06 0.54 0.46 1.6
## Rum7   0.41 0.17  0.57 0.53 0.47 2.0
## Rum8   0.32 0.44  0.36 0.42 0.58 2.8
## Rum9   0.19 0.69  0.18 0.54 0.46 1.3
## Rum10  0.24 0.54  0.38 0.50 0.50 2.2
## Rum11  0.23 0.18  0.74 0.63 0.37 1.3
## Rum12 -0.02 0.04  0.72 0.52 0.48 1.0
## Rum13  0.58 0.17  0.14 0.39 0.61 1.3
## Rum14  0.29 0.69  0.21 0.61 0.39 1.5
## Rum15  0.70 0.24  0.18 0.58 0.42 1.4
## Rum16  0.54 0.47  0.27 0.59 0.41 2.5
## Rum17  0.70 0.15  0.40 0.67 0.33 1.7
## Rum18  0.69 0.25  0.23 0.59 0.41 1.5
## Rum19  0.57 0.47  0.13 0.56 0.44 2.1
## Rum20  0.44 0.33  0.56 0.61 0.39 2.6
## Rum21  0.27 0.10  0.71 0.59 0.41 1.3
## Rum22  0.43 0.41  0.40 0.51 0.49 3.0
## 
##                        RC1  RC3  RC2
## SS loadings           4.45 4.03 3.22
## Proportion Var        0.20 0.18 0.15
## Cumulative Var        0.20 0.39 0.53
## Proportion Explained  0.38 0.34 0.27
## Cumulative Proportion 0.38 0.73 1.00
## 
## Mean item complexity =  1.8
## Test of the hypothesis that 3 components are sufficient.
## 
## The root mean square of the residuals (RMSR) is  0.06 
##  with the empirical chi square  463.32  with prob <  1.9e-29 
## 
## Fit based upon off diagonal values = 0.97

I showed the eigenvalues once again, so you can see that they're the same: they're based on the actual variance in the data, not on the particular solution you're forcing on the data. Taking a look at the factor loadings, we see that certain items load highly on more than one component. This is where complexity comes in - it tells us how many factors an item loads on. Values close to 1 mean an item loads cleanly on one and only one factor. Values of 2 or greater suggest an item is measuring two or more latent constructs simultaneously. Depending on the scale you're trying to create and how you plan to use it, this could be very problematic.

However, we do know that these subscales are correlated with each other. Let's see what happens when we use an oblique rotation instead.

RRSpca_3ob<-principal(RRS, nfactors = 3, rotate="promax")
RRSpca_3ob
## Principal Components Analysis
## Call: principal(r = RRS, nfactors = 3, rotate = "promax")
## Standardized loadings (pattern matrix) based upon correlation matrix
##         RC1   RC3   RC2   h2   u2 com
## Rum1   0.67 -0.06  0.08 0.45 0.55 1.0
## Rum2  -0.29  0.79  0.15 0.48 0.52 1.3
## Rum3   0.41  0.39 -0.30 0.38 0.62 2.8
## Rum4   0.00  0.84 -0.23 0.58 0.42 1.1
## Rum5   0.70 -0.01 -0.08 0.42 0.58 1.0
## Rum6   0.20  0.64 -0.13 0.54 0.46 1.3
## Rum7   0.36 -0.05  0.51 0.53 0.47 1.8
## Rum8   0.15  0.37  0.25 0.42 0.58 2.1
## Rum9  -0.09  0.78  0.04 0.54 0.46 1.0
## Rum10 -0.01  0.54  0.28 0.50 0.50 1.5
## Rum11  0.06  0.02  0.75 0.63 0.37 1.0
## Rum12 -0.21 -0.05  0.82 0.52 0.48 1.1
## Rum13  0.67 -0.06 -0.03 0.39 0.61 1.0
## Rum14  0.03  0.74  0.05 0.61 0.39 1.0
## Rum15  0.80 -0.03 -0.03 0.58 0.42 1.0
## Rum16  0.45  0.33  0.08 0.59 0.41 1.9
## Rum17  0.79 -0.18  0.23 0.67 0.33 1.3
## Rum18  0.77 -0.02  0.03 0.59 0.41 1.0
## Rum19  0.52  0.34 -0.09 0.56 0.44 1.8
## Rum20  0.31  0.15  0.46 0.61 0.39 2.0
## Rum21  0.16 -0.11  0.73 0.59 0.41 1.1
## Rum22  0.31  0.28  0.27 0.51 0.49 3.0
## 
##                        RC1  RC3  RC2
## SS loadings           4.76 4.02 2.93
## Proportion Var        0.22 0.18 0.13
## Cumulative Var        0.22 0.40 0.53
## Proportion Explained  0.41 0.34 0.25
## Cumulative Proportion 0.41 0.75 1.00
## 
##  With component correlations of 
##      RC1  RC3  RC2
## RC1 1.00 0.68 0.52
## RC3 0.68 1.00 0.46
## RC2 0.52 0.46 1.00
## 
## Mean item complexity =  1.5
## Test of the hypothesis that 3 components are sufficient.
## 
## The root mean square of the residuals (RMSR) is  0.06 
##  with the empirical chi square  463.32  with prob <  1.9e-29 
## 
## Fit based upon off diagonal values = 0.97

In some cases, this improved our solution, bringing complexity below 2. In others, such as for item 3, it made the problem worse. But using this rotation, we only have a few items we'd need to drop if we wanted to ensure a simple solution - one in which each item loads significantly onto only 1 factor. Just for fun, let's see what happens if we use no rotation.

RRSpca_3no<-principal(RRS, nfactors = 3, rotate="none")
RRSpca_3no
## Principal Components Analysis
## Call: principal(r = RRS, nfactors = 3, rotate = "none")
## Standardized loadings (pattern matrix) based upon correlation matrix
##        PC1   PC2   PC3   h2   u2 com
## Rum1  0.62  0.06 -0.26 0.45 0.55 1.4
## Rum2  0.53 -0.20  0.41 0.48 0.52 2.2
## Rum3  0.50 -0.35 -0.12 0.38 0.62 1.9
## Rum4  0.57 -0.47  0.21 0.58 0.42 2.2
## Rum5  0.57 -0.07 -0.30 0.42 0.58 1.6
## Rum6  0.64 -0.34  0.09 0.54 0.46 1.6
## Rum7  0.64  0.34 -0.01 0.53 0.47 1.5
## Rum8  0.64  0.02  0.13 0.42 0.58 1.1
## Rum9  0.62 -0.27  0.29 0.54 0.46 1.9
## Rum10 0.66 -0.02  0.25 0.50 0.50 1.3
## Rum11 0.61  0.48  0.19 0.63 0.37 2.1
## Rum12 0.35  0.55  0.30 0.52 0.48 2.3
## Rum13 0.55 -0.01 -0.29 0.39 0.61 1.5
## Rum14 0.70 -0.25  0.24 0.61 0.39 1.5
## Rum15 0.69 -0.03 -0.33 0.58 0.42 1.4
## Rum16 0.76 -0.09 -0.05 0.59 0.41 1.0
## Rum17 0.74  0.20 -0.30 0.67 0.33 1.5
## Rum18 0.71  0.01 -0.30 0.59 0.41 1.3
## Rum19 0.71 -0.20 -0.12 0.56 0.44 1.2
## Rum20 0.75  0.23  0.05 0.61 0.39 1.2
## Rum21 0.57  0.50  0.11 0.59 0.41 2.0
## Rum22 0.71  0.06  0.04 0.51 0.49 1.0
## 
##                        PC1  PC2  PC3
## SS loadings           8.87 1.69 1.14
## Proportion Var        0.40 0.08 0.05
## Cumulative Var        0.40 0.48 0.53
## Proportion Explained  0.76 0.14 0.10
## Cumulative Proportion 0.76 0.90 1.00
## 
## Mean item complexity =  1.6
## Test of the hypothesis that 3 components are sufficient.
## 
## The root mean square of the residuals (RMSR) is  0.06 
##  with the empirical chi square  463.32  with prob <  1.9e-29 
## 
## Fit based upon off diagonal values = 0.97

We still have items loading onto more than one factor, but not nearly as many as when we use varimax rotation. The promax rotation seems to be superior here, though using no rotation produces quite similar results. Also, notice that our RMSR and goodness of fit have improved for the 3-factor solution. But if this were my brand new measure, and these were my pilot data, based on the eigenvalues, I'd probably opt for the 1-factor solution.

Hope you enjoyed this post! I still prefer CFA for this kind of analysis, and in my current job, rarely even use the PCA results provided by the Rasch program - we tend to use our content validation study results as support for dimensionality - but I have used PCA to analyze some survey data we collected. Each analysis technique can be useful in certain ways and for certain situations.

Tuesday, April 17, 2018

Beautiful Display of Beatles Songs

I briefly interrupt this Blogging A to Z to bring you this gorgeous poster that breaks down Beatles songs by the instruments included:


As described by Mental Floss:
"Come Together," [Pop Chart Lab]'s latest poster, breaks down the instruments featured in every single one of The Beatles's songs, from 1963's "I Saw Her Standing There" to 1970's "Get Back." The chart is broken down into five colors—one for each member of the Fab Four, plus one hue to represent various non-band members—and the icons show you which instrument each member plays in each tune, from the conventional (guitar) to the unique (tape loops and mellotrons).

O is for Overview Reports

One of the best things you can do is to create a study codebook to accompany your dataset. In it, you should include information about the study variables and how they were created/computed. It's also nice to have a summary of your dataset, all in one place, so you can quickly check for any issues, begin cleaning your data, and/or plan your analysis. But creating and formatting such a report, and running the various descriptive statistics and plots, can be rather tedious. What if an R package could do much of that for you?

dataMaid to the rescue! In addition to having some great functions to help streamline data cleaning, dataMaid can create an overview report of your dataset, containing the information you request, and generate an R Markdown file to which you could add descriptive information, like functions used to calculate variables, item text, and so on.

For today's post, I'll use the simulated Facebook dataset I created and shared. You can replicate the exact results I get if you use that dataset. After importing the file, I want to make certain all variables are of the correct type. If necessary, I can make some changes. This becomes important when we go to generate our report. I also want to score all my scales, so I have total and subscale scores in the file.

Facebook<-read.delim(file="simulated_facebook_set.txt", header=TRUE)
str(Facebook)
## 'data.frame': 257 obs. of  111 variables:
##  $ ID            : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ gender        : int  1 1 1 0 1 0 1 0 1 1 ...
##  $ Rum1          : int  3 2 3 2 2 2 0 4 1 0 ...
##  $ Rum2          : int  1 2 4 2 4 1 1 2 2 2 ...
##  $ Rum3          : int  3 3 2 2 2 2 0 1 2 2 ...
##  $ Rum4          : int  1 2 4 0 2 4 1 0 2 2 ...
##  $ Rum5          : int  3 1 2 2 1 3 1 0 0 2 ...
##  $ Rum6          : int  2 3 4 3 2 2 1 1 3 2 ...
##  $ Rum7          : int  3 1 4 3 0 3 3 4 3 2 ...
##  $ Rum8          : int  1 2 4 1 0 1 1 1 3 3 ...
##  $ Rum9          : int  3 0 2 0 1 0 3 2 2 0 ...
##  $ Rum10         : int  1 1 2 2 2 2 2 0 1 1 ...
##  $ Rum11         : int  1 0 0 3 1 2 3 0 4 3 ...
##  $ Rum12         : int  0 2 2 1 0 1 0 2 2 0 ...
##  $ Rum13         : int  4 2 3 3 3 2 1 1 2 2 ...
##  $ Rum14         : int  0 1 3 1 2 2 2 2 4 2 ...
##  $ Rum15         : int  2 2 1 2 2 2 2 1 3 0 ...
##  $ Rum16         : int  2 4 4 0 1 2 0 1 2 4 ...
##  $ Rum17         : int  1 2 2 2 1 3 2 1 2 3 ...
##  $ Rum18         : int  2 2 4 1 2 2 2 1 1 1 ...
##  $ Rum19         : int  0 2 2 1 2 4 2 2 1 0 ...
##  $ Rum20         : int  1 1 2 2 1 1 1 2 4 2 ...
##  $ Rum21         : int  2 2 1 1 1 1 1 3 4 0 ...
##  $ Rum22         : int  2 1 2 2 1 2 0 1 1 1 ...
##  $ Sav1          : int  5 6 7 4 5 6 6 7 6 7 ...
##  $ Sav2          : int  3 2 6 2 2 6 3 6 3 2 ...
##  $ Sav3          : int  7 7 7 6 7 5 6 6 7 7 ...
##  $ Sav4          : int  4 5 5 4 3 5 1 5 2 5 ...
##  $ Sav5          : int  7 6 7 7 5 6 6 6 4 6 ...
##  $ Sav6          : int  4 0 6 4 2 2 3 4 6 4 ...
##  $ Sav7          : int  3 6 5 6 7 7 7 7 7 6 ...
##  $ Sav8          : int  2 2 3 3 2 4 3 3 3 5 ...
##  $ Sav9          : int  6 4 6 6 6 6 6 7 6 5 ...
##  $ Sav10         : int  2 4 1 2 1 3 2 5 1 1 ...
##  $ Sav11         : int  3 3 6 2 6 6 4 1 3 4 ...
##  $ Sav12         : int  0 3 3 3 4 4 3 4 5 3 ...
##  $ Sav13         : int  3 7 7 4 4 3 5 5 7 4 ...
##  $ Sav14         : int  2 2 5 0 3 2 2 2 3 2 ...
##  $ Sav15         : int  5 6 5 5 4 7 4 6 7 7 ...
##  $ Sav16         : int  3 2 2 6 2 3 1 3 2 2 ...
##  $ Sav17         : int  6 3 6 6 5 4 6 6 6 5 ...
##  $ Sav18         : int  2 2 2 3 2 6 3 2 1 2 ...
##  $ Sav19         : int  6 7 6 6 6 7 2 4 6 4 ...
##  $ Sav20         : int  3 2 3 4 6 6 6 3 3 7 ...
##  $ Sav21         : int  6 3 3 6 4 7 6 6 7 4 ...
##  $ Sav22         : int  1 4 2 2 2 2 2 2 3 3 ...
##  $ Sav23         : int  7 7 6 4 6 7 6 4 6 5 ...
##  $ Sav24         : int  2 1 5 1 1 1 3 1 2 2 ...
##  $ LS1           : int  3 5 6 4 7 4 4 6 6 7 ...
##  $ LS2           : int  6 2 6 5 7 5 6 5 6 4 ...
##  $ LS3           : int  7 4 6 4 3 6 5 2 6 6 ...
##  $ LS4           : int  2 6 6 3 6 6 5 6 7 5 ...
##  $ LS5           : int  3 6 7 5 4 4 4 1 4 2 ...
##  $ Extraverted   : int  5 4 6 6 3 3 7 4 4 6 ...
##  $ Critical      : int  5 5 5 3 4 5 1 6 6 5 ...
##  $ Dependable    : int  6 6 6 5 6 6 6 6 7 7 ...
##  $ Anxious       : int  6 6 6 4 6 6 5 6 5 4 ...
##  $ NewExperiences: int  7 6 6 6 6 6 7 6 3 6 ...
##  $ Reserved      : int  3 4 6 5 5 5 3 4 7 2 ...
##  $ Sympathetic   : int  7 6 7 6 5 6 3 7 6 6 ...
##  $ Disorganized  : int  6 5 6 5 5 5 3 4 7 5 ...
##  $ Calm          : int  6 7 6 5 6 7 6 3 5 6 ...
##  $ Conventional  : int  3 4 2 3 2 2 2 2 3 3 ...
##  $ Health1       : int  1 1 4 2 1 3 2 3 3 0 ...
##  $ Health2       : int  2 1 1 0 2 1 0 2 2 1 ...
##  $ Health3       : int  2 1 2 0 1 2 3 2 2 1 ...
##  $ Health4       : int  0 2 0 0 1 0 0 1 0 0 ...
##  $ Health5       : int  1 1 1 0 1 1 1 0 3 0 ...
##  $ Health6       : int  0 0 2 0 0 0 1 0 1 0 ...
##  $ Health7       : int  0 0 3 2 0 1 1 2 1 0 ...
##  $ Health8       : int  2 3 4 2 1 1 1 0 2 2 ...
##  $ Health9       : int  2 3 3 1 2 4 4 2 2 3 ...
##  $ Health10      : int  0 0 1 1 0 1 1 1 0 0 ...
##  $ Health11      : int  0 1 2 1 2 0 0 0 2 0 ...
##  $ Health12      : int  0 2 1 2 0 1 1 0 0 0 ...
##  $ Health13      : int  0 2 2 0 2 3 1 1 0 1 ...
##  $ Health14      : int  2 2 3 0 0 2 1 2 2 0 ...
##  $ Health15      : int  2 1 1 0 1 0 0 0 1 0 ...
##  $ Health16      : int  1 1 3 0 2 3 2 1 0 0 ...
##  $ Health17      : int  0 0 0 2 2 3 0 2 1 0 ...
##  $ Health18      : int  1 4 1 1 0 0 0 1 2 0 ...
##  $ Health19      : int  0 2 2 0 0 1 0 0 2 1 ...
##  $ Health20      : int  2 1 2 0 1 1 0 0 0 0 ...
##  $ Health21      : int  1 0 1 1 0 2 0 1 1 1 ...
##  $ Health22      : int  3 1 2 0 2 4 2 2 0 2 ...
##  $ Health23      : int  1 0 3 2 2 0 2 3 2 2 ...
##  $ Health24      : int  0 0 1 1 1 0 0 2 1 0 ...
##  $ Health25      : int  0 3 1 2 2 0 2 0 1 0 ...
##  $ Health26      : int  1 0 0 0 0 0 2 1 1 2 ...
##  $ Health27      : int  2 1 0 1 1 0 0 1 0 1 ...
##  $ Health28      : int  0 3 2 0 1 3 0 2 3 2 ...
##  $ Health29      : int  1 2 1 0 1 1 2 1 1 2 ...
##  $ Health30      : int  0 0 0 2 0 0 0 0 0 0 ...
##  $ Health31      : int  1 0 0 0 0 2 1 0 0 0 ...
##  $ Health32      : int  2 1 2 1 2 2 2 1 2 0 ...
##  $ Dep1          : int  0 0 2 0 1 1 1 1 0 0 ...
##  $ Dep2          : int  0 1 0 0 0 0 0 0 1 2 ...
##  $ Dep3          : int  0 1 0 0 0 0 1 0 2 2 ...
##  $ Dep4          : int  1 0 1 1 0 0 0 1 1 0 ...
##   [list output truncated]
Facebook$ID<-as.character(Facebook$ID)
Facebook$gender<-factor(Facebook$gender, labels=c("Male","Female"))
Rumination<-Facebook[,3:24]
Savoring<-Facebook[,25:48]
SatwithLife<-Facebook[,49:53]
CHIPS<-Facebook[,64:95]
CESD<-Facebook[,96:111]
Facebook$RRS<-rowSums(Facebook[,3:24])
Facebook$RRS_D<-rowSums(Facebook[,c(3,4,5,6,8,10,11,16,19,20,21,24)])
Facebook$RRS_R<-rowSums(Facebook[,c(9,13,14,22,23)])
Facebook$RRS_B<-rowSums(Facebook[,c(7,12,15,17,18)])
reverse<-function(max,min,x) {
  y<-(max+min)-x
  return(y)
  }
Facebook$Sav2R<-reverse(7,1,Facebook$Sav2)
Facebook$Sav4R<-reverse(7,1,Facebook$Sav4)
Facebook$Sav6R<-reverse(7,1,Facebook$Sav6)
Facebook$Sav8R<-reverse(7,1,Facebook$Sav8)
Facebook$Sav10R<-reverse(7,1,Facebook$Sav10)
Facebook$Sav12R<-reverse(7,1,Facebook$Sav12)
Facebook$Sav14R<-reverse(7,1,Facebook$Sav14)
Facebook$Sav16R<-reverse(7,1,Facebook$Sav16)
Facebook$Sav18R<-reverse(7,1,Facebook$Sav18)
Facebook$Sav20R<-reverse(7,1,Facebook$Sav20)
Facebook$Sav22R<-reverse(7,1,Facebook$Sav22)
Facebook$Sav24R<-reverse(7,1,Facebook$Sav24)
Facebook$SBI<-Facebook$Sav2R+Facebook$Sav4R+Facebook$Sav6R+
  Facebook$Sav8R+Facebook$Sav10R+Facebook$Sav12R+Facebook$Sav14R+
  Facebook$Sav16R+Facebook$Sav18R+Facebook$Sav20R+Facebook$Sav22R+
  Facebook$Sav24R+Facebook$Sav1+Facebook$Sav3+Facebook$Sav5+
  Facebook$Sav7+Facebook$Sav9+Facebook$Sav11+Facebook$Sav13+Facebook$Sav15+
  Facebook$Sav17+Facebook$Sav19+Facebook$Sav21+Facebook$Sav23
Facebook$SavPos<-Facebook$Sav2R+Facebook$Sav4R+Facebook$Sav6R+
  Facebook$Sav8R+Facebook$Sav10R+Facebook$Sav12R+Facebook$Sav14R+
  Facebook$Sav16R+Facebook$Sav18R+Facebook$Sav20R+Facebook$Sav22R+
  Facebook$Sav24R
Facebook$SavNeg<-Facebook$Sav1+Facebook$Sav3+Facebook$Sav5+
  Facebook$Sav7+Facebook$Sav9+Facebook$Sav11+Facebook$Sav13+Facebook$Sav15+
  Facebook$Sav17+Facebook$Sav19+Facebook$Sav21+Facebook$Sav23
Facebook$Anticipating<-Facebook$Sav1+Facebook$Sav4R+Facebook$Sav7+
  Facebook$Sav10R+Facebook$Sav13+Facebook$Sav16R+Facebook$Sav19+Facebook$Sav22R
Facebook$Moment<-Facebook$Sav2R+Facebook$Sav5+Facebook$Sav8R+
  Facebook$Sav11+Facebook$Sav14R+Facebook$Sav17+Facebook$Sav20R+Facebook$Sav23
Facebook$Reminiscing<-Facebook$Sav3+Facebook$Sav6R+Facebook$Sav9+
  Facebook$Sav12R+Facebook$Sav15+Facebook$Sav18R+Facebook$Sav21+Facebook$Sav24R
Facebook$SWLS<-rowSums(SatwithLife)
Facebook$CritR<-reverse(7,1,Facebook$Critical)
Facebook$AnxR<-reverse(7,1,Facebook$Anxious)
Facebook$ResR<-reverse(7,1,Facebook$Reserved)
Facebook$DisR<-reverse(7,1,Facebook$Disorganized)
Facebook$ConvR<-reverse(7,1,Facebook$Conventional)
Facebook$Extraversion<-(Facebook$Extraverted+Facebook$ResR)/2
Facebook$Agree<-(Facebook$CritR+Facebook$Sympathetic)/2
Facebook$Consc<-(Facebook$Dependable+Facebook$DisR)/2
Facebook$EmoSt<-(Facebook$AnxR+Facebook$Calm)/2
Facebook$Openness<-(Facebook$NewExperiences+Facebook$ConvR)/2
Facebook$Health<-rowSums(CHIPS)
Facebook$Dep4R<-reverse(3,0,Facebook$Dep4)
Facebook$Dep8R<-reverse(3,0,Facebook$Dep8)
Facebook$Dep12R<-reverse(3,0,Facebook$Dep12)
Facebook$CESD<-Facebook$Dep1+Facebook$Dep2+Facebook$Dep3+Facebook$Dep4R+Facebook$Dep5+
  Facebook$Dep6+Facebook$Dep7+Facebook$Dep8R+Facebook$Dep9+Facebook$Dep10+Facebook$Dep11+
  Facebook$Dep12R+Facebook$Dep13+Facebook$Dep14+Facebook$Dep15+Facebook$Dep16
library(dataMaid)
## Loading required package: ggplot2

To generate an overview report, we'll use the makeDataReport function, which gives you a great overview of your dataset, with summary statistics, some analysis to assist with data cleaning, and customization options. And once the report is in R Markdown format, you can very easily add codebook information to it.

The makeDataReport function has many arguments, which you can read all about here. Today, we'll focus on some of the key arguments in the function. The very first argument is the dataset you want to perform the function on, in this case Facebook. You could run the function with only this argument if you wanted, and this might be fine if you have a small dataset and want to take a quick look.
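
In its simplest form - just for a quick look - that's a one-liner; with no other arguments, dataMaid picks the output format and renders the report using the defaults described below:

makeDataReport(Facebook)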

Next, define what kind of file you want the function to output: "pdf", "word", or "html". The default, NULL, picks the output format based on three checks: (1) Is there a LaTeX installation available? If yes, PDF. If not, (2) is the computer running Windows? If yes, Word. (3) If both checks come up no, HTML. After that is render, which tells R to render the output file you selected and save it. If you plan on making changes to the R Markdown file after it is created, set this to FALSE; the .Rmd file will automatically open for you to edit and will be saved in your working directory.

You can also specify a file name, and/or a volume number (for updated reports) with vol=#. If you've already generated a report and try to create a new one without specifying a new file name, you'll get an error; you can override that and force dataMaid to save over the existing file with replace=TRUE.
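
For instance, if a report_Facebook.Rmd already existed from an earlier run, a sketch of the overwriting call would be:

# replace=TRUE overwrites an existing report file of the same name
makeDataReport(Facebook, output="html", render=FALSE,
               file="report_Facebook.Rmd", replace=TRUE)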

Putting some of these arguments together, we can generate a report that includes all variables in the dataset, with default summaries and visualizations.

makeDataReport(Facebook, output="html", render=FALSE, file="report_Facebook.Rmd")

You can take a look at the example report I generated here. In fact, by running this report, I noticed some problems with the simulated dataset. Instead of using R to generate it for me, I used Winsteps (the Rasch program I use). I didn't notice until now that some of the items have values out of range for the rating scale used, such as Rumination item scores greater than 3. Thanks to this report, I identified (and fixed) these problems, then updated the simulated dataset available here. (This is the same link used in previous posts; regardless of where you're clicking from, it will take you to the updated file.) I also noticed that I miscoded gender when creating that factor, switching Male and Female. So I can go back up to that code and fix that as well.

But a report with every single variable might be more than I actually want. I may want to include only certain variables, such as those that appear problematic based on the built-in cleaning analysis dataMaid does. I can set that with onlyProblematic=TRUE:

makeDataReport(Facebook, output="html", render=FALSE, file="report_Facebook_sum.Rmd",
               onlyProblematic = TRUE)

You can check out that report here.

I used render=FALSE throughout so I could make changes to the report before knitting it to an HTML document. Since this opens an R Markdown file, I could add whatever information I wanted to the report, just by clicking to that point in the document and adding it. For instance, I could add a note on the reverse function I used to reverse-code variables. I could include the code for how I defined the gender factor - just to make certain I don't do it wrong again! You can include code that is displayed but not run by wrapping it in backticks (for inline code) or by setting eval=FALSE on a code chunk. I generated one last data report, using the cleaned data, with the gender factor correctly coded, and added some codebook details. You can take a look at that document here.
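
As a sketch of the eval=FALSE approach, a display-only chunk documenting the reverse function might look like this in the .Rmd file (the chunk label is arbitrary):

```{r reverse-function, eval=FALSE}
# Reverse-scores an item: y = (max + min) - x
reverse<-function(max,min,x) {
  y<-(max+min)-x
  return(y)
}
```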

Monday, April 16, 2018

N is for N (Sample Size) Estimation: Power Analysis in R

We're pushing forward in the Blogging A to Z Challenge! Today, I'll talk about conducting power analyses in R with the pwr package. It's amazing how many studies are designed and conducted without the aid of a power analysis. Once I learned how to do them in grad school, I couldn't see any other way to design a study - how else do you know how many people you need? Conducting a study is no guarantee you'll find a significant effect, but if there is one to be found, you want to make sure you're maximizing the chance that you'll find it. Otherwise, you might go to a lot of time and effort to do a study and then have nothing to show for it.

To estimate sample size, you need a few pieces of information, one of which is the effect size you're hoping to detect. Later this month, I'll show how to conduct a meta-analysis in R, which is really useful to do on a small scale to generate this effect size and help inform a power analysis. I'll talk about that in those future posts, and I'm hoping to write an article for a methods journal going into detail about how to do a mini meta-analysis, using the one I did for my dissertation as a working example.

The other pieces of information you need, besides effect size, are:
  • The statistical analysis you plan to use.
  • Power - 1 minus beta; the convention is 0.8, or an 80% chance that, if there is a true effect, you'll detect it, though 0.9 is sometimes used.
  • Significance level - what alpha you'll be using for your significance tests; 0.05 is the convention.
The pwr package allows you to conduct power analyses for the following statistical analyses:
  • One-Sample Proportion: comparing a proportion to a population value
  • Two-Sample Proportion: comparing proportions from two samples with either equal or unequal n's
  • One-Sample T-Test: comparing a sample mean to a population mean
  • Paired-Sample T-Test: comparing means at two time points in one sample or from two matched samples
  • Two-Sample T-Test: comparing means from two samples with either equal or unequal n's
  • Chi-Square Test
  • One-Way ANOVA: comparing means from three or more samples
  • General Linear Model Test: including regressions and Factorial ANOVAs
  • Correlation
Each power analysis function requires 3 of the 4 pieces of information: effect size, sample size, significance level, and power. To estimate sample size, leave that piece of information out of the function, or set its value as NULL. But you could also use these functions to conduct a post-hoc power analysis - what was your power given the effect size, significance level, and sample size you used? To do that, you would just leave out power.
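
For example, a post-hoc power check on the fake caffeine study described below - 30 people per group and the effect size we're about to compute - would be a sketch like this:

library(pwr)
# Post-hoc power: sample size and effect size supplied, power left out
pwr.t.test(n=30, d=0.624, sig.level=0.05,
           type="two.sample", alternative="two.sided")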

First, let's do a very simple power analysis. In last year's Blogging A to Z, I used an ongoing example of a (fictional) study on the impact of caffeine on test performance, and I demonstrated how to conduct a t-test using some data I generated. Descriptive statistics for the two groups showed:

Experimental group: M = 83.2, SD = 6.21
Control group: M = 79.3, SD = 6.40

I can compute an effect size - for two means, the effect size you'd use is Cohen's d.

caffeine<-read.delim("caffeine_study.txt", header=TRUE)
caffeine$group<-factor(caffeine$group, labels=c("Control","Experimental"))
library(effsize)
## Warning: package 'effsize' was built under R version 3.4.4
cohen.d(caffeine$score, caffeine$group, pooled=TRUE, conf.level=0.95)
## 
## Cohen's d
## 
## d estimate: 0.62398 (medium)
## 95 percent confidence interval:
##        inf        sup 
## 0.09471117 1.15324889
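
As a rough cross-check, you can get nearly the same value by hand from the group means and standard deviations above - the mean difference divided by the pooled standard deviation comes out to about 0.62 (the small discrepancy from 0.624 reflects rounding and the exact pooling cohen.d uses):

# Approximate Cohen's d from the summary statistics: mean difference / pooled SD
(83.2 - 79.3) / sqrt((6.21^2 + 6.40^2) / 2)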

My effect size is about 0.624. Let's say you want to replicate my study, but rather than just picking a random number of people to have in each group, you want to determine how many you need to detect an effect of that size. For that, we'll use the pwr package.

library(pwr)
## Warning: package 'pwr' was built under R version 3.4.4
pwr.t.test(d=0.624, sig.level=0.05, power=0.8,
           type = "two.sample", alternative = "two.sided")
## 
##      Two-sample t test power calculation 
## 
##               n = 41.29776
##               d = 0.624
##       sig.level = 0.05
##           power = 0.8
##     alternative = two.sided
## 
## NOTE: n is number in *each* group

To detect an effect of that size, you want about 41 people per group, if you're conducting a two-sided test. If you instead want one-sided, you'd specify whether you expect your comparison group to be "less" or "greater".

pwr.t.test(d=0.624, sig.level=0.05, power=0.8,
           type = "two.sample", alternative = "greater")
## 
##      Two-sample t test power calculation 
## 
##               n = 32.45438
##               d = 0.624
##       sig.level = 0.05
##           power = 0.8
##     alternative = greater
## 
## NOTE: n is number in *each* group

So if you predict ahead of time that the experimental group will have a higher score, and only want to look for a significant difference in that direction, you need about 32 people per group. This tells me that the fake study I designed (and just randomly picked sample sizes for) was slightly underpowered, since I only had 30 people in each group. (Not to mention these are fake data I set up to have a very large effect. Effects as large as d = 0.624 aren't all that common in the wild.)

What if you wanted to do something a bit fancier, like estimate necessary sample size for a range of potential effect sizes? This next example is based on a program I recently wrote for a problem at work. (Values and some details have been changed.) A state board requires school programs to have a pass rate on a key exam of 80%. If a program misses that pass rate one year, they go on probation. If they miss it for 3 consecutive years, they are suspended and can no longer be considered a state board-approved school. The board has called up a statistician (hi) to see if a handful of school programs have a pass rate significantly lower than this 80% cutoff. The problem is, some of these school programs are small, and as a result, their pass rates may vary a lot from year to year, just by luck of the draw.

In addition to sharing a report of the study results, the statistician wants to include a power analysis, to show how many cases are necessary before she can reliably say a pass rate is too low. She wants to show necessary sample for a range of school program pass rates (say between 50% and 79%), and for two levels of power, 0.8 (convention) and 0.9. Here's how we could go about conducting that power analysis.

First, we need to generate our effect sizes. This is a one-sample proportion, where we're comparing a range of school pass rates (50-79%) to a population value (80%). In the pwr package, the effect size for this proportion is called h. You can access it with the pwr function, ES.h.

p1s <- seq(.5,.79,.01)          # candidate school pass rates: 50% to 79%
h <- ES.h(p1 = p1s, p2 = 0.80)  # Cohen's h for each pass rate vs. the 80% cutoff
nh <- length(h)
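
If you're curious what h actually is, it's Cohen's h - the difference between arcsine-transformed proportions - which you can verify by hand:

# Cohen's h: 2*asin(sqrt(p1)) - 2*asin(sqrt(p2))
h_check <- 2*asin(sqrt(p1s)) - 2*asin(sqrt(0.80))
all.equal(h_check, h)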

Next, we'll create a vector of our two power levels:

p <- c(.8,.9)
np <- length(p)

You'll notice I also requested the length of these lists. That becomes important now when I tell R to combine these two to start generating my target sample sizes, using a for loop.

# Rows = effect sizes (one per pass rate), columns = power levels
sizes <- array(numeric(nh*np), dim=c(nh,np))
for (i in 1:np){
  for (j in 1:nh){
    # One-sample proportion test: is the pass rate significantly below 80%?
    pow_an <- pwr.p.test(n = NULL, h = h[j],
                         sig.level = .05, power = p[i],
                         alternative = "less")
    # Round up to the next whole person
    sizes[j,i] <- ceiling(pow_an$n)
  }
}

Now I have data in two columns, target sample sizes for each value of h as my rows, with power = 0.8 in one column and power = 0.9 in the other. I'll rename my columns, so I can easily see which is which, and I'll add in the school pass rates I used, so that I can plot my results with that information on the x-axis.

samp <- data.frame(cbind(p1s,sizes))
colnames(samp)<-c("Pass_Rate","Power.8", "Power.9")
summary(samp)
##    Pass_Rate         Power.8           Power.9        
##  Min.   :0.5000   Min.   :   15.0   Min.   :   21.00  
##  1st Qu.:0.5725   1st Qu.:   25.5   1st Qu.:   34.75  
##  Median :0.6450   Median :   51.0   Median :   71.00  
##  Mean   :0.6450   Mean   :  554.2   Mean   :  767.70  
##  3rd Qu.:0.7175   3rd Qu.:  167.2   3rd Qu.:  231.00  
##  Max.   :0.7900   Max.   :10075.0   Max.   :13956.00

For power of 0.8, necessary sample size ranges from 15 to over 10,000. For power of 0.9, the range is 21 to almost 14,000. Now I can start plotting my results from my samp data frame. If I only wanted to show results for one power level (0.8):

library(ggplot2)
passrates<-ggplot(samp, aes(x=p1s, y = Power.8)) +
  geom_point() +
  geom_line()
passrates + scale_y_continuous(breaks=seq(0,11000,1000)) +
  scale_x_continuous(labels = scales::percent) +
  labs(title = "Comparing Pass Rates Below 80%",
       x = "Pass Rate",
       y = "N Needed to Detect Significant Difference") +
       theme(plot.title=element_text(hjust=0.5))


If I wanted to plot both:

bothpowers<-ggplot(samp, aes(p1s)) +
  geom_line(aes(y=Power.8, colour="0.8")) +
  geom_line(aes(y=Power.9, colour="0.9")) +
  scale_y_continuous(breaks=seq(0,14000,1000)) +
  scale_x_continuous(labels = scales::percent) +
  labs(title = "Comparing Pass Rates Below 80%",
       x = "Pass Rate", y = "N Needed to Detect Significant Difference", colour="Power") +
  theme(plot.title=element_text(hjust=0.5))
bothpowers


It doesn't appear that bumping up to 0.9 power makes a huge difference in terms of sample size, except for the smallest comparison - 79% vs. 80%. But either way, the recommendation I would make to the state board is that pass rates should be 70% or less before we can say a program is reliably falling below the board's cutoff. This is especially true for smaller programs. Instead, for programs close to the cutoff, some form of remediation may be a better strategy. If they consistently fall below 70%, then drastic action could be taken.

You can learn more about the pwr package in its manual.