## Friday, April 6, 2018

### F is for (Confirmatory) Factor Analysis

Title Back in February and March, I shared three posts on structural equation modeling: one introducing exogenous and endogenous variables, one introducing factor analysis, and one on how factor analysis can be used in psychometric research. Today, I'll demonstrate how to conduct confirmatory factor analysis on measurement (psychometric) data - and look for my next Statistics Sunday post where we dig into understanding and interpreting output from this analysis.

To give a bit of background, I learned structural equation modeling with LISREL. LISREL (LInear Structural RELations) is not only software to conduct SEM; it's also a specific type of notation, developed by Jöreskog, Keesling, and Wiley, that breaks a structural model into matrices. Conducting SEM with LISREL involves typing up a complex control file, defining each matrix by its Greek name from the notation system - Lambda x's and y's to describe factor loadings for exogenous and endogenous variables, respectively; phi to describe the variances and covariances of exogenous latent variables; and so on.

Matrices are an essential part of the math behind statistical analysis, and are also a valuable shorthand to describe complex relationships. When LISREL software was first developed in the 1970s, it made sense that matrices were specified to break down the huge amount of data that would need to be processed for a structural model. Even then, this kind of analysis took time and a lot of computing power.

Today, I'm running on my tiny Macbook Air something that required a mainframe. I don't want to downplay the importance of this matrix approach, because that's still going on on the back end, but when computational power comes cheap and easy, it feels needlessly complicated or, at the very least, outdated to require each matrix be specified to conduct the analysis. And it creates a high intellectual bar for anyone who wants to learn SEM. Is it nice to know that the gamma matrix describes causal paths between exogenous and endogenous variables? Sure. But is it essential to conduct and understand SEM? No, I don't think so.

If you're like me, when you design a structural model to describe your data, you're drawing your variables, some of which are measured directly (observed), some of which are assessed indirectly by combinations of measured variables (latent), and drawing arrows to show how they relate to each other. To use the LISREL approach, I would have to take what I drew and convert it to matrices before I could conduct my analysis. And I would constantly have to double-check which matrix referred to what. Wouldn't it be nice to go straight from model drawing to analysis?

My favorite way to conduct SEM is with an R package called lavaan. Rather than specifying matrices, lavaan has you convert the components of your model into equations, which is a much more direct conversion from a model drawing. A factor is defined followed by the observed variables assessing that factor, as follows:

Factor =~ var1 + var2 + ... + lastvar

It's structured like a linear equation because, at the basic level, that's exactly what it is. You could conduct a factor analysis with 1 or more factors, and factors can be correlated or uncorrelated (also referred to as orthogonal). Hold that thought for now - later on, I'll talk about how orthogonal models can be used for hypothesis testing.

Using my Facebook dataset, I could test many factor models. Three of the measures included in the study assess one topic without subscales, so those could be tested with a single factor model. For simplicity, let's start with one of those. The Diener Satisfaction with Life Scale is a 5-item measure that assesses how satisfied one is with his/her life overall. We would begin by specifying a single factor model. We can name the factor and the overall model whatever we want, but need to use the actual variable names when referring to the observed variables used to assess the factor:

```Facebook<-read.delim(file="small_facebook_set.txt", header=TRUE)
SWL_Model<-'SWL =~ LS1 + LS2 + LS3 + LS4 + LS5'
```

Here's what this model looks like drawn out:

This object we created is then used in lavaan to fit our model. You'll want to load lavaan (install if necessary) before proceeding to the next step:

```install.packages("lavaan")
```
```## Installing package into '\\marge/users\$/slocatelli/My Documents/R/win-library/3.4'
## (as 'lib' is unspecified)
```
```library(lavaan)
```
```## This is lavaan 0.5-23.1097
```
```## lavaan is BETA software! Please report any bugs.
```

Now, let's fit our model and check our output:

```SWL_Fit<-cfa(SWL_Model, data=Facebook)
summary(SWL_Fit)
```
```## lavaan (0.5-23.1097) converged normally after  24 iterations
##
##   Number of observations                           257
##
##   Estimator                                         ML
##   Minimum Function Test Statistic               26.760
##   Degrees of freedom                                 5
##   P-value (Chi-square)                           0.000
##
## Parameter Estimates:
##
##   Information                                 Expected
##   Standard Errors                             Standard
##
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   SWL =~
##     LS1               1.000
##     LS2               0.974    0.071   13.625    0.000
##     LS3               0.969    0.065   14.909    0.000
##     LS4               0.855    0.071   12.038    0.000
##     LS5               0.790    0.085    9.290    0.000
##
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .LS1               0.984    0.111    8.869    0.000
##    .LS2               0.890    0.102    8.742    0.000
##    .LS3               0.493    0.073    6.744    0.000
##    .LS4               1.124    0.115    9.793    0.000
##    .LS5               2.092    0.197   10.641    0.000
##     SWL               1.678    0.229    7.332    0.000
```

Seriously, that's all it takes to conduct SEM with lavaan. We'll dig into interpretation later - for now, note that all observed variables load significantly onto the latent variable SWL - but two things you will want to add when examining the summary output is the standardized solution and fit measures:

```summary(SWL_Fit, standardized=TRUE, fit.measures=TRUE)
```
```## lavaan (0.5-23.1097) converged normally after  24 iterations
##
##   Number of observations                           257
##
##   Estimator                                         ML
##   Minimum Function Test Statistic               26.760
##   Degrees of freedom                                 5
##   P-value (Chi-square)                           0.000
##
## Model test baseline model:
##
##   Minimum Function Test Statistic              635.988
##   Degrees of freedom                                10
##   P-value                                        0.000
##
## User model versus baseline model:
##
##   Comparative Fit Index (CFI)                    0.965
##   Tucker-Lewis Index (TLI)                       0.930
##
## Loglikelihood and Information Criteria:
##
##   Loglikelihood user model (H0)              -2111.647
##   Loglikelihood unrestricted model (H1)      -2098.267
##
##   Number of free parameters                         10
##   Akaike (AIC)                                4243.294
##   Bayesian (BIC)                              4278.785
##   Sample-size adjusted Bayesian (BIC)         4247.082
##
## Root Mean Square Error of Approximation:
##
##   RMSEA                                          0.130
##   90 Percent Confidence Interval          0.084  0.181
##   P-value RMSEA <= 0.05                          0.003
##
## Standardized Root Mean Square Residual:
##
##   SRMR                                           0.040
##
## Parameter Estimates:
##
##   Information                                 Expected
##   Standard Errors                             Standard
##
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   SWL =~
##     LS1               1.000                               1.295    0.794
##     LS2               0.974    0.071   13.625    0.000    1.262    0.801
##     LS3               0.969    0.065   14.909    0.000    1.256    0.873
##     LS4               0.855    0.071   12.038    0.000    1.107    0.722
##     LS5               0.790    0.085    9.290    0.000    1.023    0.578
##
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##    .LS1               0.984    0.111    8.869    0.000    0.984    0.370
##    .LS2               0.890    0.102    8.742    0.000    0.890    0.359
##    .LS3               0.493    0.073    6.744    0.000    0.493    0.238
##    .LS4               1.124    0.115    9.793    0.000    1.124    0.478
##    .LS5               2.092    0.197   10.641    0.000    2.092    0.666
##     SWL               1.678    0.229    7.332    0.000    1.000    1.000
```

Next, let's go through how you could specify a multiple factor model. The Ruminative Response Scale assesses 3 subscales: Depression-Related Rumination, Reflecting, and Brooding. So we can specify a 3 factor model, enclosing all 3 equations, one for each factor, within the quote marks. We'll then fit this model in the same way we did our Satisfaction with Life model above. By default, the 3 factors are allowed to correlate:

```RRS_Model<- '
Depression =~ Rum1 + Rum2 + Rum3 + Rum4 + Rum6 + Rum8 +
Rum9 + Rum14 + Rum17 + Rum18 + Rum19 + Rum22
Reflecting =~ Rum7 + Rum11 + Rum12 + Rum20 + Rum21
Brooding =~ Rum5 + Rum10 + Rum13 + Rum15 + Rum16
'
summary(RRS_Fit)
```
```## lavaan (0.5-23.1097) converged normally after  40 iterations
##
##   Number of observations                           257
##
##   Estimator                                         ML
##   Minimum Function Test Statistic              600.311
##   Degrees of freedom                               206
##   P-value (Chi-square)                           0.000
##
## Parameter Estimates:
##
##   Information                                 Expected
##   Standard Errors                             Standard
##
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   Depression =~
##     Rum1              1.000
##     Rum2              0.867    0.124    6.965    0.000
##     Rum3              0.840    0.124    6.797    0.000
##     Rum4              0.976    0.126    7.732    0.000
##     Rum6              1.167    0.140    8.357    0.000
##     Rum8              1.147    0.141    8.132    0.000
##     Rum9              1.095    0.136    8.061    0.000
##     Rum14             1.191    0.135    8.845    0.000
##     Rum17             1.261    0.141    8.965    0.000
##     Rum18             1.265    0.142    8.887    0.000
##     Rum19             1.216    0.135    8.992    0.000
##     Rum22             1.257    0.142    8.870    0.000
##   Reflecting =~
##     Rum7              1.000
##     Rum11             0.906    0.089   10.138    0.000
##     Rum12             0.549    0.083    6.603    0.000
##     Rum20             1.073    0.090   11.862    0.000
##     Rum21             0.871    0.088    9.929    0.000
##   Brooding =~
##     Rum5              1.000
##     Rum10             1.092    0.133    8.216    0.000
##     Rum13             0.708    0.104    6.823    0.000
##     Rum15             1.230    0.143    8.617    0.000
##     Rum16             1.338    0.145    9.213    0.000
##
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   Depression ~~
##     Reflecting        0.400    0.061    6.577    0.000
##     Brooding          0.373    0.060    6.187    0.000
##   Reflecting ~~
##     Brooding          0.419    0.068    6.203    0.000
##
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .Rum1              0.687    0.063   10.828    0.000
##    .Rum2              0.796    0.072   11.007    0.000
##    .Rum3              0.809    0.073   11.033    0.000
##    .Rum4              0.694    0.064   10.857    0.000
##    .Rum6              0.712    0.067   10.668    0.000
##    .Rum8              0.778    0.072   10.746    0.000
##    .Rum9              0.736    0.068   10.768    0.000
##    .Rum14             0.556    0.053   10.442    0.000
##    .Rum17             0.576    0.056   10.370    0.000
##    .Rum18             0.611    0.059   10.418    0.000
##    .Rum19             0.526    0.051   10.352    0.000
##    .Rum22             0.609    0.058   10.428    0.000
##    .Rum7              0.616    0.067    9.200    0.000
##    .Rum11             0.674    0.069    9.746    0.000
##    .Rum12             0.876    0.080   10.894    0.000
##    .Rum20             0.438    0.056    7.861    0.000
##    .Rum21             0.673    0.068    9.867    0.000
##    .Rum5              0.955    0.090   10.657    0.000
##    .Rum10             0.663    0.065   10.154    0.000
##    .Rum13             0.626    0.058   10.819    0.000
##    .Rum15             0.627    0.064    9.731    0.000
##    .Rum16             0.417    0.050    8.368    0.000
##     Depression        0.360    0.072    4.987    0.000
##     Reflecting        0.708    0.111    6.408    0.000
##     Brooding          0.455    0.096    4.715    0.000
```

If we wanted our 3 factors to be uncorrelated, we would update the RRS_Fit statement as follows:

```RRS_Fit<-cfa(RRS_Model, data=Facebook, orthogonal=TRUE)
```

One more thing, just for fun - instead of having 3 correlated factors, we might instead believe that our 3 factors in turn assess a single, higher-order factor of Rumination. In fact, this type of model makes more sense with this measure. That is, all 22 items can be summed to obtain an overall Rumination score, and combinations of items are summed to get the 3 subscale scores. If we wanted to test this type of model, we would use the following:

```RRS_HO_Model<- '
Depression =~ Rum1 + Rum2 + Rum3 + Rum4 + Rum6 + Rum8 +
Rum9 + Rum14 + Rum17 + Rum18 + Rum19 + Rum22
Reflecting =~ Rum7 + Rum11 + Rum12 + Rum20 + Rum21
Brooding =~ Rum5 + Rum10 + Rum13 + Rum15 + Rum16
Rumination =~ Depression + Reflecting + Brooding
'

summary(RRS_HO_Fit)
```
```## lavaan (0.5-23.1097) converged normally after  33 iterations
##
##   Number of observations                           257
##
##   Estimator                                         ML
##   Minimum Function Test Statistic              600.311
##   Degrees of freedom                               206
##   P-value (Chi-square)                           0.000
##
## Parameter Estimates:
##
##   Information                                 Expected
##   Standard Errors                             Standard
##
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   Depression =~
##     Rum1              1.000
##     Rum2              0.867    0.124    6.965    0.000
##     Rum3              0.840    0.124    6.797    0.000
##     Rum4              0.976    0.126    7.732    0.000
##     Rum6              1.167    0.140    8.357    0.000
##     Rum8              1.147    0.141    8.132    0.000
##     Rum9              1.095    0.136    8.061    0.000
##     Rum14             1.191    0.135    8.845    0.000
##     Rum17             1.261    0.141    8.965    0.000
##     Rum18             1.265    0.142    8.887    0.000
##     Rum19             1.216    0.135    8.992    0.000
##     Rum22             1.257    0.142    8.870    0.000
##   Reflecting =~
##     Rum7              1.000
##     Rum11             0.906    0.089   10.138    0.000
##     Rum12             0.549    0.083    6.603    0.000
##     Rum20             1.073    0.090   11.862    0.000
##     Rum21             0.871    0.088    9.929    0.000
##   Brooding =~
##     Rum5              1.000
##     Rum10             1.092    0.133    8.216    0.000
##     Rum13             0.708    0.104    6.823    0.000
##     Rum15             1.230    0.143    8.617    0.000
##     Rum16             1.338    0.145    9.213    0.000
##   Rumination =~
##     Depression        1.000
##     Reflecting        1.123    0.144    7.780    0.000
##     Brooding          1.047    0.148    7.067    0.000
##
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .Rum1              0.687    0.063   10.828    0.000
##    .Rum2              0.796    0.072   11.007    0.000
##    .Rum3              0.809    0.073   11.033    0.000
##    .Rum4              0.694    0.064   10.857    0.000
##    .Rum6              0.712    0.067   10.668    0.000
##    .Rum8              0.778    0.072   10.746    0.000
##    .Rum9              0.736    0.068   10.768    0.000
##    .Rum14             0.556    0.053   10.442    0.000
##    .Rum17             0.576    0.056   10.370    0.000
##    .Rum18             0.611    0.059   10.418    0.000
##    .Rum19             0.526    0.051   10.352    0.000
##    .Rum22             0.609    0.058   10.428    0.000
##    .Rum7              0.616    0.067    9.200    0.000
##    .Rum11             0.674    0.069    9.746    0.000
##    .Rum12             0.876    0.080   10.893    0.000
##    .Rum20             0.438    0.056    7.861    0.000
##    .Rum21             0.673    0.068    9.867    0.000
##    .Rum5              0.955    0.090   10.657    0.000
##    .Rum10             0.663    0.065   10.154    0.000
##    .Rum13             0.626    0.058   10.819    0.000
##    .Rum15             0.627    0.064    9.731    0.000
##    .Rum16             0.417    0.050    8.368    0.000
##     Depression        0.003    0.017    0.204    0.838
##     Reflecting        0.259    0.051    5.072    0.000
##     Brooding          0.064    0.026    2.446    0.014
##     Rumination        0.356    0.073    4.858    0.000
```

That's all for now! Check back for future posts to better understand output and how to assess and improve your models.