Monday, April 2, 2018

B is for Betas (Standardized Regression Coefficients)

Welcome to Day 2 of Blogging A to Z! As with yesterday, the title of today's post is very similar to Day 2 of last year's Blogging A to Z, but once again, it covers a different concept.

Regression is used to predict scores on one variable from one or more predictor variables. As I mentioned previously, regression is similar to correlation, which describes (numerically) the strength of the relationship between two variables. But the goal of regression is not just to describe a relationship but to create an equation that lets you predict scores - specifically, to see whether information from the x variable(s) can be used to generate close approximations of the y variable. It might not be used for prediction in the sense of forecasting future events, though that is one way regression results can be used. The conversion of x variable(s) into y is accomplished with regression coefficients - values that are multiplied by the value of x to generate a predicted y, which is hopefully very similar to the observed y. A constant is also included in the regression equation to shift the scale - this constant is the predicted value of y when x equals 0. When people talk about regression, they usually mean linear regression, which is what I'll focus on today. But there are other types of regression for nonlinear relationships between y and x(s).
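To make this concrete, here's a minimal sketch with simulated data (the variables and numbers here are just illustrative, not from any real study):

```r
# simulate a predictor and an outcome with a known linear relationship
set.seed(123)
x <- rnorm(50)
y <- 5 + 2 * x + rnorm(50)  # true constant = 5, true coefficient = 2

fit <- lm(y ~ x)
coef(fit)           # estimated constant (intercept) and regression coefficient
head(predict(fit))  # predicted y values, hopefully close to the observed y
```

With only random noise added, the estimated constant and coefficient land close to the true values of 5 and 2.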

In a previous post I predicted my rating of books I read last year with a linear regression that included book length, genre, author gender, and how long it took to read the book in days. In that analysis, I found that book length and fantasy genre predicted higher ratings and YA fiction predicted lower ratings. The other variables were not significant.

Going back to the Facebook file I used yesterday, I have many scales I could use in a linear regression. In that study, among other scales, participants reported rumination on the Ruminative Response Scale (RRS) and depression on the Center for Epidemiologic Studies Depression Scale (CES-D). The relationship between rumination and depression is well-established, and though this is a non-clinical sample, we would expect to find a relationship, such that heightened scores on the RRS should predict higher scores on the CES-D. In fact, here's the scatterplot showing the relationship between RRS and CES-D in this sample; as you can see, it's a positive linear relationship:


We could run a simple linear regression with these two variables. First, we need to generate our scale scores, since at the moment, the file only contains responses to individual items.

In a previous post, I noted that some items are reverse-scored. There are multiple ways I could go about reverse-scoring items and generating scores. Since one of those ways involves a package I plan to discuss more later, for the time being, I'll just write some of my own code to reverse-score. Since I'll be doing that with multiple variables, I'll write a custom function I can reuse.

reverse <- function(max, min, x) {
  # flip a rating around the scale midpoint: e.g., on a 0-3 scale, 0 becomes 3
  y <- (max + min) - x
  return(y)
}
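A quick sanity check on the function, using the CES-D's 0 to 3 response scale (the function is repeated here so the snippet runs on its own):

```r
reverse <- function(max, min, x) (max + min) - x

reverse(3, 0, 0)        # 3: the lowest rating becomes the highest
reverse(3, 0, 3)        # 0: the highest rating becomes the lowest
reverse(3, 0, c(1, 2))  # 2 1: it works on whole columns, too
```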

I can then apply this function to the 3 CES-D items that need to be reverse-scored, providing the maximum and minimum rating values, as well as the value (x) I want reverse-scored, and create a new variable with "R" appended to indicate it is reversed:


Facebook<-read.delim(file="small_facebook_set.txt", header=TRUE)
Facebook$Dep4R<-reverse(3,0,Facebook$Dep4)
Facebook$Dep8R<-reverse(3,0,Facebook$Dep8)
Facebook$Dep12R<-reverse(3,0,Facebook$Dep12)

Now I'll generate my scores. RRS doesn't have any reversed items, so I can just add those columns together; it does, however, have three subscales: Depression-Related Rumination (fixating on one's negative traits or feelings), Brooding (negative thoughts more generally), and Reflection (attempting to understand oneself and one's mood, which could be a positive experience). The CES-D has no subscales.


Facebook$RRS<-rowSums(Facebook[,3:24])
Facebook$RRS_D<-rowSums(Facebook[,c(3,4,5,6,8,10,11,16,19,20,21,24)])
Facebook$RRS_R<-rowSums(Facebook[,c(9,13,14,22,23)])
Facebook$RRS_B<-rowSums(Facebook[,c(7,12,15,17,18)])

Facebook$CESD<-rowSums(Facebook[,c(96,97,98,100,101,102,104,105,106,108,109,110,111,112,
                                   113,114)])

If you don't like scientific notation on your p-values (I don't), be sure to change those options before displaying results; I usually add this code at the beginning of every R session:


options(scipen=999)

I can use the RRS variables in my regression, though I wouldn't want to include the total RRS score in the same regression as the three subscales, since the total is just the sum of the subscales (the predictors would be perfectly collinear). First, let's run a very simple regression with RRS total score and CES-D, which we can do with the lm (for "linear model") function:


RumDep<-lm(CESD~RRS, data=Facebook)
summary(RumDep)
## 
## Call:
## lm(formula = CESD ~ RRS, data = Facebook)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.3020  -3.3885  -0.7835   2.4140  17.2783 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)  8.45024    0.86992   9.714 <0.0000000000000002 ***
## RRS          0.19753    0.02132   9.264 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.107 on 255 degrees of freedom
## Multiple R-squared:  0.2518, Adjusted R-squared:  0.2489 
## F-statistic: 85.82 on 1 and 255 DF,  p-value: < 0.00000000000000022

Our regression results are significant, and this is a fairly strong relationship; our R-squared (the proportion of variance explained) is 0.25. But let's see what happens if we differentiate between the three types of rumination.


RumDep2<-lm(CESD~RRS_D+RRS_R+RRS_B, data=Facebook)
summary(RumDep2)
## 
## Call:
## lm(formula = CESD ~ RRS_D + RRS_R + RRS_B, data = Facebook)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -13.944  -3.308  -0.677   2.572  18.271 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)  8.12981    0.86644   9.383 < 0.0000000000000002 ***
## RRS_D        0.36845    0.06312   5.838         0.0000000162 ***
## RRS_R        0.04613    0.09928   0.465                0.643    
## RRS_B       -0.05401    0.12766  -0.423                0.673    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.045 on 253 degrees of freedom
## Multiple R-squared:  0.2757, Adjusted R-squared:  0.2671 
## F-statistic:  32.1 on 3 and 253 DF,  p-value: < 0.00000000000000022

So the relationship between rumination and depression is driven specifically by self-focused rumination, rather than morose thoughts or simply reflecting on one's feelings. Differentiating between types of rumination also gives us a slightly higher R-squared. You may look at the regression coefficients (in the column called "Estimate") and notice that RRS_D is much larger than RRS_R or RRS_B. But remember, these three subscales have different numbers of items and are therefore on different scales, so I can't compare their coefficients directly. In fact, you may have regression equations containing many predictors on different scales. What if more than one had been significant? How could I compare them?

I can standardize them. Standardized regression coefficients, or betas, are in standard deviation units or Z-scores. This allows you to directly compare your coefficients, because they describe the number of standard deviation units y will change for a 1 standard deviation change in x. To access standardized regression coefficients, we'll need to use another R package called QuantPsyc. Then we use a function in that package called lm.beta to access standardized coefficients for the linear model we ran. (This is why it's always good to name those results with name<- followed by the function; that way we can refer to them again later.)



library(QuantPsyc)
## Warning: package 'QuantPsyc' was built under R version 3.4.4
## Loading required package: boot
## Loading required package: MASS
## 
## Attaching package: 'QuantPsyc'
## The following object is masked from 'package:base':
## 
##     norm
lm.beta(RumDep2)
##       RRS_D       RRS_R       RRS_B 
##  0.53245571  0.03262223 -0.03693183

Because the units are standardized (Z-scores), every variable has a mean of 0, so we don't have a constant in this equation. These results tell us that a 1 standard deviation increase in Depression-Related Rumination predicts a 0.53 standard deviation increase in CES-D score. And by standardizing the coefficients, we see how much stronger the relationship between Depression-Related Rumination and depression is than Reflection and depression or Brooding and depression - 16.3 times and 14.4 times stronger, respectively.
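If you'd rather not install a package, you can also compute betas by hand: a beta is just the unstandardized coefficient multiplied by sd(x)/sd(y), which is equivalent to running lm on Z-scored variables. A quick sketch with simulated data (not the Facebook data):

```r
set.seed(42)
x <- rnorm(100, mean = 50, sd = 10)
y <- 2 + 0.3 * x + rnorm(100)

fit <- lm(y ~ x)

# beta = b * sd(x) / sd(y)
beta_by_hand <- unname(coef(fit)["x"] * sd(x) / sd(y))

# same thing via regression on Z-scored (scale()-d) variables
beta_scaled <- unname(coef(lm(scale(y) ~ scale(x)))[2])

all.equal(beta_by_hand, beta_scaled)  # TRUE
```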

Just for fun, let's see what happens if we add in some additional variables, specifically personality, which was measured by a brief Big-Five measure. Unlike RRS and CES-D, which are scored by summing, the authors of the Ten Item Personality Measure have you average together the two items used to measure each of the 5 traits, after reverse-scoring half of the items. 



Facebook$CritR<-reverse(7,1,Facebook$Critical)
Facebook$AnxR<-reverse(7,1,Facebook$Anxious)
Facebook$ResR<-reverse(7,1,Facebook$Reserved)
Facebook$DisR<-reverse(7,1,Facebook$Disorganized)
Facebook$ConvR<-reverse(7,1,Facebook$Conventional)

Facebook$Extraversion<-(Facebook$Extraverted+Facebook$ResR)/2
Facebook$Agree<-(Facebook$CritR+Facebook$Sympathetic)/2
Facebook$Consc<-(Facebook$Dependable+Facebook$DisR)/2
Facebook$EmoSt<-(Facebook$AnxR+Facebook$Calm)/2
Facebook$Openness<-(Facebook$NewExperiences+Facebook$ConvR)/2

For simplicity, I'll just use total RRS score in this regression.


PersonalityDep<-lm(CESD~RRS+Extraversion+Agree+Consc+EmoSt+Openness, data=Facebook)
summary(PersonalityDep)
## 
## Call:
## lm(formula = CESD ~ RRS + Extraversion + Agree + Consc + EmoSt + 
##     Openness, data = Facebook)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.4740  -3.4643  -0.7196   2.5938  17.7445 
## 
## Coefficients:
##              Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)  12.09086    3.20496   3.773             0.000202 ***
## RRS           0.19269    0.02148   8.972 < 0.0000000000000002 ***
## Extraversion -0.65750    0.43047  -1.527             0.127929    
## Agree         0.25896    0.38931   0.665             0.506557    
## Consc         0.49505    0.38714   1.279             0.202182    
## EmoSt         0.12558    0.39897   0.315             0.753197    
## Openness     -1.03245    0.38435  -2.686             0.007710 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.062 on 250 degrees of freedom
## Multiple R-squared:  0.2793, Adjusted R-squared:  0.262 
## F-statistic: 16.15 on 6 and 250 DF,  p-value: 0.000000000000001076

Other than rumination, only openness to experience significantly predicts depression scores, in this case negatively - people with higher scores on openness to experience have lower depression scores. Fortunately, the 5 personality variables are on the same scale, but they're on a different scale than the RRS. Let's request standardized coefficients so we can compare all of our predictor variables.


lm.beta(PersonalityDep)
##          RRS Extraversion        Agree        Consc        EmoSt 
##   0.48951737  -0.08595408   0.03780856   0.07145380   0.01756582 
##     Openness 
##  -0.14954644

The effects of the personality variables were quite small, particularly compared to rumination: the effect of rumination on depression is 3.3 times stronger than that of the strongest personality trait, openness to experience. When I dropped rumination, only two predictors were significant, extraversion and openness to experience, and those relationships still weren't all that strong. The R-squared was also small - less than 0.05.


PersonalityDep2<-lm(CESD~Extraversion+Agree+Consc+EmoSt+Openness, data=Facebook)
summary(PersonalityDep2)
## 
## Call:
## lm(formula = CESD ~ Extraversion + Agree + Consc + EmoSt + Openness, 
##     data = Facebook)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -14.967  -3.746  -1.208   2.777  19.995 
## 
## Coefficients:
##              Estimate Std. Error t value   Pr(>|t|)    
## (Intercept)   17.2529     3.6179   4.769 0.00000314 ***
## Extraversion  -1.0544     0.4913  -2.146     0.0328 *  
## Agree          0.6601     0.4438   1.487     0.1382    
## Consc          0.7030     0.4434   1.585     0.1141    
## EmoSt          0.4017     0.4564   0.880     0.3796    
## Openness      -1.0318     0.4410  -2.340     0.0201 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.809 on 251 degrees of freedom
## Multiple R-squared:  0.04726, Adjusted R-squared:  0.02828 
## F-statistic:  2.49 on 5 and 251 DF,  p-value: 0.03185

Take that, Google dude.
