Monday, April 30, 2018

Z is for Z-Scores and Standardizing

Z is for Z-Scores and Standardizing Last April, I wrapped up the A to Z of Statistics with a post about Z-scores. It seems only fitting that I'm wrapping this April A to Z with the same topic. Z-scores are frequently used, sometimes when you don't even realize it. When you take your child to the doctor and they say he's at the x percentile on height, or when you take a standardized test and are told you scored in the y percentile, those values are being derived from Z-scores. You create a Z-score when you subtract the population mean from a value and divide that result by the population standard deviation.

Of course, we often will standardize variables in statistics, and the results are similar to Z-scores (though technically not the same if the mean and standard deviation aren't population values). In fact, when I demonstrated the GLM function earlier this month, I skipped a very important step when conducting an analysis with interactions. I should have standardized my continuous predictors first, which means subtracting the variable mean and dividing by the variable standard deviation, creating a new variable with a mean of 0 and a standard deviation of 1 (just like the normal distribution).

Let's revisit that GLM analysis. I was predicting verdict (guilty, not guilty) with binomial regression. I did one analysis where I used a handful of attitude items and the participant's guilt rating, and a second analysis where I created interactions between each attitude item and the guilt rating. The purpose was to see if an attitude impacts the threshold - how high the guilt rating needed to be before a participant selected "guilty". When working with interactions, the individual variables are highly correlated with the interaction variables based on them, so we solve that problem, and make our analysis and output a bit cleaner, by centering our variables and using those centered values to create interactions.

I'll go ahead and load my data. Also, since I know I have some missing values, which caused an error when I tried to work with predicted values and residuals Saturday, I'll also go ahead and identify that case/those cases.

dissertation<-read.delim("dissertation_data.txt",header=TRUE)
predictors<-c("obguilt","reasdoubt","bettertolet","libertyvorder",
              "jurevidence","guilt")
library(psych)
## Warning: package 'psych' was built under R version 3.4.4
describe(dissertation[predictors])
##               vars   n mean   sd median trimmed  mad min max range  skew
## obguilt          1 356 3.50 0.89      4    3.52 0.00   1   5     4 -0.50
## reasdoubt        2 356 2.59 1.51      2    2.68 1.48  -9   5    14 -3.63
## bettertolet      3 356 2.36 1.50      2    2.38 1.48  -9   5    14 -3.41
## libertyvorder    4 355 2.74 1.31      3    2.77 1.48  -9   5    14 -3.89
## jurevidence      5 356 2.54 1.63      2    2.66 1.48  -9   5    14 -3.76
## guilt            6 356 4.80 1.21      5    4.90 1.48   2   7     5 -0.59
##               kurtosis   se
## obguilt          -0.55 0.05
## reasdoubt        26.92 0.08
## bettertolet      25.47 0.08
## libertyvorder    34.58 0.07
## jurevidence      25.39 0.09
## guilt            -0.54 0.06
dissertation<-subset(dissertation, !is.na(libertyvorder))

R has a built-in function that will do this for you: scale. The scale function has 3 main arguments - the variable or variables to be scaled, and whether you want those variables centered (subtract mean) and/or scaled (divided by standard deviation). For regression with interactions, we want to both center and scale. For instance, to center and scale the guilt rating:

dissertation$guilt_c<-scale(dissertation$guilt, center=TRUE, scale=TRUE)

I have a set of predictors I want to do this to, so I want to apply a function across those specific columns:

dissertation[46:51]<-lapply(dissertation[predictors], function(x) {
  y<-scale(x, center=TRUE, scale=TRUE)
  }
)

Now, let's rerun that binomial regression, this time using the centered variables in the model.

pred_int<-'verdict ~ obguilt.1 + reasdoubt.1 + bettertolet.1 + libertyvorder.1 + 
                  jurevidence.1 + guilt.1 + obguilt.1*guilt.1 + reasdoubt.1*guilt.1 +
                  bettertolet.1*guilt.1 + libertyvorder.1*guilt.1 + jurevidence.1*guilt.1'
model2<-glm(pred_int, family="binomial", data=dissertation)
summary(model2)
## 
## Call:
## glm(formula = pred_int, family = "binomial", data = dissertation)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6101  -0.5432  -0.1289   0.6422   2.2805  
## 
## Coefficients:
##                         Estimate Std. Error z value Pr(>|z|)    
## (Intercept)             -0.47994    0.16264  -2.951  0.00317 ** 
## obguilt.1                0.25161    0.16158   1.557  0.11942    
## reasdoubt.1             -0.09230    0.20037  -0.461  0.64507    
## bettertolet.1           -0.22484    0.20340  -1.105  0.26899    
## libertyvorder.1          0.05825    0.21517   0.271  0.78660    
## jurevidence.1            0.07252    0.19376   0.374  0.70819    
## guilt.1                  2.31003    0.26867   8.598  < 2e-16 ***
## obguilt.1:guilt.1        0.14058    0.23411   0.600  0.54818    
## reasdoubt.1:guilt.1     -0.61724    0.29693  -2.079  0.03764 *  
## bettertolet.1:guilt.1    0.02579    0.30123   0.086  0.93178    
## libertyvorder.1:guilt.1 -0.27492    0.29355  -0.937  0.34899    
## jurevidence.1:guilt.1    0.27601    0.36181   0.763  0.44555    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 490.08  on 354  degrees of freedom
## Residual deviance: 300.66  on 343  degrees of freedom
## AIC: 324.66
## 
## Number of Fisher Scoring iterations: 6

The results are essentially the same; the constant and slopes of the predictors variables are different but the variables that were significant before still are. So standardizing doesn't change the results, but it is generally recommended because it makes results easier to interpret. The variables are centered around the mean, so negative numbers are below the mean, and positive numbers are above the mean.

Hard to believe A to Z is over! Don't worry, I'm going to keep blogging about statistics, R, and whatever strikes my fancy. I almost kept this post going by applying the prediction work from yesterday to the binomial model, but decided that would make for a fun future post. And I'll probably sprinkle in posts in the near future on topics I didn't have room for this month or that I promised to write a future post on. Thanks for reading and hope you keep stopping by, even though April A to Z is officially over!

No comments:

Post a Comment