Of course, we often will standardize variables in statistics, and the results are similar to Z-scores (though technically not the same if the mean and standard deviation aren't population values). In fact, when I demonstrated the GLM function earlier this month, I skipped a very important step when conducting an analysis with interactions. I should have standardized my continuous predictors first, which means subtracting the variable mean and dividing by the variable standard deviation, creating a new variable with a mean of 0 and a standard deviation of 1 (just like the normal distribution).
Let's revisit that GLM analysis. I was predicting verdict (guilty, not guilty) with binomial regression. I did one analysis where I used a handful of attitude items and the participant's guilt rating, and a second analysis where I created interactions between each attitude item and the guilt rating. The purpose was to see if an attitude impacts the threshold - how high the guilt rating needed to be before a participant selected "guilty". When working with interactions, the individual variables are highly correlated with the interaction variables based on them, so we solve that problem, and make our analysis and output a bit cleaner, by centering our variables and using those centered values to create interactions.
I'll go ahead and load my data. Also, since I know I have some missing values, which caused an error when I tried to work with predicted values and residuals Saturday, I'll also go ahead and identify that case/those cases.
dissertation<-read.delim("dissertation_data.txt",header=TRUE) predictors<-c("obguilt","reasdoubt","bettertolet","libertyvorder", "jurevidence","guilt") library(psych)
## Warning: package 'psych' was built under R version 3.4.4
describe(dissertation[predictors])
## vars n mean sd median trimmed mad min max range skew ## obguilt 1 356 3.50 0.89 4 3.52 0.00 1 5 4 -0.50 ## reasdoubt 2 356 2.59 1.51 2 2.68 1.48 -9 5 14 -3.63 ## bettertolet 3 356 2.36 1.50 2 2.38 1.48 -9 5 14 -3.41 ## libertyvorder 4 355 2.74 1.31 3 2.77 1.48 -9 5 14 -3.89 ## jurevidence 5 356 2.54 1.63 2 2.66 1.48 -9 5 14 -3.76 ## guilt 6 356 4.80 1.21 5 4.90 1.48 2 7 5 -0.59 ## kurtosis se ## obguilt -0.55 0.05 ## reasdoubt 26.92 0.08 ## bettertolet 25.47 0.08 ## libertyvorder 34.58 0.07 ## jurevidence 25.39 0.09 ## guilt -0.54 0.06
dissertation<-subset(dissertation, !is.na(libertyvorder))
R has a built-in function that will do this for you: scale. The scale function has 3 main arguments - the variable or variables to be scaled, and whether you want those variables centered (subtract mean) and/or scaled (divided by standard deviation). For regression with interactions, we want to both center and scale. For instance, to center and scale the guilt rating:
dissertation$guilt_c<-scale(dissertation$guilt, center=TRUE, scale=TRUE)
I have a set of predictors I want to do this to, so I want to apply a function across those specific columns:
dissertation[46:51]<-lapply(dissertation[predictors], function(x) { y<-scale(x, center=TRUE, scale=TRUE) } )
Now, let's rerun that binomial regression, this time using the centered variables in the model.
pred_int<-'verdict ~ obguilt.1 + reasdoubt.1 + bettertolet.1 + libertyvorder.1 + jurevidence.1 + guilt.1 + obguilt.1*guilt.1 + reasdoubt.1*guilt.1 + bettertolet.1*guilt.1 + libertyvorder.1*guilt.1 + jurevidence.1*guilt.1' model2<-glm(pred_int, family="binomial", data=dissertation) summary(model2)
## ## Call: ## glm(formula = pred_int, family = "binomial", data = dissertation) ## ## Deviance Residuals: ## Min 1Q Median 3Q Max ## -2.6101 -0.5432 -0.1289 0.6422 2.2805 ## ## Coefficients: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) -0.47994 0.16264 -2.951 0.00317 ** ## obguilt.1 0.25161 0.16158 1.557 0.11942 ## reasdoubt.1 -0.09230 0.20037 -0.461 0.64507 ## bettertolet.1 -0.22484 0.20340 -1.105 0.26899 ## libertyvorder.1 0.05825 0.21517 0.271 0.78660 ## jurevidence.1 0.07252 0.19376 0.374 0.70819 ## guilt.1 2.31003 0.26867 8.598 < 2e-16 *** ## obguilt.1:guilt.1 0.14058 0.23411 0.600 0.54818 ## reasdoubt.1:guilt.1 -0.61724 0.29693 -2.079 0.03764 * ## bettertolet.1:guilt.1 0.02579 0.30123 0.086 0.93178 ## libertyvorder.1:guilt.1 -0.27492 0.29355 -0.937 0.34899 ## jurevidence.1:guilt.1 0.27601 0.36181 0.763 0.44555 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## (Dispersion parameter for binomial family taken to be 1) ## ## Null deviance: 490.08 on 354 degrees of freedom ## Residual deviance: 300.66 on 343 degrees of freedom ## AIC: 324.66 ## ## Number of Fisher Scoring iterations: 6
The results are essentially the same; the constant and slopes of the predictors variables are different but the variables that were significant before still are. So standardizing doesn't change the results, but it is generally recommended because it makes results easier to interpret. The variables are centered around the mean, so negative numbers are below the mean, and positive numbers are above the mean.
Hard to believe A to Z is over! Don't worry, I'm going to keep blogging about statistics, R, and whatever strikes my fancy. I almost kept this post going by applying the prediction work from yesterday to the binomial model, but decided that would make for a fun future post. And I'll probably sprinkle in posts in the near future on topics I didn't have room for this month or that I promised to write a future post on. Thanks for reading and hope you keep stopping by, even though April A to Z is officially over!
No comments:
Post a Comment