Thursday, April 19, 2018

Q is for qplot

You may have noticed that I frequently use the ggplot2 package and the ggplot function to produce graphics for my posts. ggplot2, which is part of the so-called tidyverse, gets the "gg" in its name from the "grammar of graphics": it uses a standard set of functions and arguments to produce any number of graphics, and you change a graphic's appearance by applying different settings. The nice thing about this type of syntax is that once you learn it for one type of graphic - say, a histogram - it's very easy to expand out to other types of graphics - like scatterplots - without having to learn brand new functions. ggplot is a great way to create high-quality, publication-ready graphics.
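To make that concrete, here's a minimal sketch of my own (using R's built-in mtcars data, not anything from this post) showing how the same grammar covers both chart types - only the geom changes:

library(ggplot2)
# same data-plus-aesthetics grammar, two different geoms
ggplot(mtcars, aes(x = mpg)) + geom_histogram(bins = 10)
ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()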

But sometimes you don't need high-quality, publication-ready graphics. Sometimes you just need a quick look at the data, and you don't care whether you have axis labels or centered titles. You just need to make certain there isn't anything wonky about your data as you clean and/or analyze. Fortunately, ggplot2 has a great function for that: qplot (or quick plot).

As with ggplot, qplot has a standard function and set of arguments, so once you learn to use it for one type of graphic, you can easily expand to others. And qplot has some smart defaults built in for two of the most frequently used charts (particularly for quick looks at the data): histograms and scatterplots. Why are these the most frequently used, especially in cleaning and the early stages of analysis? A histogram lets you see whether your variable is approximately normal; this matters because many statistical tests (including most of those you'd learn in an introductory statistics course) are built on the assumption that data are normally distributed. A scatterplot lets you see whether your variables are related to each other, and whether that relationship is linear; once again, many statistical tests are built on assumptions about linear relationships between variables. So it makes sense that, if you're taking a quick look, you'll probably be using one of these two graphics.

The default graphics are very easy to produce: if you give only an x variable, you'll get a histogram, and if you give both x and y, you'll get a scatterplot. I'll use the Facebook data once again to demonstrate. I also went ahead and scored the RRS and SBI (described below) here - you can find code for scoring all measures here.

Facebook<-read.delim(file="small_facebook_set.txt", header=TRUE)
# RRS total: sum the 22 rumination items in columns 3 through 24
Facebook$RRS<-rowSums(Facebook[,3:24])
# reverse-score an item: y = (max + min) - x
reverse<-function(max,min,x) {
  y<-(max+min)-x
  return(y)
}
# reverse the even-numbered savoring items (scored 1 to 7)
Facebook$Sav2R<-reverse(7,1,Facebook$Sav2)
Facebook$Sav4R<-reverse(7,1,Facebook$Sav4)
Facebook$Sav6R<-reverse(7,1,Facebook$Sav6)
Facebook$Sav8R<-reverse(7,1,Facebook$Sav8)
Facebook$Sav10R<-reverse(7,1,Facebook$Sav10)
Facebook$Sav12R<-reverse(7,1,Facebook$Sav12)
Facebook$Sav14R<-reverse(7,1,Facebook$Sav14)
Facebook$Sav16R<-reverse(7,1,Facebook$Sav16)
Facebook$Sav18R<-reverse(7,1,Facebook$Sav18)
Facebook$Sav20R<-reverse(7,1,Facebook$Sav20)
Facebook$Sav22R<-reverse(7,1,Facebook$Sav22)
Facebook$Sav24R<-reverse(7,1,Facebook$Sav24)
# SBI total: reversed even items plus the original odd items
Facebook$SBI<-Facebook$Sav2R+Facebook$Sav4R+Facebook$Sav6R+
  Facebook$Sav8R+Facebook$Sav10R+Facebook$Sav12R+Facebook$Sav14R+
  Facebook$Sav16R+Facebook$Sav18R+Facebook$Sav20R+Facebook$Sav22R+
  Facebook$Sav24R+Facebook$Sav1+Facebook$Sav3+Facebook$Sav5+
  Facebook$Sav7+Facebook$Sav9+Facebook$Sav11+Facebook$Sav13+Facebook$Sav15+
  Facebook$Sav17+Facebook$Sav19+Facebook$Sav21+Facebook$Sav23
library(ggplot2)
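As an aside, the reverse-scoring above can be written more compactly. This is just a sketch of an equivalent approach, assuming the savoring items really are named Sav1 through Sav24 as in the code above:

# reverse the even-numbered items in one step: (7+1) - x
even_items <- paste0("Sav", seq(2, 24, 2))
Facebook[paste0(even_items, "R")] <- 8 - Facebook[even_items]
# SBI total: reversed even items plus the original odd items
odd_items <- paste0("Sav", seq(1, 23, 2))
Facebook$SBI <- rowSums(Facebook[c(paste0(even_items, "R"), odd_items)])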

I'll use a scale I haven't really used in this series - the Savoring Beliefs Inventory. This measure was created by Fred Bryant, who was my faculty sponsor for this research (since I was still a grad student at the time). Fred also taught me structural equation modeling. The measure assesses a concept Fred calls savoring - attending to and holding onto positive events and feelings to prolong the joy and pleasure they bring. I chose to include this measure because, as I mentioned to Fred, I felt savoring was the opposite of rumination. (While he thought I'd made a good point, he told me he thought of savoring as the opposite of coping, which makes sense.)

Using the qplot function, we can quickly generate a histogram of the total SBI score.

qplot(SBI, data=Facebook)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

This variable shows a negative skew: there is a long, thin tail at the low end, the peak of the distribution sits to the right of center, and the high end drops off quickly, with more cases bunched there than we'd expect if this followed the normal distribution. We're also getting a message about bins. Right now, the histogram is slicing up the values between the minimum and the maximum into 30 bars. We can reduce this number to smooth out the distribution.

qplot(SBI, data=Facebook, bins=15)
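The message earlier also suggested binwidth as an alternative to bins; here's a quick sketch, where the width of 5 points is just an arbitrary pick on my part:

qplot(SBI, data=Facebook, binwidth=5)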

This makes the shape of the distribution clearer - we have a definite negative skew.

Now, regardless of whether the psychological opposite of savoring is coping or rumination, Fred and I both agreed that savoring and rumination would be negatively correlated. We can quickly demonstrate this, thanks to qplot with two variables.

qplot(SBI, RRS, data=Facebook)
cor(Facebook$SBI, Facebook$RRS)
## [1] -0.3510101

I also requested the correlation coefficient for these two variables: -0.35, a moderate negative correlation.
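As an aside (not something I ran in the original analysis), cor.test would also give you a significance test and confidence interval for that correlation:

cor.test(Facebook$SBI, Facebook$RRS)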

There are many other things you can do with qplot. First, you could generate separate graphics for groups, using facets.

Facebook$gender<-factor(Facebook$gender, labels=c("Male","Female"))
qplot(SBI, RRS, data=Facebook, facets=~gender)
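As a small sketch of a variant, flipping the facet formula stacks the panels vertically instead of side by side:

qplot(SBI, RRS, data=Facebook, facets=gender~.)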

Alternatively, we could display the points for men and women in different colors on a single chart.

qplot(SBI, RRS, data=Facebook, colour=gender)
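And if color won't survive a black-and-white printout, mapping gender to point shape is one alternative - a quick sketch:

qplot(SBI, RRS, data=Facebook, shape=gender)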

You can also change the type of graphic with the geom argument.

qplot(gender, data=Facebook, geom="bar")
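The geom argument covers lots of other quick looks, too - for instance, a sketch of side-by-side boxplots of rumination scores by gender:

qplot(gender, RRS, data=Facebook, geom="boxplot")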

There are other things you can do, such as manually set the limits for the x- or y-axes (with xlim=c(min,max) or ylim=c(min,max)) - see the sketch just below - or log transform one or both variables.
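For the axis limits, here's a minimal sketch; the ranges are the theoretical score ranges (24-168 for the SBI and 22-88 for the RRS), which is an assumption on my part rather than anything checked in this post:

qplot(SBI, RRS, data=Facebook, xlim=c(24,168), ylim=c(22,88))

To demonstrate the log transform, I'll bring back my power analysis dataset - projected sample sizes for proportion comparisons with power of 0.8 or 0.9 - and log-transform the sample size variable.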

library(pwr)
p1s <- seq(.5,.79,.01)            # candidate pass rates: 50% to 79%
h <- ES.h(p1 = p1s, p2 = 0.80)    # effect size of each rate vs. the 80% benchmark
nh <- length(h)
p <- c(.8,.9)                     # the two power levels
np <- length(p)
sizes <- array(numeric(nh*np), dim=c(nh,np))
for (i in 1:np){
  for (j in 1:nh){
    pow_an <- pwr.p.test(n = NULL, h = h[j],
                         sig.level = .05, power = p[i],
                         alternative = "less")
    sizes[j,i] <- ceiling(pow_an$n)  # round up to a whole person
  }
}
samp <- data.frame(cbind(p1s,sizes))
colnames(samp)<-c("Pass_Rate","Power.8", "Power.9")
qplot(Pass_Rate, Power.9, data=samp, log="y", ylab="Log-Transformed Sample Size")

As you can see, I can add custom labels - by default, qplot displays variable names, which is fine for a quick look. But I wanted to specifically call out that y was log-transformed so my little brain didn't get confused if I had to walk away and come back.

Tomorrow will be a code-free post giving a short history of R. And back to coding posts Saturday!
