Tuesday, April 3, 2018

C is for Cross-Tabs Analysis

Cross-tabs, short for cross-tabulations (or cross tables), are used to display frequencies for combinations of categorical and sometimes ordinal variables. In a previous blog post, I described chi-square, which is frequently used to analyze cross-tabs. As I said in that post, chi-square is a little like an ANOVA in how it examines the relationship between two variables, but it is used to look at an association between two categorical variables, each with two (and sometimes more) levels. Basically, you would use a non-parametric test like chi-square when working with variables that aren't continuous or that violate key assumptions of parametric tests.

Today, I'll demonstrate how to generate cross-tabs in R (which I also did in the previous post) and I'll show two ways to analyze cross-tabs: chi-square (again) and a similar test, Fisher's exact test.

Once again, I'll use my Facebook dataset to demonstrate. None of the variables in the set I used previously would really qualify for cross-tabs, since the variables in that dataset are meant to be combined into continuous scales. I included gender, but because of the student body makeup at the school where I collected my data, there are a lot of women and very few men - which isn't optimal for this type of analysis. But I can pull in some additional data collected in the demographics portion of the survey. The goal of the study was to examine Facebook usage patterns, so participants reported whether they used Facebook. The vast majority of the sample did. But they also reported use of other social media, including Twitter and LinkedIn.

A bit more than half the sample were freshmen, since the participant pool is drawn from Introductory Psychology, mostly taken by freshmen. The remaining participants were upper-class or non-traditional students. It makes sense that these two groups might have different usage patterns. That is, older students might be more likely to use LinkedIn, as they prepare for job searches and networking; non-traditional students are likely to already have a job or career. It's unclear, however, whether we might see a similar difference for Twitter users - though keep in mind, these data were collected in 2010, when Twitter may have been a different landscape. So let's generate two cross-tabs, both using the freshmen versus upper-class/non-traditional students (or younger versus older, for simplicity), one to look at Twitter use and one to look at LinkedIn use.

First, I'll read in that data, then redefine the variables I need as factors, which includes age2 (recoded from the continuous age variable) and indicators for using Twitter and LinkedIn. This gives labels to my cross-tabs. In order to make changes to these variables, I have to refer to these variables in my code, first to reflect that I want to change that variable (the information before the <-) and again when I reference what variable to make a factor. I use the dataset$variablename syntax to refer to a specific variable:

age_socialmedia<-read.delim(file="age_usage.txt", header=TRUE)
age_socialmedia$age2<-factor(age_socialmedia$age2, labels=c("Younger","Older"))
age_socialmedia$Twitter<-factor(age_socialmedia$Twitter, labels=c("Non-User","User"))
age_socialmedia$LinkedIn<-factor(age_socialmedia$LinkedIn, labels=c("Non-User","User"))

If I had wanted, I could have created a new variable in the first part of the code. If I wrote the name of a variable that doesn't exist in the dataset, it would be added. But since I'm not recoding, just adding labels, I have no issue with overwriting the existing variable.
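To make the difference between adding and overwriting a variable concrete, here is a minimal sketch using a small made-up data frame. The name toy and its 0/1 coding are assumptions for illustration only, not the actual survey data:

```r
# Hypothetical toy data; assumes 0 = non-user and 1 = user
toy <- data.frame(Twitter = c(0, 1, 0, 0, 1))

# A name that doesn't exist yet adds a new column; the original stays numeric
toy$Twitter_f <- factor(toy$Twitter, labels = c("Non-User", "User"))

# Reusing the existing name overwrites that column in place
toy$Twitter <- factor(toy$Twitter, labels = c("Non-User", "User"))

# Both columns are now factors with the same labels
str(toy)
```

Labels are assigned in the sorted order of the underlying values, so the first label goes with the lowest code.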

I can generate my tables with the following:

Twitter<-table(age_socialmedia$age2, age_socialmedia$Twitter)
Twitter
##           Non-User User
##   Younger      111   29
##   Older         91   25
LinkedIn<-table(age_socialmedia$age2, age_socialmedia$LinkedIn)
LinkedIn
##           Non-User User
##   Younger      139    1
##   Older        109    7

Use of either social media site is not very high in this sample, and LinkedIn use is much too low for chi-square. A common rule of thumb for that test is that no cell should have an expected count below 5, and with only 8 LinkedIn users in total, both "User" cells fall short. Fisher's exact test will work in that situation, though, so we can use chi-square for Twitter and Fisher's exact test for LinkedIn.
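The small-cell rule is usually stated in terms of expected counts, meaning the counts we would see if age and use were unrelated. As a rough sketch, we can rebuild the LinkedIn table from the counts printed above (rather than from the data file) and compute them directly:

```r
# Rebuild the LinkedIn cross-tab from the counts shown above
LinkedIn <- matrix(c(139, 109, 1, 7), nrow = 2,
                   dimnames = list(c("Younger", "Older"),
                                   c("Non-User", "User")))

# Expected count per cell = row total * column total / grand total
expected <- outer(rowSums(LinkedIn), colSums(LinkedIn)) / sum(LinkedIn)
expected

# chisq.test() stores the same values; for a table like this one it also
# warns that the chi-squared approximation may be incorrect
suppressWarnings(chisq.test(LinkedIn)$expected)
```

Both cells in the "User" column come out below 5, which is the signal to switch to Fisher's exact test.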

The code for either is very easy, especially if you named your tables:

chisq.test(Twitter)
##  Pearson's Chi-squared test with Yates' continuity correction
## data:  Twitter
## X-squared = 9.2489e-05, df = 1, p-value = 0.9923

The Yates' continuity correction was developed because, in 2x2 tables, chi-square is biased toward significance when samples (and therefore expected counts) are small. R applies it by default to 2x2 tables. We can easily turn this feature off, and most people do:

chisq.test(Twitter, correct=FALSE)
##  Pearson's Chi-squared test
## data:  Twitter
## X-squared = 0.026729, df = 1, p-value = 0.8701

In this case, the correction made little difference - yes, the uncorrected p-value is smaller, but neither is even close to being significant. So we'd conclude that, in this sample, there is no evidence of an age difference in Twitter use.
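To see what the correction actually changes, both statistics can be computed by hand. This is a sketch that rebuilds the Twitter table from the counts printed earlier; the names x2_plain and x2_yates are just for illustration:

```r
# Twitter cross-tab rebuilt from the counts shown earlier
Twitter <- matrix(c(111, 91, 29, 25), nrow = 2,
                  dimnames = list(c("Younger", "Older"),
                                  c("Non-User", "User")))

# Expected counts if age group and Twitter use were unrelated
expected <- outer(rowSums(Twitter), colSums(Twitter)) / sum(Twitter)

# Pearson's chi-square: sum of (observed - expected)^2 / expected
x2_plain <- sum((Twitter - expected)^2 / expected)

# Yates' correction subtracts 0.5 from each absolute difference first
x2_yates <- sum((abs(Twitter - expected) - 0.5)^2 / expected)

# p-values from the chi-square distribution with 1 degree of freedom
p_plain <- pchisq(x2_plain, df = 1, lower.tail = FALSE)
p_yates <- pchisq(x2_yates, df = 1, lower.tail = FALSE)

c(plain = x2_plain, yates = x2_yates)
c(plain = p_plain, yates = p_yates)
```

The observed and expected counts differ by only about 0.53 in each cell, so subtracting 0.5 leaves almost nothing, which is why the corrected statistic is so close to zero.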

Now let's conduct our Fisher's exact test, which, as I mentioned above, can be used when cell counts are too small for chi-square:

fisher.test(LinkedIn)
##  Fisher's Exact Test for Count Data
## data:  LinkedIn
## p-value = 0.02477
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##    1.112105 404.350276
## sample estimates:
## odds ratio 
##   8.863863

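This test is significant; LinkedIn users tend to be older in this sample. In fact, the odds of being a LinkedIn user are over 8 times higher for older students than for younger students in the present sample. Keep in mind that the confidence interval is extremely wide, because the estimate rests on only 8 LinkedIn users.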
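The odds ratio can be checked against the raw counts. The sketch below rebuilds the LinkedIn table from the counts printed earlier and computes the simple cross-product odds ratio along with the ratio of proportions; the variable names are just for illustration. Note that fisher.test() reports a conditional maximum likelihood estimate, so its value is close to, but not identical to, the hand-computed one:

```r
# LinkedIn cross-tab rebuilt from the counts shown earlier
LinkedIn <- matrix(c(139, 109, 1, 7), nrow = 2,
                   dimnames = list(c("Younger", "Older"),
                                   c("Non-User", "User")))

# Odds of being a user within each age group (users / non-users)
odds_younger <- LinkedIn["Younger", "User"] / LinkedIn["Younger", "Non-User"]
odds_older   <- LinkedIn["Older", "User"] / LinkedIn["Older", "Non-User"]

# Sample (cross-product) odds ratio: older versus younger
sample_or <- odds_older / odds_younger

# Proportion of users in each group, and the ratio of those proportions
prop_users <- LinkedIn[, "User"] / rowSums(LinkedIn)
risk_ratio <- prop_users["Older"] / prop_users["Younger"]

c(odds_ratio = sample_or, risk_ratio = unname(risk_ratio))

# Conditional maximum likelihood estimate reported by fisher.test()
fisher.test(LinkedIn)$estimate
```

With users this rare, the odds ratio and the ratio of proportions land close together, but they answer slightly different questions, so it is worth saying which one is being reported.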
