Today, I'll demonstrate how to generate cross-tabs in R (which I also did in the previous post) and I'll show two ways to analyze cross-tabs: chi-square (again) and a similar test, Fisher's exact test.
Once again, I'll use my Facebook dataset to demonstrate. None of the variables in the set I used previously would really qualify for cross-tabs, since the variables in that dataset are meant to be combined into continuous scales. I included gender, but because of the student body makeup at the school where I collected my data, there are a lot of women and very few men - which isn't optimal for this type of analysis. But I can pull in some additional data collected in the demographics portion of the survey. The goal of the study was to examine Facebook usage patterns, so participants reported whether they used Facebook. The vast majority of the sample did. But they also reported use of other social media, including Twitter and LinkedIn.
A bit more than half the sample were freshmen, since the participant pool is drawn from Introductory Psychology, mostly taken by freshmen. The remaining participants were upper-class or non-traditional students. It makes sense that these two groups might have different usage patterns. That is, older students might be more likely to use LinkedIn, as they prepare for job searches and networking; non-traditional students are likely to already have a job or career. It's unclear, however, whether we might see a similar difference for Twitter users - though keep in mind, these data were collected in 2010, when Twitter may have been a different landscape. So let's generate two cross-tabs, both using the freshmen versus upper-class/non-traditional students (or younger versus older, for simplicity), one to look at Twitter use and one to look at LinkedIn use.
First, I'll read in the data, then redefine the variables I need as factors: age2 (recoded from the continuous age variable) and the indicators for using Twitter and LinkedIn. This gives labels to my cross-tabs. To change a variable, I have to name it twice in my code: once before the <-, to say which variable receives the result, and again inside factor(), to say which variable to convert. I use the dataset$variablename syntax to refer to a specific variable:
age_socialmedia<-read.delim(file="age_usage.txt", header=TRUE)
age_socialmedia$age2<-factor(age_socialmedia$age2, labels=c("Younger","Older"))
age_socialmedia$Twitter<-factor(age_socialmedia$Twitter, labels=c("Non-User","User"))
age_socialmedia$LinkedIn<-factor(age_socialmedia$LinkedIn, labels=c("Non-User","User"))
If I had wanted, I could have created a new variable instead by putting a different name before the <-. If I wrote the name of a variable that doesn't exist in the dataset, it would be added. But since I'm not recoding, just adding labels, I have no issue with overwriting the existing variable.
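Here is a minimal sketch of the new-variable alternative. The toy data frame, the 0/1 coding, and the name Twitter_f are all assumptions made up for illustration; they are not taken from the survey file. Spelling out levels as well as labels makes the mapping explicit, because without levels, factor() assigns the labels to the sorted unique values in order.

```r
# Hypothetical toy data: the 0/1 coding is an assumption, not the survey's actual coding
toy <- data.frame(Twitter = c(0, 1, 0, 0, 1))

# Assigning to a name that doesn't exist yet adds a new column
# and leaves the original numeric column untouched
toy$Twitter_f <- factor(toy$Twitter,
                        levels = c(0, 1),
                        labels = c("Non-User", "User"))

# Cross-tabulate old against new to confirm the labels landed where intended
str(toy)
table(toy$Twitter, toy$Twitter_f)
```

Checking the original codes against the new labels like this is a cheap way to catch a reversed label order before running any tests.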
I can generate my tables with the following:
Twitter<-table(age_socialmedia$age2, age_socialmedia$Twitter)
Twitter
##          
##           Non-User User
##   Younger      111   29
##   Older         91   25
LinkedIn<-table(age_socialmedia$age2, age_socialmedia$LinkedIn)
LinkedIn
##          
##           Non-User User
##   Younger      139    1
##   Older        109    7
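Raw counts can be hard to compare when the groups differ in size, so it may help to look at row proportions too. The sketch below rebuilds both tables from the counts printed above (since the data file isn't available here) and uses prop.table() with margin = 1 to get the share of users within each age group. The object names tw_rows and li_rows are just illustrative.

```r
# Rebuild the two cross-tabs from the printed counts (matrix() fills column by column)
Twitter <- as.table(matrix(c(111, 91, 29, 25), nrow = 2,
                           dimnames = list(c("Younger", "Older"),
                                           c("Non-User", "User"))))
LinkedIn <- as.table(matrix(c(139, 109, 1, 7), nrow = 2,
                            dimnames = list(c("Younger", "Older"),
                                            c("Non-User", "User"))))

# margin = 1 divides each cell by its row total, so each row sums to 1
tw_rows <- prop.table(Twitter, margin = 1)
li_rows <- prop.table(LinkedIn, margin = 1)

round(tw_rows, 3)
round(li_rows, 3)
```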
Neither social media site is heavily used in this sample, and LinkedIn use is much too low for chi-square - a common rule of thumb for that test is that no cell should have an expected count less than 5, and with only 8 LinkedIn users in total, both "User" cells fall short of that. Fisher's exact test will work in that situation, though, so we can use chi-square for Twitter and Fisher's exact test for LinkedIn.
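To check the small-cell concern directly, one option is to compute the expected counts, which are (row total x column total) / grand total for each cell. This sketch rebuilds the LinkedIn table from the printed counts and computes them by hand, then compares against the expected element that chisq.test() returns. The warning chisq.test() raises for this table is suppressed here only to keep the output clean.

```r
# Rebuild the LinkedIn cross-tab from the printed counts
LinkedIn <- as.table(matrix(c(139, 109, 1, 7), nrow = 2,
                            dimnames = list(c("Younger", "Older"),
                                            c("Non-User", "User"))))

# Expected count for each cell under independence:
# (row total * column total) / grand total
expected_manual <- outer(rowSums(LinkedIn), colSums(LinkedIn)) / sum(LinkedIn)
expected_manual

# chisq.test() stores the same quantities in its $expected element
expected_r <- suppressWarnings(chisq.test(LinkedIn))$expected

# Flag any cells that fall below the rule-of-thumb threshold of 5
expected_manual < 5
```

For this table the two "User" cells are the ones flagged, which is why Fisher's exact test is the safer choice for LinkedIn.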
The code for either is very easy, especially if you named your tables:
chisq.test(Twitter)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  Twitter
## X-squared = 9.2489e-05, df = 1, p-value = 0.9923
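Yates' continuity correction was developed because the chi-square distribution is continuous while cell counts are discrete, so the uncorrected test tends to be too quick to reach significance, particularly in 2x2 tables with small expected counts. R applies the correction by default for 2x2 tables. We can easily turn this feature off, and many people do: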
chisq.test(Twitter, correct=FALSE)
## 
##  Pearson's Chi-squared test
## 
## data:  Twitter
## X-squared = 0.026729, df = 1, p-value = 0.8701
In this case, the correction made little practical difference - yes, the uncorrected p-value is smaller, but neither is even close to being significant. So we'd conclude that, in this sample, there is no evidence of an age difference in Twitter use.
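To see what the correction actually does, here is a sketch that computes both statistics by hand from the Twitter counts printed earlier. The plain statistic sums (observed - expected)^2 / expected over the four cells; the Yates version subtracts 0.5 from each absolute difference before squaring. That simple form matches R here because every absolute difference in this table exceeds 0.5. The variable names are illustrative.

```r
# Rebuild the Twitter cross-tab from the printed counts
Twitter <- as.table(matrix(c(111, 91, 29, 25), nrow = 2,
                           dimnames = list(c("Younger", "Older"),
                                           c("Non-User", "User"))))

# Expected counts under independence
expected <- outer(rowSums(Twitter), colSums(Twitter)) / sum(Twitter)

# Pearson chi-square without and with the continuity correction
x2_plain <- sum((Twitter - expected)^2 / expected)
x2_yates <- sum((abs(Twitter - expected) - 0.5)^2 / expected)

# A 2x2 table has (2 - 1) * (2 - 1) = 1 degree of freedom
p_plain <- pchisq(x2_plain, df = 1, lower.tail = FALSE)
p_yates <- pchisq(x2_yates, df = 1, lower.tail = FALSE)

c(x2_plain = x2_plain, p_plain = p_plain)
c(x2_yates = x2_yates, p_yates = p_yates)
```

Because the correction shrinks every difference toward zero, the corrected statistic can never be larger than the uncorrected one, so its p-value is never smaller.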
Now let's conduct our Fisher's exact test, which, as I mentioned above, can be used when cell counts are too small for chi-square:
fisher.test(LinkedIn)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  LinkedIn
## p-value = 0.02477
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##    1.112105 404.350276
## sample estimates:
## odds ratio 
##   8.863863 
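This test is significant; LinkedIn users tend to be older in this sample. In fact, the odds of being a LinkedIn user are over 8 times higher for older students than for younger students in the present sample, though with so few LinkedIn users, the confidence interval around that odds ratio is very wide (about 1.1 to 404).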
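For readers curious where these numbers come from, here is a sketch that reproduces the pieces by hand from the printed LinkedIn counts. The simple cross-product odds ratio comes out slightly different from the estimate fisher.test() reports, because fisher.test() uses a conditional maximum likelihood estimate instead. The two-sided p-value is the sum of hypergeometric probabilities for every possible table (with the margins held fixed) that is no more likely than the observed one. Variable names are illustrative, and the small tolerance factor mirrors the kind of floating-point guard used when comparing probabilities.

```r
# Rebuild the LinkedIn cross-tab from the printed counts
LinkedIn <- as.table(matrix(c(139, 109, 1, 7), nrow = 2,
                            dimnames = list(c("Younger", "Older"),
                                            c("Non-User", "User"))))

# Sample (cross-product) odds ratio: odds of use for older vs. younger students
odds_younger <- LinkedIn["Younger", "User"] / LinkedIn["Younger", "Non-User"]
odds_older   <- LinkedIn["Older", "User"] / LinkedIn["Older", "Non-User"]
sample_or    <- odds_older / odds_younger
sample_or

# With margins fixed, the number of younger students among the users
# follows a hypergeometric distribution
n_users   <- sum(LinkedIn[, "User"])
n_younger <- sum(LinkedIn["Younger", ])
n_older   <- sum(LinkedIn["Older", ])
probs     <- dhyper(0:n_users, m = n_younger, n = n_older, k = n_users)
p_obs     <- dhyper(LinkedIn["Younger", "User"],
                    m = n_younger, n = n_older, k = n_users)

# Two-sided p-value: total probability of tables no more likely than the observed one
p_exact <- sum(probs[probs <= p_obs * (1 + 1e-7)])
p_exact
```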