Sunday, April 1, 2018

Statistics Sunday: Cronbach's alpha

When developing a measure, there are two constructs that are very important: reliability and validity.

Validity, in this context, means the measure is measuring the right thing and not something else.

Reliability means the measure is consistent in measuring whatever it's measuring.

Obviously, having one does not automatically guarantee you'll have the other. Some measures that assess transient states can be highly valid, but because the thing they measure isn't consistent, the measures will appear to have lower reliability. And using shoe size to measure cognitive ability will have high reliability, even over time, but very low validity. So reliability refers to the consistency of the values while validity refers to the relationship between the values and the construct of interest.

There are different ways you can measure these two constructs. Today, I'll be focusing on reliability, and a specific measure of reliability: Cronbach's alpha. Look for my Blogging A to Z post today for how to compute Cronbach's alpha in R.

The type of reliability you want to measure affects the data you would collect. One type is test-retest reliability: if a person takes a measure on two separate occasions, how well do their scores match up? You would measure this by correlating scores from the first administration with scores from the second. High test-retest reliability means scores are consistent across time. But, as I mention above, there might be situations where test-retest reliability doesn't make sense - that is, it may not be the type of reliability you're aiming for.
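For example, here's a minimal sketch of that correlation in R, using made-up total scores for ten people (the values are purely illustrative):

  # Hypothetical total scores from two administrations of the same measure
  time1 <- c(24, 31, 28, 35, 22, 30, 27, 33, 29, 26)
  time2 <- c(25, 30, 29, 34, 24, 31, 26, 32, 30, 25)

  # Test-retest reliability is just the correlation between the two administrations
  cor(time1, time2)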

Other types of reliability can be measured with just one administration of a measure; these types of reliability are referred to as internal consistency. The simplest measure of internal consistency is split-half reliability. I literally divide my items in half and correlate scores on one half with scores on the other. I can split my items in a few ways - I could cut off at the halfway point, I could assign even-numbered items to one half and odd-numbered items to the other, or I could split items into halves at random.

Here's what those different kinds of splits - halfway, even-odd, and random - might look like for a 6-item measure:


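In R, the mechanics of those three splits might look something like this. It's just a sketch using simulated responses to a hypothetical 6-item measure - random data to show the process, not anything you'd expect to be reliable:

  set.seed(42)
  # Simulated responses from 100 people to a hypothetical 6-item measure (1-5 scale)
  items <- as.data.frame(matrix(sample(1:5, 100 * 6, replace = TRUE), ncol = 6))
  names(items) <- paste0("item", 1:6)

  # Halfway split: items 1-3 vs. items 4-6
  cor(rowSums(items[, 1:3]), rowSums(items[, 4:6]))

  # Even-odd split: items 1, 3, 5 vs. items 2, 4, 6
  cor(rowSums(items[, c(1, 3, 5)]), rowSums(items[, c(2, 4, 6)]))

  # Random split: draw 3 of the 6 items for one half, the rest go in the other
  random_half <- sample(1:6, 3)
  cor(rowSums(items[, random_half]), rowSums(items[, -random_half]))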
The way I split up my items will affect the overall correlation, though the results should all be similar. But there could still be a fluke, where items that correlate poorly with the rest of the items just happen to be placed on the same half. One way you could correct for that is by running every possible split-half combination, then averaging those correlation results together.

That resulting average correlation of all possible split-halves is Cronbach's alpha.
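In practice, you don't have to literally run every possible split - alpha is usually computed directly from the item variances and the variance of the total score. Here's a rough sketch of that formula in R (the cronbach_alpha function name is my own, and the items data frame is the simulated one from above; dedicated packages will give you much more detail):

  # alpha = (k / (k - 1)) * (1 - sum of item variances / variance of total score)
  cronbach_alpha <- function(df) {
    k <- ncol(df)
    item_vars <- apply(df, 2, var)
    total_var <- var(rowSums(df))
    (k / (k - 1)) * (1 - sum(item_vars) / total_var)
  }

  cronbach_alpha(items)  # close to 0 for the random data above, as you'd expect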

When computing Cronbach's alpha, it's important to make sure items have the same meaning and there are no negative correlations. So if an item measures the reverse of a concept, you'd want to reverse score that item. For instance, say I created a measure of extraversion, where people rate their agreement with statements from 5, Strongly Agree to 1, Strongly Disagree. Since this is a measure of extraversion, we probably want higher scores to mean more extraversion.1 But say I have one item worded as:
  • I prefer to spend my free time alone. 
That item measures the opposite of extraversion. A person who strongly disagrees with that item prefers not to spend free time alone, so I'd want to reverse-score that item, with Strongly Disagree being worth 5 points and Strongly Agree worth 1 point. Most software programs can do this kind of recode easily for you, so you don't need to do it by hand. In fact, the R package I use for this analysis will reverse items automatically when computing alpha, so you don't even need to create a new variable.
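If you do want to recode by hand, it's simple arithmetic on a 1-to-5 scale. The sketch below also shows the automatic route, assuming the psych package (whose alpha() function has a check.keys argument for exactly this); if you use a different package, the details will differ:

  # By hand: on a 1-5 scale, reversing an item is just 6 minus the original score
  # (5 becomes 1, 4 becomes 2, and so on)
  reversed_item <- 6 - items$item3

  # Automatically: check.keys = TRUE flags and reverses any item that
  # correlates negatively with the total score before computing alpha
  library(psych)
  alpha(items, check.keys = TRUE)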

Because Cronbach's alpha is built from the correlations among items, and because you don't want any negative correlations (any items that negatively correlate with others should be reversed), Cronbach's alpha ranges from 0 to +1. Closer to 1 is better. There's some disagreement in the literature on how high Cronbach's alpha needs to be. I usually use 0.8 as a cutoff - Cronbach's alpha below 0.8 suggests poor reliability - with 0.9 being optimal. Essentially, I consider:
  • ≥ 0.9 - excellent reliability
  • ≥ 0.8 but < 0.9 - acceptable reliability
  • < 0.8 - poor reliability
I've seen some people consider values as low as 0.7 acceptable, and for certain measures, that could possibly be the case; as I said, there is some disagreement in the literature on this issue. I wouldn't go any lower than 0.7, though. And if you must sacrifice on reliability, make certain you have really strong evidence for the validity of your measure.

When should you use Cronbach's alpha? 

Cronbach's alpha is a classical test theory approach to reliability, so it makes sense to use it when creating a measure using classical test theory. Item response theory and Rasch measures use a different kind of reliability measure. I have seen Cronbach's alpha used with measures developed using item response theory or Rasch - mainly when the measure uses a fixed form (all examinees receive the exact same items). When creating a measure, I want reliability to be as high as possible, and I'm hesitant to accept anything below 0.9. If my reliability is less than 0.9, I go back to my items and see which ones are poor performers and should potentially be dropped from the final measure.
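Here's a sketch of what that item-level check can look like, again assuming the psych package and the simulated items data from above - the output includes what alpha would be if each item were dropped, which is one way to spot the poor performers:

  library(psych)
  results <- alpha(items, check.keys = TRUE)

  # Overall reliability for the full item set
  results$total

  # Reliability if each item were dropped - an item whose removal would
  # noticeably raise alpha is a candidate for closer review or removal
  results$alpha.drop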

For instance, I was recently handed a large dataset collected for my company by a marketing research firm. There were groups of items that were believed to assess the same thing, but they weren't developed as a measure (i.e., with psychometric methods and analysis), and the research group did their initial analysis using individual items. (Can you say "p-hacking"?) I did some principal components analyses to make sure all items measured the same thing or to see if there were subscales - this is a data-driven technique, but I had fewer cases than I would have liked for a confirmatory factor analysis. Instead, I adopted a hybrid theory-driven and data-driven approach: I examined items ahead of time and grouped them together as subscales, then confirmed that the PCA results matched those groupings (they did). Then I examined Cronbach's alpha for the subscales that emerged, as further support that they could be grouped together.
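The general workflow might look something like this sketch - not the actual dataset, just simulated data standing in for a hypothetical 8-item survey where theory suggests the first four items form one subscale and the last four form another (again assuming the psych package):

  library(psych)
  set.seed(123)

  # Hypothetical survey data: 200 respondents, 8 items on a 1-5 scale
  survey <- as.data.frame(matrix(sample(1:5, 200 * 8, replace = TRUE), ncol = 8))
  names(survey) <- paste0("q", 1:8)

  # Data-driven check: do two components emerge, with items loading as expected?
  principal(survey, nfactors = 2, rotate = "varimax")

  # If the structure holds up, check internal consistency within each subscale
  alpha(survey[, 1:4])
  alpha(survey[, 5:8])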

You should also compute Cronbach's alpha when using a measure created with classical test theory, and you'll want to make sure the reliability of the measure in your sample is high and comparable to the established reliability from measurement development research. (This is why it's important to track down psychometric articles/reports on the measures you're using. But being a psychometrician, of course I would say that.) In research reports, I've frequently been asked by reviewers to include Cronbach's alpha for all measures I used in a study. When I'm simply using someone else's measure, I'm less worried about super-high reliability; as long as it's close to 0.8, I'm fine with it. Depending on the measure, I've been fine with an alpha of 0.78, for instance. If it's much lower than that, I am usually hesitant to include that measure in my analysis.

Check back later today for the post on conducting Cronbach's alpha in R! And be sure to stop by Deeply Trivial again - there will be statistics-related posts every day this month!

1One of the assumptions of Rasch and item response theory, the models I use in my psychometrics work, is that higher scores correspond to more of the construct being measured. This is how I approach measurement. Some researchers instead create measures where lower scores indicate more of the construct. There's nothing wrong with that approach if you're using classical test theory, but it feels backwards to me. Even when I use classical test theory, I use the higher scores equals more approach. You may prefer the opposite. The point is that all items need to be expressed in the same way - you don't want some items where higher scores equals more and others where lower scores equals more. So whichever method you adopt, make sure it's consistent and items that don't follow that method are reverse-scored.

3 comments:

  1. Coefficient Alpha is one of the most misunderstood and misused coefficients in psychology. Klaas Sijtsma wrote an excellent paper about it in 2009:

    Sijtsma, K. (2009). On the use, the misuse, and the very limited usefulness of Cronbach’s alpha. Psychometrika, 74(1), 107. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2792363/.

    For further reading, Jessica Flake and I put together a section on reliability in our reading list "Measurement Matters":

    https://docs.google.com/document/d/11jyoXtO0m2lUywpC04KjLvI5QcBUY4YtwEvw6cg2cMs/

  2. There are theoretical assumptions that need to be met when using this measure of reliability, and in most cases omega is probably a better measure. I don't think it's suitable to discuss this topic unless we also discuss the concept of measurement models.

    Specifically, it is crucial to understand that when we say alpha relies on classical test theory, it actually relies on a specific model called the tau-equivalent measurement model. Among other statistical assumptions, this model assumes tau equivalence - that every item could serve as a test of the specific latent variable. With alpha, it's necessary to check whether every item has a similar variance and standard deviation; if not, we may have broken the assumptions of the measurement model behind alpha... please don't use the alpha coefficient as a "standard".

    Replies
    1. Thanks for your detailed answer, Martin! I'll admit - omega is something I was only exposed to recently and don't know a lot about. My grad school training was predominantly classical test theory, with some reference to alpha but mostly validity and threats to validity. My post-doc is in psychometrics, where I learned Rasch and item response theory. I almost exclusively use Rasch these days, which uses a different measure of reliability. I'm planning to write posts about Rasch concepts in the future. In my previous job, where some of our exams were developed with classical test theory, we mostly used split-half reliability and sometimes alpha. Thanks again for sharing your thoughts and for stopping by!
