Research is all about measuring things. Sometimes, the only way to measure a certain concept is by having people watch and "code" the behavior, or perhaps read something and code its contents. You want to make sure the coding scheme you're using is clear, leaving little room for subjectivity or "judgement calls." So the best thing to do is to have at least two people code each case, then measure what we call inter-rater reliability.
Think of reliability as consistency. When a measure is reliable, that means it is consistent in some way - across time, across people, etc. - depending on the type of reliability you measure. So if you give the same person a measure at two different time points, and check how similar the two sets of scores are, you're measuring reliability across time (what we call test-retest reliability). Inter-rater reliability means you're measuring the consistency between/across raters or judges.
Let's say, as I was conducting my caffeine study, I decided to enlist two judges who would watch my participants and code signs of sleepiness. I could have them check whether a certain participant yawned or rubbed their eyes or engaged in any behaviors that might suggest they aren't very alert. And for the sake of argument, let's say because I selected so many different behaviors to observe, I had my raters simply check whether a certain behavior happened at all during the testing session.
This really isn't how you would do any of it. Instead, I would select a small number of clear behaviors, video-tape the session so coders can watch multiple times, and have them do counts rather than yes/no. But you wouldn't use Cohen's kappa for counts.
I also don't do observational coding in my research. But I digress.
After the session, I would want to compare what each judge saw, and make sure they agreed with each other. The simplest measure of inter-rater reliability is just percent agreement - what percent of the time did they get the same thing? But if the raters weren't actually paying attention and just checked boxes at random, we know that by chance alone, they're going to agree with each other at some point in the coding scheme; after all, a stopped clock is still right twice a day. So Cohen's kappa controls for how often we would expect people to agree with each other just by chance, and measures how much two coders agree with each other above and beyond that:
Cohen's kappa = 1 - ((1-Percent agreement)/(1-Probability of chance agreement))
The probability of chance agreement is computed based on the number of categories and the percentage of time each rater used a given category. For the sake of brevity, I won't go into it here. The Wikipedia page for Cohen's kappa provides the equation and gives a really useful example. I've used Cohen's kappa for some of my research and created an Excel template for computing Cohen's kappa (and I'm 85% certain I know where it is right now).
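If you'd rather see the arithmetic than hunt down a spreadsheet, here is a minimal Python sketch of the formula above. The function name cohens_kappa and the yes/no ratings are made up for illustration, not taken from any real study. Chance agreement is computed the usual way: for each category, multiply the proportion of cases each rater put in that category, then add those products up.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who coded the same cases (illustrative helper)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must code the same, non-empty set of cases")
    n = len(rater_a)

    # Percent agreement: share of cases where the two raters gave the same code.
    p_agree = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: for each category, the product of the two raters'
    # proportions for that category, summed over all categories.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    categories = set(counts_a) | set(counts_b)
    p_chance = sum((counts_a[c] / n) * (counts_b[c] / n) for c in categories)

    if p_chance == 1:
        # Happens when both raters used a single, identical category throughout.
        raise ValueError("kappa is undefined when chance agreement is 1")
    return 1 - (1 - p_agree) / (1 - p_chance)


# Made-up data: did the participant yawn? One entry per participant.
judge_1 = ["yes", "yes", "no", "no", "yes", "no", "yes", "no", "no", "yes"]
judge_2 = ["yes", "no", "no", "no", "yes", "no", "yes", "yes", "no", "yes"]

kappa = cohens_kappa(judge_1, judge_2)
print(round(kappa, 2))
```

In this made-up example the judges agree on 8 of 10 participants (80%), and each judge said "yes" half the time, so chance agreement is 0.5 x 0.5 + 0.5 x 0.5 = 0.5. That gives a kappa of 1 - (0.2 / 0.5) = 0.6, noticeably lower than the raw 80% agreement.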
You would compute Cohen's kappa for each item (e.g., yawn, rub eyes, etc.). You want your Cohen's kappa to be as close to 1 as possible: 1 means perfect agreement, and 0 means the raters agreed no more often than you'd expect by chance. When you write up your results, you could report each Cohen's kappa by item (if you don't have a lot of items), or a range (lowest to highest) and average.
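Here is a rough Python sketch of that per-item reporting. Everything in it is hypothetical: the item names, the 0/1 codes, and the compact cohens_kappa helper, which is just the formula from earlier in this post written out so the example runs on its own.

```python
from collections import Counter
from statistics import mean


def cohens_kappa(rater_a, rater_b):
    """Compact Cohen's kappa for two equal-length lists of codes (illustrative)."""
    n = len(rater_a)
    p_agree = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Categories used by only one rater contribute zero to chance agreement.
    p_chance = sum(counts_a[c] * counts_b[c] for c in counts_a) / n**2
    # Note: this divides by zero if both raters used one identical category throughout.
    return 1 - (1 - p_agree) / (1 - p_chance)


# Made-up codes: 1 = behavior observed, 0 = not observed, one entry per participant.
# Each item maps to (judge 1's codes, judge 2's codes).
codes = {
    "yawn": ([1, 1, 0, 0, 1, 0, 1, 0], [1, 1, 0, 0, 1, 0, 0, 0]),
    "rub eyes": ([0, 1, 0, 0, 1, 0, 0, 1], [0, 1, 0, 1, 1, 0, 1, 1]),
}

# One kappa per item.
kappas = {item: cohens_kappa(a, b) for item, (a, b) in codes.items()}

for item, k in kappas.items():
    print(f"{item}: kappa = {k:.2f}")

# Summary for the write-up: lowest, highest, and average kappa across items.
lowest, highest, average = min(kappas.values()), max(kappas.values()), mean(kappas.values())
print(f"range = {lowest:.2f} to {highest:.2f}, mean = {average:.2f}")
```

With only two items you would simply report both kappas; the range-and-average summary earns its keep when the coding scheme has many items.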
So, readers, how would you rate this post?