I fell a bit behind in my schedule. Friday was spent volunteering for my choir and attending a benefit, and Saturday was spent recovering from the night before and attending a movie night with friends. But I'll have some great videos to share of the event soon!
One reason I didn't write my post on Friday was that the topic I was originally going to write about (factorials) was one I covered last year. So I needed to make time to sit down and think of a new topic, then write the post as well, and that sadly didn't happen. But I remembered the many contributions of Ronald Fisher, and decided it might be nice to feature a statistician in one of my posts.
Fisher was an English statistician and biologist. Many of his contributions are to evolutionary theory, genetics, and related topics, but he also gave us many statistical tests and methods. Probably his best-known contribution is the Analysis of Variance (ANOVA). I'll get into the mechanics of statistical analysis later, which will help you understand what's going on when we run an ANOVA, but the short version is that this extremely useful test lets you compare means from more than 2 groups.
Using my caffeine study example, I could have 3 groups instead of two: control (no caffeine), experimental 1 (receives 95 mg of caffeine, the equivalent of one cup of coffee), and experimental 2 (receives 190 mg, 2 cups of coffee). I could then compare the test scores across the 3 groups. This would let me explore many possible outcomes, and in fact, I could add a few more experimental groups, where I give different amounts of caffeine. It could be that more caffeine is better. Or it could improve performance up to a point, and then flatten out. We could even find a point where more caffeine is actually harmful. ANOVA is great when you have multiple (but not a huge number of) study groups, each of which receives a specific intervention or treatment. If you have more than a handful of groups, there are other better analyses you could use.
Even though Fisher named this analysis after what it does (it analyzes variance and determines what variance is due to the treatment and what variance is error), the test also ended up carrying his name. The statistic ANOVA produces is F (for Fisher), and it is compared to an F-distribution critical value to determine whether the test is significant (whether there is a real difference, larger than what we would expect by chance alone). You might even hear people refer to the F test - they're talking about ANOVA.
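To make that a bit more concrete, here's a minimal sketch of what running an ANOVA might look like in Python, using scipy and made-up test scores for the three-group caffeine design I describe below. The numbers (and group sizes) are purely illustrative.

```python
# A minimal one-way ANOVA sketch with invented scores for a hypothetical
# three-group caffeine study.
from scipy import stats

control = [78, 82, 75, 80, 79]        # no caffeine
caffeine_95 = [85, 88, 84, 90, 86]    # 95 mg (one cup of coffee)
caffeine_190 = [83, 91, 87, 89, 88]   # 190 mg (two cups)

# f_oneway returns the F statistic and its p-value for comparing the group means
f_stat, p_value = stats.f_oneway(control, caffeine_95, caffeine_190)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")

# If p < .05 (our alpha), we conclude the group means differ by more than
# we'd expect from chance alone.
```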
He also gave us the Fisher exact test, which I use less often than ANOVA but have still used. It lets you test whether two proportions are equal to each other. At my old job, we used Fisher exact tests quite a bit to compare our sample characteristics to population characteristics. That is, we would do surveys of VA patients, and then make sure our sample represented that population well in terms of gender, race, ethnicity, and so on. The assumption is that, if the sample is similar to the population in its characteristics, then it should give answers to the survey questions (the topic of study) similar to what the population would have given.
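For the curious, here's what the mechanics of a Fisher exact test look like in code. This is just a sketch with an invented 2x2 table of counts (the real setup at my old job was more involved than this); the point is simply how the test gets run.

```python
# Fisher exact test on a made-up 2x2 table of counts:
# rows are two groups (e.g., our sample vs. a comparison group),
# columns are counts of women and men. All numbers are invented.
from scipy import stats

table = [[45, 55],   # women, men in group 1
         [48, 52]]   # women, men in group 2

odds_ratio, p_value = stats.fisher_exact(table)
print(f"odds ratio = {odds_ratio:.2f}, p = {p_value:.3f}")

# A non-significant p-value is what we'd hope for here: no evidence that
# the two proportions differ.
```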
Fisher also gave us maximum likelihood estimation (MLE), which is getting into more advanced statistical topics. I'll be honest - I understand it, but not well enough to actually teach it to someone. But I knew I would be remiss if I didn't at least mention MLE.
Yesterday, I talked about descriptive statistics, including measures of central tendency (such as the mean) and measures of variability (such as standard deviation). Let's say I give you the mean and standard deviation for the test I used in my caffeine study. If I point at a particular person in that sample and ask you to guess what score they got on the test, your best option would be the mean. That's what we use as the "typical" score. It's probably going to be the wrong guess, but in the absence of any additional information, it's your best bet.
Scores vary, of course. It's unlikely that two people in your sample will get the same score, let alone everyone in the sample. The point of statistics is to find variables that explain that variance. The mean is the representative score for any individual person in the sample. So unless and until we find the variable that explains why scores differ from each other, we have to assume that any variation in scores (any deviation from the mean, our typical score) is a mistake. We call that error.
Any variation is error until we can explain it.
Obviously, this isn't actually true. People are different from each other, with different abilities, and there are logical explanations for why one person might get a perfect score on a test while another might get 50%. And it isn't just underlying ability that might affect how a person performs on a test. A host of environmental factors might also influence their scores. But statistically, we have to think of any variation in scores that we can't explain as error. Our inferential statistics are used to explain that variation - to take some of what we call error and relabel it as systematic, as having a cause. We can't measure everything, so in any study, we're going to have leftover variance that we simply call error.
So statistics is really about taking variation in scores and moving as much of it as we can out of the error column and into the systematic (explained) variance column. If I can show that some of the variance is due to whether a person was allowed to drink caffeinated coffee before taking a test, I have found some evidence to support my hypothesis and have been able to move some of the variation into the explained column.
How exactly do we go about partitioning out this variance? Stay tuned! I'll get to that in a future post.
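In the meantime, here's a tiny numerical preview of the idea, with made-up scores for two groups: the total variation in scores splits into a piece the groups explain and a leftover piece we call error. This is just a sketch, not the full treatment.

```python
# A tiny preview of partitioning variation (invented scores, two groups).
# Total sum of squares = between-group (explained) + within-group (error).
import numpy as np

caffeine = np.array([88, 85, 90, 87])
control = np.array([80, 78, 83, 79])
all_scores = np.concatenate([caffeine, control])
grand_mean = all_scores.mean()

ss_total = ((all_scores - grand_mean) ** 2).sum()
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in (caffeine, control))
ss_within = sum(((g - g.mean()) ** 2).sum() for g in (caffeine, control))

print(ss_total, ss_between + ss_within)  # these two match
print(f"explained: {ss_between / ss_total:.0%}, error: {ss_within / ss_total:.0%}")
```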
You're out for coffee with a friend. You start talking about a new book you just finished. Your friend asks you to tell her about the book. Do you a) pull the book out of your purse and start reading it to her or b) give her a brief summary?
Yes, I know that's a silly question. Obviously it's a.
Kidding. Of course you would give her a summary. No friend - no matter how patient - is going to sit there and listen to you read her the entire book, even a short one. That's simply too much information to answer her question.
That's essentially what statistics are: a way of summarizing large amounts of information to give people the most important pieces. In fact, many people divide statistics up into two types. The first, which I'll talk about in more detail today, is descriptive statistics. The second is inferential statistics, which I'll talk more about later; its purpose is to explore relationships and find explanations for different effects.
Descriptive statistics, as the name implies, describe your data. That is, they are used to quickly summarize the most important pieces of information about the variables you measured. You've probably encountered many of these statistics.
In fact, because we statisticians love counting and subdividing things, we tend to divide descriptive statistics up into two types. First are measures of central tendency, a fancy way of saying that you're describing the typical case. The measure of central tendency you know best is the average, or the mean as it's called in statistics. You get it by adding all of the scores up and dividing by the number of scores. In our ongoing caffeine study, we would probably report the average score for our sample, as well as by group (experimental or control). That tells us a lot of what we need to know about our sample.
Average isn't always the best measure of central tendency to report. What if your data are in the form of categories, like gender? There isn't an average gender, but you could still report the proportions of men and women. And you could tell us the typical gender of your sample by reporting which gender is more frequent. This is called the mode: the most frequently occurring category.
The last measure of central tendency is the median, which divides your distribution in half. Basically, if you were to line up all your scores in numerical order, it would be the score in the very middle. This measure is best used when your data has outliers, scores that are so far away from the rest of your group that including them in the average would skew your results. Think of it as trying to report the average salary of a group of people when you have one billionaire in the bunch. The average wouldn't make sense because it's not going to represent anyone in your group well. The median reduces the influence of very high or very low scores on the ends.
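To see the billionaire problem in actual numbers, here's a quick sketch with invented salaries:

```python
# Invented salaries: nine ordinary earners and one billionaire (in dollars).
import statistics

salaries = [40_000, 45_000, 50_000, 52_000, 55_000,
            58_000, 60_000, 65_000, 70_000, 1_000_000_000]

print(f"mean:   {statistics.mean(salaries):,.0f}")    # ~100 million - represents no one
print(f"median: {statistics.median(salaries):,.0f}")  # 56,500 - a much more typical value
```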
So let's say the average test score for our caffeine study sample is 85%. This tells you the typical person got a B, but doesn't tell you everything you need to know. It could be that everyone got a B. It could be that half of the people got an A+ and the other half got a C-. Or there could be grades ranging all the way from A+ to F. You want to know how spread out the scores are. For this information, we use measures of variability.
The first measure of variability is the easiest: the range, which is just the distance between the highest and lowest scores. It's also the least useful. If you think back to the billionaire example above, one outlier can make your scores look a lot more spread out than they actually are.
That's why we have two other measures of variability, which I'll talk about together (and you'll see why in a moment). The first is variance. You need variance to compute the other measure, standard deviation. Standard deviation tells you how much your scores deviate from the mean, on average. So the first step in computing standard deviation is to take each score in the sample and subtract the mean from it. But you can't just add all those deviations up and divide by the number of scores. The mean is a balancing point; it's designed to give you the approximate center. Some deviations will be positive (higher than the mean), and some will be negative (lower than the mean). If you add all of the deviations together, they'll add up to 0 (or extremely close, once rounding gets involved). So after you compute your deviations, you square them so they're all positive. Those squared deviations are used to compute your variance, the average squared deviation from the mean. Then you take the square root of the variance to get the standard deviation.
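Here's that recipe spelled out as a short sketch, with a handful of made-up scores:

```python
# Standard deviation step by step, using a few invented test scores.
scores = [70, 80, 85, 90, 100]

mean = sum(scores) / len(scores)
deviations = [x - mean for x in scores]           # these sum to 0
squared_deviations = [d ** 2 for d in deviations]

variance = sum(squared_deviations) / len(scores)  # average squared deviation
std_dev = variance ** 0.5                         # back to the original units

print(mean, sum(deviations), variance, std_dev)   # 85.0  0.0  100.0  10.0
# Note: software often divides by n - 1 instead of n when working with a
# sample; the logic is the same either way.
```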
Do you see why I talked about them together? Variance is important and it's used in a lot of analyses, but as a measure of variability, people tend to rely on standard deviation. Squared scores and deviations are just a little harder to wrap our heads around.
So now, if I give you the average test score and a standard deviation, you've got a better idea of what the distribution of scores looks like. If the average score is 85 and the standard deviation is 2, you know that this is a moderately easy test for most people (just about everyone got the same score). But if the average score is 85 and the standard deviation is 20, you know the test ranges from super easy to super challenging for different people.
I'll talk about our good friend the normal distribution in one of these posts, and then you'll really see how beneficial standard deviation is. Because if something is normally distributed and you know the standard deviation, you can very easily figure out what percentage of scores fall above or below certain values.
As I said in my previous post, you can use control variables to get a clearer view of the effect of your independent variable on your dependent variable, by factoring out other noise. But control variables can also be used to make an even stronger case for a particular cause and effect relationship by showing that other explanations are incorrect.
Case in point: the gender wage gap, and the argument that it persists even after you account for things like education, field, and experience.
Statistically, the same thing is happening in either case. You're taking variables that may affect your outcome, but aren't the variables you're interested in studying, and factoring them out to get at the true relationship between your independent variable (or in this case, since we're looking at gender, which you can't manipulate, we would probably say predictor variable) and your dependent variable (wages). But in this case, you're seeing if the relationship goes away when you account for the alternative explanations. That is, if gender didn't directly affect wages, and it's instead because of differences in education, field, and so on, the gender difference would go away when these other variables are statistically controlled for. Even after controlling for these alternative explanations, women still have lower wages than men.
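In practice, "statistically controlling for" those variables often means putting them in a regression model alongside the predictor you care about. Here's a rough sketch of that idea using the statsmodels formula interface; the dataset and variable names are invented purely to show the model structure, not to estimate anything real.

```python
# A sketch of controlling for alternative explanations in a regression.
# The data and variable names are made up; the point is the model structure.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "wage":       [52, 48, 61, 44, 58, 50, 65, 47, 55, 46],  # in thousands
    "female":     [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],            # 1 = woman
    "education":  [16, 16, 18, 14, 16, 18, 18, 16, 14, 14],  # years of school
    "experience": [10, 9, 12, 8, 11, 12, 14, 9, 10, 8],      # years on the job
})

# The coefficient on 'female' is the estimated gender difference in wages
# *after* education and experience have been factored out.
model = smf.ols("wage ~ female + education + experience", data=df).fit()
print(model.params)
```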
Yesterday, I talked about beta/Type II error and described things we can do in our study to decrease beta. The things I described were more about study methods, though - making certain you're conducting a rigorous study to reduce bias and outside influences. While in many research fields we discuss methods and statistics separately, they're connected, and methods can and do impact your statistical analyses.
The cleanest study is one where you can control your participants' entire situation - what they see, what they have access to, even what they've seen in the past. This is impossible, of course. But we do our best to recreate this perfect situation and (when it's possible) count on random assignment of participants to groups to even out any pre-existing differences. Once again, because probability. It's unlikely that every person who would react as you expected ends up in one group (say your experimental group) while every person who would not react as you expected ends up in another; the chances of that happening are small (but remember, not 0 - it's completely possible, though unlikely, to flip a coin and get 100 heads in a row).
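Random assignment itself is simple to do. Here's a toy sketch that shuffles a list of hypothetical participants and splits them into two groups:

```python
# Randomly assigning hypothetical participants to experimental and control groups.
import random

participants = [f"participant_{i}" for i in range(1, 21)]
random.shuffle(participants)              # scramble the order by chance

half = len(participants) // 2
experimental = participants[:half]
control = participants[half:]

print(experimental)
print(control)
# Any pre-existing differences should now be spread across the two groups
# by chance rather than piled up in one of them.
```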
Weird things happen, because probability, and we also cannot control everything. We certainly can't randomize everything. There are some conditions it would be unethical or impossible to manipulate in our study. In those cases, we can still take them into account by using them as control variables. Control variables are variables that we think will affect our outcome (dependent variable) but are not the actual thing we're studying (independent variable). We deal with them by measuring them and using them in our analysis. I'll go into more detail later about exactly what's happening when we use control variables in our analysis, but essentially we're factoring that information out, so we can get a clean look at the relationship between the variables we're interested in (independent and dependent variables).
In my caffeine study, I might want to use some pre-existing information about my participants as control variables. Characteristics like gender, which might affect their reaction to caffeine, are possibilities. Another is how much caffeine they ingest regularly. I wouldn't want them to have any caffeine before my study, and if a participant showed up saying he or she had had caffeine that day, I would probably throw their data out. But I can't control how much caffeine they consumed before they even signed up for my study. If they drink a lot of coffee each day, they might react to the treatment differently than a person who drinks very little. I can collect that information with a short questionnaire, then use it as a control variable in my analysis.
So we're moving forward with blogging A to Z. Today's topic is really following up from Saturday, where we talked about alpha. Perhaps you were wondering, "If alpha is equal to Type I error, where you say there's something there when there really isn't, does that mean there's also a chance you'll say there's nothing there when there really is?" Yes, and that's what we're talking about today: Beta. Also known as your Type II error rate.
Continuing with the horror movie analogy, this is when you walk into the room and walk right by the killer hiding in the corner without even seeing him. He was right there and you missed him! Once again, unlike in the movie, you don't get any feedback - like ending up dead later on - to tell you that you missed something. So committing a Type II error won't necessarily kill you, but at the very least, you're missing out on information that might be helpful.
Unlike alpha, which we set ahead of time, we can only approximate beta. We can also do things to minimize beta: use really strong interventions (dosage), minimize any diffusion of the treatment into our control group, and select good measures, to name a few. In fact, if you want to learn more, check out this article on internal validity, which identifies some of the various threats to internal validity (the ability of our study to show that our independent variable causes our dependent variable).
To use my caffeine study example from Saturday, I would want to use a strong dose of caffeine with my experimental group. I would also make sure they haven't had any caffeine before they came in to do the study, and if I can't control that, I would at least want to measure it. I would also probably keep my experimental and control groups separate from each other, to keep them from getting wise to the differences in the coffee. And I would want to use a test that my participants have not taken before.
There's also a way you can minimize beta directly, by maximizing its complement: 1-β, also known as power. Power is the probability that you will find an effect if it is there. We usually want that value to be at least 0.8, meaning an 80% probability that you will find an effect if there's one to be found. If you know something about the thing you're studying - that is, other studies have already been performed - you can use the results of those studies to estimate the size of the effect you'll probably find in your study. In my caffeine study, the effect I'm looking for is the difference between the experimental and control groups, in this case a difference between two means (averages). There are different metrics (that I'll hopefully get to this month) that reflect the magnitude of the difference between two groups, metrics that take into account not only the difference between the two means but also how spread out the scores are in the groups.
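One well-known metric of that kind is Cohen's d: the difference between two means divided by a pooled standard deviation. Here's a quick sketch with invented scores, just to show the arithmetic:

```python
# Cohen's d: the difference between two group means in standard deviation units.
# Scores below are invented for illustration.
import numpy as np

experimental = np.array([88, 85, 90, 87, 92, 86])
control = np.array([80, 78, 83, 79, 84, 81])

n1, n2 = len(experimental), len(control)
s1, s2 = experimental.std(ddof=1), control.std(ddof=1)

# Pooled standard deviation combines the spread of both groups
pooled_sd = np.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
d = (experimental.mean() - control.mean()) / pooled_sd
print(f"Cohen's d = {d:.2f}")
```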
Using that information from previous studies, I can then do what's called a power analysis. If you're doing a power analysis before you conduct a study (what we would call a priori), you'll probably use that power analysis to tell you how many people you should have in your study. Obviously, having more people in your study is better, because more people will get you closer to the population value you're trying to estimate (don't worry, I'll go into more detail about this aspect later). But you can't get everyone into your study, nor would you want to spend the time and money to keep collecting data forever - studies would never end! So an a priori power analysis helps you figure out what resources you need while also helping you feel confident that, if there's an effect to be found, you'll find it.
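As an example of what an a priori power analysis can look like in code, here's a sketch using statsmodels for a simple two-group comparison. The effect size plugged in is a guess, purely for illustration.

```python
# A priori power analysis sketch for a two-group comparison.
# The effect size (in Cohen's d units) is a guess for illustration only.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.5,  # guessed medium-sized effect
                                    alpha=0.05,      # Type I error rate
                                    power=0.80)      # 1 - beta
print(f"About {n_per_group:.0f} participants per group")
```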
Of course, you might be studying something completely new. For instance, when I studied Facebook use, rumination, and health outcomes, there was very little research on these different relationships - there's more now, of course. What I did in those cases was to pick the smallest effect I was interested in seeing. Basically, what is the smallest effect that is just big enough to be meaningful or important? In this case, we're not only using statistical information to make a decision; we're also potentially using clinical judgment.
For instance, one of my health measures was depression: how big of a difference in depression is enough for us to say, "This is important"? If we see that using Facebook increases depression scores by that much, then we have a legitimate reason to be concerned. That's what I did for each of the outcomes, when I didn't have any information from previous studies to guide me. Then I used that information to help me figure out how many people I needed for my study.
Power analysis sounds intimidating, but it actually isn't. A really simple power analysis can be done using a few tables. (Yep, other people have done the math for you.) More complex power analyses (when you're using more complex statistical analyses) can be conducted with software. And there are lots of people out there willing to help you. Besides, isn't doing something slightly intimidating better than running a study without knowing whether you're even likely to find anything?
So for today, the letter A, I thought I'd start at the very beginning because Julie Andrews says it's a very good place to start. And that's with alpha, the first letter of the Greek alphabet.
Alpha is a really important concept in statistics. Statistics is all about probability. I know that sounds kind of obvious. If you've taken a stats class, or even a section on statistics in another math class, they probably spent a lot of time talking to you about probability. In the past, they did that with poker hands. The downfall of teaching it that way is that you have to teach people poker so they understand which hands are better than others. I guess that makes dorm parties a little more interesting. More recently, I've seen them use combinations of dice or coin flips. But I don't think people always make the connection - why we're really hammering home this concept of probability, the probability of different combinations, and the idea that some outcomes are more or less likely than others.
That's because any statistical inference we make - and by statistical inference I mean any decision we make based on the results of statistical tests - is based entirely on probability. Specifically, it's based on how unlikely the outcome we saw would be if only chance were operating - so unlikely that we conclude it reflects a real difference rather than just chance.
So let's go with a concrete example. Let's say I do an experiment, and I want to test the effect of caffeine consumption on test performance. I think having caffeine before you take a test is going to give you a better grade than no caffeine (my hypothesis). This exact study has probably been done hundreds or thousands of times. I've got my experimental group that I have drink caffeinated coffee, and my control group that I force to drink decaf, because I'm a mean person - but sometimes it's necessary for science. And then I give them a test and see how well they did. I expect that, on average, the experimental group will do better. Now, I can't just look and go, "Hey, the experimental group has a higher score!" It would be really really unlikely for the groups to have the exact same average, but how much different do they need to be before we would say there is a real difference? We would set a cut-off, a critical value - if the difference is at least this big, we will conclude that there is a real difference between my two groups and not just random chance.
And the way that we set that critical value is with probability. We go with a difference that would have a small chance of happening by random luck alone. And that is alpha. We set that ahead of time. The most common alpha - the convention - is 0.05, which means that the critical value is based on a difference we would have only a 5% chance of seeing if we just had random luck operating - if there wasn't a true difference between caffeine consumers and non-caffeine consumers.
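In practice, software usually reports a p-value - the probability of seeing a difference at least this big if only chance were operating - and comparing that p-value to alpha is equivalent to comparing the difference to the critical value. Here's a small sketch of that logic with made-up scores and a two-group t test:

```python
# Comparing an observed difference to alpha = 0.05, with invented scores.
from scipy import stats

caffeinated = [88, 85, 90, 87, 92, 86]
decaf = [80, 78, 83, 85, 84, 81]

t_stat, p_value = stats.ttest_ind(caffeinated, decaf)
alpha = 0.05

print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
if p_value < alpha:
    print("Difference bigger than we'd expect by chance alone - call it significant.")
else:
    print("Difference small enough to chalk up to chance.")
```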
But that's 5% - that's not 0! There's still a chance that we'll find a difference that isn't real; it's not because of the caffeine, it's just luck. We accept that. We know that's a possibility. We can't really know from a single study whether there is a real effect of caffeine, or whether we've fallen in that 5%. In fact, when we fall in that 5%, we've made what's called a Type I error. We're saying, "Hey, there's something here!" when really there's nothing.
You know how in horror movies (which I blog about a lot) someone walks into a room and we think the murderer is going to jump out at them, and instead it's just a cat? They jump, we jump, everyone freaks out. It's like that. We reacted like it was the murderer, but it was just the cat.
That's Type I error. Except in this case, we don't have the immediate feedback of seeing the cat or not being dead to know we didn't actually find something real. All we know is we found something and it made us jump. We don't know if we should have jumped or not. This is probably why I hate jump scares in horror movies. I'm traumatized by the possibility of Type I error.
So does that mean there are a bunch of studies out there that have found results that are just bogus, that are just Type I error? Yes, it does. Because with an alpha of 0.05, if there is no real effect and you do that study 100 times, 5 of them will probably come out significant. And considering that we have this thing called publication bias, where studies that find significant results are more likely to be published, there's a whole lot of Type I error floating around out there. This is why replication is so important and needs to be encouraged. And why we need to stop publication bias. And we also need to stop something I've blogged about before called p-hacking.
P-hacking is directly related to alpha. When you have a huge dataset, tons of variables, and you just run analyses willy-nilly, looking for a significant result, you're dramatically increasing the chance that you'll commit a Type I error - that at least one of those results will be significant just by chance. If you have an alpha of 0.05 each time you run a test, those probabilities pile up. Because even if there's no relationship between two variables, there's a 5% chance you'll find one anyway. And that's just for 1 test. If you run 2 tests, it's roughly 10%; 3 tests, roughly 15%; and so on (adding is an approximation, but a good one when the number of tests is small). If you run 20 tests, 1 of those is probably going to be significant just by luck. If you run 100 tests, 5 will probably be significant just by chance.
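Here's a quick sketch of how fast that chance grows when the tests are independent, showing both the simple "add them up" figure and the exact probability of at least one false positive:

```python
# Chance of at least one false positive across k independent tests at alpha = .05,
# assuming there are no real effects at all.
alpha = 0.05
for k in (1, 2, 3, 5, 10, 20):
    adds_up = k * alpha              # the simple "5% per test adds up" figure
    exact = 1 - (1 - alpha) ** k     # exact chance of at least one false positive
    print(f"{k:2d} tests: ~{adds_up:.0%} by adding, {exact:.0%} exactly")
```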
And if you don't tell people, "Oh, by the way, we just ran a shit-ton of tests, and only reported the few that were significant," they might not realize how much you inflated your Type I error rate. This is why you shouldn't run a bunch of tests willy-nilly, and if you are going to do a whole bunch of tests, you should plan them ahead of time and apply a correction to your alpha. It's 5% each shot, so if you're doing 10 tests, you've inflated that to roughly 50%. Instead, you should take that .05 and divide it by the number of tests you're going to do.
If you're doing 10 tests, your alpha for each test is .05/10 = .005. If you want to learn a new term to impress your friends, that is called a Bonferroni correction. It's named after a guy named Bonferroni, who came up with it. I can kind of understand wanting to name something after yourself, especially with a name like Bonferroni, because I bet he got made fun of for that name (people still make fun of it), and the best way to get back at the haters is to make them say your name with a little bit of respect. I can get behind that.
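For completeness, here's what applying a Bonferroni correction looks like in practice - a sketch with made-up p-values from ten planned tests:

```python
# Bonferroni correction: divide alpha by the number of planned tests.
alpha = 0.05
p_values = [0.001, 0.004, 0.020, 0.030, 0.048,   # invented p-values from
            0.060, 0.150, 0.300, 0.450, 0.800]   # ten hypothetical tests

corrected_alpha = alpha / len(p_values)           # .05 / 10 = .005
for p in p_values:
    verdict = "significant" if p < corrected_alpha else "not significant"
    print(f"p = {p:.3f} -> {verdict} at corrected alpha = {corrected_alpha:.3f}")
```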