I first want to offer the caveat that Ioannidis examined quantitative research. The issues affecting the accuracy of qualitative research are different (I won’t say non-existent, because qualitative research is definitely not infallible, just that these particular results really only apply to studies done where the collected data are numerical).
The underlying concept they’re trying to get at here is validity, defined as truth or, more specifically, whether the conclusions drawn from a study are a correct, accurate reflection of the topic under study. Though we can never really know the truth, we can get at it through many different types of research, performed in different settings, with different people, etc. Validity is a big concept that encompasses many different types of truth. In research, we think of four types of validity: internal, external, construct, and statistical conclusion.
Most people can understand the concept of validity, but occasionally struggle with the four types. Therefore, I’m going to use one hypothesis to show the different kinds of validity. This hypothesis comes from a conversation I was having with a friend one day. I told a recently heard, and quite dirty, joke, and afterward, said I should probably keep my telling of dirty jokes to a minimum. To which my friend replied, “You can never have too many dirty jokes.” And of course, being a scientist, I said, “I think we should empirically test that hypothesis.” Little did my friend know, I was only half joking.
So let’s say I wanted to design a study to test this hypothesis. First, I’d need to alter the hypothesis somewhat, unless I’m willing to allow an infinite number of dirty jokes (because I doubt you could actually set up a study to test a “never” contingency), but I’d want to get at the underlying topic of number of allowable dirty jokes. I would have to set up a situation where I could determine at what point someone hearing the dirty jokes requests that they stop. I’d have to pick a certain setting to conduct this study, and have at least two people there (perhaps more): one to tell the dirty jokes, and one to listen and determine when the jokes should stop. I’d have to make sure the joke-teller has enough dirty jokes in his/her repertoire so that the experiment could go on as long as needed - so that the only person calling a halt to the jokes is the listener (or listeners) - but would probably set up a time or number-of-jokes limit so that the participants (and the researchers, for that matter) aren't stuck there forever. I might also want to add another condition, where the joke-teller tells clean jokes; it’s possible that people just get fatigued listening to jokes in general, so we’d want to determine if there’s something different about dirty jokes that may increase or decrease the number a person is willing to hear before saying enough.
All of the above would help us to establish strong internal validity: certainty that our independent variable (the jokes) actually caused our dependent variable (the request to stop telling jokes). If I didn’t have the additional, clean-joke condition, I could still test at what point the person hearing dirty jokes asks that they stop, but I’d be less certain it was the dirty jokes causing the request, rather than jokes in general (or just being forced to listen to one person talk for a long time, another potential comparison condition).
Okay, so imagine that I did this study with people hearing dirty jokes from someone (one-on-one, so there was only one joke-teller and one joke-hearer) and other people hearing clean jokes. Let’s say they were randomly assigned to hear either clean or dirty jokes, so that we could expect any additional characteristics affecting our outcome (e.g., poor sense of humor, intolerance for sexual references, etc.) would be evenly divided across groups. And let’s say I found that, on average, people are willing to hear 5 dirty jokes before asking the joke-teller to stop (compared to, say, 10 clean jokes).
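To see why random assignment tends to even out those extra characteristics, here is a minimal Python sketch. Everything in it is made up for illustration: the 200 participants, the 0-to-10 "tolerance for crude humor" score, and the helper name `randomly_assign` are my own assumptions, not part of any real study.

```python
import random
import statistics


def randomly_assign(participants, rng):
    """Shuffle a copy of the participants and split it into two halves."""
    shuffled = list(participants)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical trait: each person's tolerance for crude humor, scored 0 to 10.
tolerance = [rng.uniform(0, 10) for _ in range(200)]

dirty_group, clean_group = randomly_assign(tolerance, rng)

# If assignment is random, the two group averages should land close together.
gap = abs(statistics.mean(dirty_group) - statistics.mean(clean_group))
print(f"dirty-joke group mean tolerance: {statistics.mean(dirty_group):.2f}")
print(f"clean-joke group mean tolerance: {statistics.mean(clean_group):.2f}")
print(f"gap between groups: {gap:.2f}")
```

With groups this size, the gap should be small relative to the 0-to-10 scale, which is the sense in which the trait is "evenly divided". Nothing guarantees it, though; it is only very likely.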
Does this mean, if I’m at a family reunion, with my rather large family, I know I can probably get away with 5 dirty jokes before someone says, “Okay, Sara, that’s enough. You mean to tell us we helped you through grad school so you could become a female Patton Oswalt?”? Not necessarily. Remember, I did the study in a one-on-one situation. My results may not generalize to group situations. This refers to the notion of external validity, the degree to which the findings of a study can generalize to other people or situations. It doesn’t mean my results are wrong if I find that at my family gathering, I can tell 20 jokes before someone says, “Okay, that’s probably enough.” It just may mean that groups are different than individual people.
I’d want to do another study using groups instead of individuals, to examine how the effect may differ. I may find that certain groups (e.g., my family) are more tolerant of dirty jokes and allow a greater number to be told than other groups (e.g., my fellow congregants at Sunday mass), and may even find that the same people can be more or less tolerant of dirty jokes depending on our current situation (such as telling jokes to fellow congregants while at church versus telling the same people jokes while we’re out at the bar).
One thing that is important for any of the studies discussed above is how I’m defining my variables. What exactly do I mean by “dirty jokes”? Do I mean jokes with foul language? Sexual content? Something else? Once again, if I do a study and find that people are quite tolerant of dirty jokes and allow a dozen to be told before saying “enough”, and another researcher finds the number to be much lower (say three), it doesn’t necessarily mean one of us did a poor study. Even if we both did the study in the same situation, with the same types of people, we might find different results if we defined “dirty jokes” differently. And while we could probably think of multiple good definitions of “dirty joke”, some definitions are better than others. If, in my study, I defined “dirty jokes” as jokes about dirt and mud, then that could be a big reason for my different results; the way I defined the construct “dirty joke” was not very accurate, so the construct validity is low.
If this is your idea of a "dirty joke", you should check out Sesame Street's True Mud sketch.
The same thing can be said about statistics; if I ignore the rules on when I can use a specific statistical formula and use it anyway, my results could be incorrect. This is statistical conclusion validity: whether the statistical methods were appropriate for the data, so that the conclusions drawn from them hold up. For example, one assumption of many tests is that the dependent variable (the outcome) is normally distributed (i.e., the “bell curve” - this is why, in any stats class, the normal distribution is one of the first things you learn; it’s the underlying assumption of most of the tests you learn in those classes). If we want to use one of those tests, and our dependent variable is skewed, we may draw the wrong conclusion from our results.
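A quick way to spot a skewed outcome before reaching for one of those tests is to compare the mean with the median. The sketch below uses made-up data (ten listeners, and a hypothetical 40-joke cap that one very patient listener reaches) and Pearson's second skewness coefficient as a rough gauge. It is an informal check under those assumptions, not a formal test of normality.

```python
import statistics

# Made-up data: jokes tolerated by ten listeners. One listener never asks the
# teller to stop, so they hit the (hypothetical) 40-joke cap.
jokes_tolerated = [1, 2, 2, 3, 3, 3, 4, 4, 5, 40]

mean = statistics.mean(jokes_tolerated)
median = statistics.median(jokes_tolerated)
spread = statistics.stdev(jokes_tolerated)  # sample standard deviation

# Pearson's second skewness coefficient: 3 * (mean - median) / std deviation.
# Values near 0 suggest symmetry; clearly positive values suggest a long
# right tail.
skew = 3 * (mean - median) / spread

print(f"mean = {mean}, median = {median}, skew = {skew:.2f}")
```

Here the mean (6.7) sits well above the median (3) because a single listener drags it upward. Reporting "people tolerate about 7 dirty jokes" would describe almost nobody in the sample, which is the kind of wrong conclusion a skewed dependent variable can produce.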
Of course, even if you do a study in the best, most controlled, most accurate way possible, you might still draw the wrong conclusion. Sometimes weird stuff happens: even with random assignment, we might have some weird fluke where all the people with good senses of humor end up in one group. Or I might do the study on my family on a really good day, when they’re willing to hear way more dirty jokes than they would on any other day, meaning my results are limited not just to my particular family, but to my family on a very special kind of day. This is why we keep studying a topic, even if many others have already studied it. And we can’t just limit ourselves to one type of research, such as lab studies with lots of control and random assignment to groups. If we study a topic in many different ways (lab studies, observational studies, interviews) and find generally the same results in all of them, we can be even more certain our conclusions are accurate, and that we’ve gotten close to finding that elusive concept of truth. And recognize that things can go wrong. It’s not the end of the world; just keep studying and have a good sense of humor.
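For the curious, the random-assignment fluke described above can be put into rough numbers with a small simulation. This is only a sketch under my own assumptions: a made-up "sense of humor" score from 0 to 10, a gap of more than 2 points between group averages counted as a fluke, and hypothetical helper names `assignment_gap` and `fluke_rate`.

```python
import random
import statistics


def assignment_gap(traits, rng):
    """Randomly split traits into two halves; return the gap in group means."""
    shuffled = list(traits)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return abs(statistics.mean(shuffled[:half]) - statistics.mean(shuffled[half:]))


def fluke_rate(n_participants, threshold, n_studies, rng):
    """Fraction of simulated studies whose groups differ by more than threshold."""
    flukes = 0
    for _ in range(n_studies):
        # Hypothetical trait: sense of humor, scored 0 to 10.
        humor = [rng.uniform(0, 10) for _ in range(n_participants)]
        if assignment_gap(humor, rng) > threshold:
            flukes += 1
    return flukes / n_studies


rng = random.Random(7)  # fixed seed so the sketch is repeatable

small_rate = fluke_rate(n_participants=10, threshold=2.0, n_studies=2000, rng=rng)
large_rate = fluke_rate(n_participants=200, threshold=2.0, n_studies=2000, rng=rng)

print(f"lopsided groups with 10 participants:  {small_rate:.1%} of studies")
print(f"lopsided groups with 200 participants: {large_rate:.1%} of studies")
```

Small studies should come out lopsided far more often than large ones, but even a large study is never completely immune, which is one more reason a single study is not the last word.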