Sunday, January 15, 2012

False Research Findings, Truth, and Dirty Jokes

I recently came across an article in PLoS Medicine (Ioannidis, 2005), which argued that most published research findings are false. The article goes on to explain many of the factors that affect whether a study comes to the correct conclusion. Though it was published about seven years ago, it’s been circulating once again, because the points it makes are still important and relevant. And given some recent, high-profile instances of fabricated research findings (see previous blog post), it’s important to keep in mind that simply because a particular finding is not replicable doesn’t automatically mean the researcher(s) made up stuff. There are many logical reasons why a researcher may find something that simply isn’t true, through no fault of the researchers or the study design.

I first want to offer the caveat that Ioannidis examined quantitative research. The issues affecting the accuracy of qualitative research are different (I won’t say non-existent, because qualitative research is definitely not infallible, just that these particular results really only apply to studies done where the collected data are numerical).

The underlying concept the article is getting at here is validity, defined as truth or, more specifically, whether the conclusions drawn from a study are a correct, accurate reflection of the topic under study. Though we can never really know the truth, we can get at it through many different types of research, performed in different settings, with different people, etc. Validity is a big concept that encompasses many different types of truth. In research, we think of four types of validity: internal, external, construct, and statistical conclusion.

Most people can understand the concept of validity, but occasionally struggle with the four types. Therefore, I’m going to use one hypothesis to show the various kinds of validity. This hypothesis comes from a conversation I was having with a friend one day. I told a joke I had recently heard, quite a dirty one, and afterward said I should probably keep my telling of dirty jokes to a minimum. To which my friend replied, “You can never have too many dirty jokes.” And of course, being a scientist, I said, “I think we should empirically test that hypothesis.” Little did my friend know, I was only half joking.

So let’s say I wanted to design a study to test this hypothesis. First, I’d need to alter the hypothesis somewhat, unless I’m willing to allow an infinite number of dirty jokes (because I doubt you could actually set up a study to test a “never” contingency), but I’d want to get at the underlying topic of number of allowable dirty jokes. I would have to set up a situation where I could determine at what point someone hearing the dirty jokes requests that they stop. I’d have to pick a certain setting to conduct this study, and have at least two people there (perhaps more): one to tell the dirty jokes, and one to listen and determine when the jokes should stop. I’d have to make sure the joke-teller has enough dirty jokes in his/her repertoire so that the experiment could go on as long as needed - so that the only person calling a halt to the jokes is the listener (or listeners) - but would probably set up a time or number-of-jokes limit so that the participants (and the researchers, for that matter) aren't stuck there forever. I might also want to add another condition, where the joke-teller tells clean jokes; it’s possible that people just get fatigued listening to jokes in general, so we’d want to determine if there’s something different about dirty jokes that may increase or decrease the number a person is willing to hear before saying enough.

All of the above would help us to establish strong internal validity, certainty that our independent variable (the jokes) actually caused our dependent variable (the request to stop telling jokes). If I didn’t have the additional, clean-joke condition, I could still test at what point the person hearing dirty jokes asks they stop, but I’d be less certain it was the dirty jokes causing the request, rather than jokes in general (or just being forced to listen to one person talk for a long time, another potential comparison condition).

Okay, so imagine that I did this study with people hearing dirty jokes from someone (one-on-one, so there was only one joke-teller and one joke-hearer) and other people hearing clean jokes. Let’s say they were randomly assigned to hear either clean or dirty jokes, so that we could expect any additional characteristics affecting our outcome (e.g., poor sense of humor, intolerance for sexual references, etc.) would be evenly divided across groups. And let’s say I found that, on average, people are willing to hear 5 dirty jokes before asking the joke-teller to stop (compared to, say, 10 clean jokes).
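The balancing act of random assignment is easy to sketch in code. Here’s a minimal simulation (pure Python standard library; all the numbers, including the “humor tolerance” trait, are invented purely for illustration):

```python
import random
import statistics

random.seed(0)  # fixed seed so the illustration is reproducible

# Hypothetical participant pool: each value is some unmeasured trait
# (say, baseline tolerance for off-color humor) that could affect
# how many jokes a person is willing to sit through.
participants = [random.gauss(50, 10) for _ in range(200)]

# Random assignment: shuffle the pool, then split into the two conditions.
random.shuffle(participants)
dirty_group = participants[:100]
clean_group = participants[100:]

# With random assignment, the trait tends to be balanced across groups,
# so any difference in the outcome is more plausibly due to the joke type.
print(round(statistics.mean(dirty_group), 1))
print(round(statistics.mean(clean_group), 1))
```

The larger the sample, the closer the two group means tend to be; with small groups, a fluke imbalance (all the good senses of humor landing in one condition) is much more likely.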

Does this mean that, if I’m at a family reunion with my rather large family, I know I can probably get away with 5 dirty jokes before someone says, “Okay, Sara, that’s enough. You mean to tell us we helped you through grad school so you could become a female Patton Oswalt?” Not necessarily. Remember, I did the study in a one-on-one situation. My results may not generalize to group situations. This is the notion of external validity, the degree to which the findings of a study generalize to other people or situations. It doesn’t mean my results are wrong if I find that at my family gathering, I can tell 20 jokes before someone says, “Okay, that’s probably enough.” It may just mean that groups are different from individual people.

I’d want to do another study using groups instead of individuals, to examine how the effect may differ. I may find that certain groups (e.g., my family) are more tolerant of dirty jokes and allow a greater number to be told than other groups (e.g., my fellow congregants at Sunday mass), and may even find that the same people can be more or less tolerant of dirty jokes depending on our current situation (such as telling jokes to fellow congregants while at church versus telling the same people jokes while we’re out at the bar).

One thing that is important for any of the studies discussed above is how I’m defining my variables. What exactly do I mean by “dirty jokes”? Do I mean jokes with foul language? Sexual content? Something else? Once again, if I do a study and find that people are quite tolerant of dirty jokes and allow a dozen to be told before saying “enough”, and another researcher finds the number to be much lower (say three), it doesn’t necessarily mean one of us did a poor study. Even if we both did the study in the same situation, with the same types of people, we might find different results if we defined “dirty jokes” differently. And while we could probably think of multiple good definitions of “dirty joke”, some definitions are better than others. If, in my study, I defined “dirty jokes” as jokes about dirt and mud, then that could be a big reason for my different results; the way I defined the construct “dirty joke” was not very accurate, so the construct validity is low.

[Image caption: If this is your idea of a "dirty joke", you should check out Sesame Street's True Mud sketch.]

Finally, statistical conclusion validity refers to whether I analyzed my data with the correct statistical techniques. Probably most people are with me until this point in the validity lesson, because when I mention statistics, I see eyes start to glaze over. To put this in the most basic way, math has rules (in statistics, we call them assumptions, but they amount to the same thing). If we don’t follow those rules, we get the wrong answer, like if we start adding, subtracting, and multiplying a long string of numbers without following the proper order of operations (remember PEMDAS? - parentheses, exponents, multiplication and division, addition and subtraction; you have to deal with numbers in parentheses before numbers outside, handle multiplication and division, left to right, before any addition and subtraction, etc.). If a number has a decimal point in front of it, we can’t ignore it and pretend it’s a whole number, and if we’re told to add a negative number to a value, we can’t ignore the negative sign. [And if you want to argue that negative numbers don’t actually exist, so why should you have to learn to do math with them? Obviously you’ve never had student loans.]
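To see how much the rules matter, here’s a two-line illustration in Python, which follows the standard order of operations:

```python
# Python applies the standard order of operations: multiplication
# binds tighter than addition, so these two expressions differ.
with_rules = 2 + 3 * 4        # 3 * 4 first, then add 2
ignoring_rules = (2 + 3) * 4  # forcing naive left-to-right evaluation
print(with_rules, ignoring_rules)
```

Same three numbers, same two operators, two different answers (14 versus 20) - which is exactly what happens when a statistical test is applied to data that break its rules.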

The same thing can be said about statistics; if I ignore the rules on when I can use a specific statistical formula and use it anyway, my results could be incorrect. For example, one assumption of many tests is that the dependent variable (the outcome) is normally distributed (i.e., the “bell curve” - this is why, in any stats class, the normal distribution is one of the first things you learn; it’s the underlying assumption of most of the tests you learn in those classes). If we want to use one of those tests, and our dependent variable is skewed, we may draw the wrong conclusion from our results.
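As a quick, purely illustrative sketch (standard-library Python, made-up numbers): in a skewed distribution, the mean gets pulled toward the long tail, which is one reason tests built on normal-distribution assumptions can mislead.

```python
import random
import statistics

random.seed(1)  # fixed seed so the illustration is reproducible

# Simulate a right-skewed outcome: most listeners quit after a few jokes,
# but a patient few sit through a great many (roughly exponential shape).
jokes_tolerated = [int(random.expovariate(1 / 5)) for _ in range(1000)]

# The long right tail pulls the mean above the median, so a summary
# (or a test) that leans on the mean describes a "typical" listener
# who doesn't really exist.
print(f"mean = {statistics.mean(jokes_tolerated):.1f}, "
      f"median = {statistics.median(jokes_tolerated)}")
```

With a symmetric, bell-shaped outcome the two numbers would nearly coincide; the gap here is the skew that violates the normality assumption.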

Of course, even if you do a study in the best, most controlled, most accurate way possible, you might still draw the wrong conclusion. Sometimes weird stuff happens: even with random assignment, we might have some fluke where all the people with good senses of humor end up in one group. Or I might do the study on my family on a really good day, when they’re willing to hear way more dirty jokes than they would on any other day, meaning my results apply not just to my particular family, but only to my family on a very special kind of day. This is why we keep studying a topic, even if many others have already studied it. And we can’t just limit ourselves to one type of research, such as lab studies with lots of control and random assignment to groups. If we study a topic in many different ways (lab studies, observational studies, interviews) and find generally the same results in all of them, we can be even more certain our conclusions are accurate, and that we’ve gotten close to finding that elusive concept of truth. And recognize that things can go wrong. It’s not the end of the world; just keep studying and have a good sense of humor.

Thoughtfully yours,
Sara
