Sunday, January 15, 2012

False Research Findings, Truth, and Dirty Jokes

I recently came across an article in PLoS Medicine (Ioannidis, 2005), which concluded that most published research findings are false. Ioannidis went on to explain many of the factors that affect whether a study comes to the correct conclusion. Though this article was published about seven years ago, it’s been circulating once again, because the points it makes are still important and relevant. And given some recent, high-profile instances of fabricated research findings (see previous blog post), it’s important to keep in mind that a finding that fails to replicate doesn’t automatically mean the researcher(s) made stuff up. There are many logical reasons why a researcher may find something that simply isn’t true, through no fault of the researcher or the study design.

I first want to offer the caveat that Ioannidis examined quantitative research. The issues affecting the accuracy of qualitative research are different (I won’t say non-existent, because qualitative research is definitely not infallible, just that these particular results really only apply to studies done where the collected data are numerical).

The underlying concept Ioannidis is trying to get at here is validity, defined as truth or, more specifically, whether the conclusions drawn from a study are a correct, accurate reflection of the topic under study. Though we can never really know the truth, we can get at it through many different types of research, performed in different settings, with different people, etc. Validity is a big concept that encompasses many different types of truth. In research, we think of four types of validity: internal, external, construct, and statistical conclusion.

Most people can understand the concept of validity, but occasionally struggle with the four types. Therefore, I’m going to use one hypothesis to show the various different kinds of validity. This hypothesis comes from a conversation I was having with a friend one day. I told a recently heard, and quite dirty, joke, and afterward, said I should probably keep my telling of dirty jokes to a minimum. To which my friend replied, “You can never have too many dirty jokes.” And of course, being a scientist, I said, “I think we should empirically test that hypothesis.” Little did my friend know, I was only half joking.

So let’s say I wanted to design a study to test this hypothesis. First, I’d need to alter the hypothesis somewhat, unless I’m willing to allow an infinite number of dirty jokes (because I doubt you could actually set up a study to test a “never” contingency), but I’d want to get at the underlying topic of number of allowable dirty jokes. I would have to set up a situation where I could determine at what point someone hearing the dirty jokes requests that they stop. I’d have to pick a certain setting to conduct this study, and have at least two people there (perhaps more): one to tell the dirty jokes, and one to listen and determine when the jokes should stop. I’d have to make sure the joke-teller has enough dirty jokes in his/her repertoire so that the experiment could go on as long as needed - so that the only person calling a halt to the jokes is the listener (or listeners) - but would probably set up a time or number-of-jokes limit so that the participants (and the researchers, for that matter) aren't stuck there forever. I might also want to add another condition, where the joke-teller tells clean jokes; it’s possible that people just get fatigued listening to jokes in general, so we’d want to determine if there’s something different about dirty jokes that may increase or decrease the number a person is willing to hear before saying enough.

All of the above would help us to establish strong internal validity: certainty that our independent variable (the jokes) actually caused our dependent variable (the request to stop telling jokes). If I didn’t have the additional, clean-joke condition, I could still test at what point the person hearing dirty jokes asks that they stop, but I’d be less certain it was the dirty jokes causing the request, rather than jokes in general (or just being forced to listen to one person talk for a long time, another potential comparison condition).

Okay, so imagine that I did this study with some people hearing dirty jokes from someone (one-on-one, so there was only one joke-teller and one joke-hearer) and other people hearing clean jokes. Let’s say they were randomly assigned to hear either clean or dirty jokes, so that we could expect any additional characteristics affecting our outcome (e.g., poor sense of humor, intolerance for sexual references, etc.) to be evenly divided across groups. And let’s say I found that, on average, people are willing to hear 5 dirty jokes before asking the joke-teller to stop (compared to, say, 10 clean jokes).
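For anyone who likes to see random assignment in action, here is a minimal sketch in Python. Everything in it is invented for illustration: the participant list, the 0-10 "tolerance" score, and the function name `randomly_assign` are assumptions, not part of any real study. The idea is simply that shuffling people before splitting them tends to spread a hidden trait roughly evenly across the two conditions.

```python
import random
from statistics import mean


def randomly_assign(participants, seed=None):
    """Shuffle participants and split them into two conditions of (nearly) equal size."""
    rng = random.Random(seed)
    shuffled = list(participants)  # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"dirty": shuffled[:half], "clean": shuffled[half:]}


# Hypothetical participants: (id, hidden tolerance score on a made-up 0-10 scale).
score_rng = random.Random(42)
participants = [(i, score_rng.uniform(0, 10)) for i in range(200)]

groups = randomly_assign(participants, seed=7)

# With random assignment, the hidden trait should average out similarly in both groups.
for condition, members in groups.items():
    average = mean(score for _, score in members)
    print(f"{condition}: n={len(members)}, mean tolerance={average:.2f}")
```

The fixed seeds are only there so the sketch gives the same split every time it runs; a real study would not hard-code them.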

Does this mean, if I’m at a family reunion, with my rather large family, I know I can probably get away with 5 dirty jokes before someone says, “Okay, Sara, that’s enough. You mean to tell us we helped you through grad school so you could become a female Patton Oswalt?”? Not necessarily. Remember, I did the study in a one-on-one situation. My results may not generalize to group situations. This refers to the notion of external validity, the degree to which the findings of a study can generalize to other people or situations. It doesn’t mean my results are wrong if I find that at my family gathering, I can tell 20 jokes before someone says, “Okay, that’s probably enough.” It just may mean that groups are different than individual people.

I’d want to do another study using groups instead of individuals, to examine how the effect may differ. I may find that certain groups (e.g., my family) are more tolerant of dirty jokes and allow a greater number to be told than other groups (e.g., my fellow congregants at Sunday mass), and may even find that the same people can be more or less tolerant of dirty jokes depending on our current situation (such as telling jokes to fellow congregants while at church versus telling the same people jokes while we’re out at the bar).

One thing that is important for any of the studies discussed above is how I’m defining my variables. What exactly do I mean by “dirty jokes”? Do I mean jokes with foul language? Sexual content? Something else? Once again, if I do a study and find that people are quite tolerant of dirty jokes and allow a dozen to be told before saying “enough”, and another researcher finds the number to be much lower (say three), it doesn’t necessarily mean one of us did a poor study. Even if we both did the study in the same situation, with the same types of people, we might find different results if we defined “dirty jokes” differently. And while we could probably think of multiple good definitions of “dirty joke”, some definitions are better than others. If, in my study, I defined “dirty jokes” as jokes about dirt and mud, then that could be a big reason for my different results; the way I defined the construct “dirty joke” was not very accurate, so the construct validity is low.

If this is your idea of a "dirty joke", you should check out Sesame Street's True Mud sketch.
Finally, statistical conclusion validity refers to whether I used statistics correctly to analyze my data. Probably most people are with me until this point in the validity lesson, because when I mention statistics, I see eyes start to glaze over. To put this in the most basic way, math has rules (in statistics, we call them assumptions, but they amount to the same thing). If we don’t follow those rules, we get the wrong answer, like if we start adding, subtracting, and multiplying a long string of numbers without following the proper order of operations (remember PEMDAS? - parentheses, exponents, multiplication, division, addition, subtraction; you have to deal with numbers in parentheses before numbers outside, multiply and divide before you add and subtract, etc.). If a number has a decimal point in front of it, we can’t ignore it and pretend it’s a whole number, or if we’re told to add a negative number to a value, we can’t ignore the negative sign. [And if you want to try to make the argument that negative numbers don't actually exist, so why should you have to learn to do math with them?, obviously you've never had student loans.]
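If you want to see the "rules matter" point with actual numbers, here is a tiny Python sketch. The numbers and variable names are made up purely for illustration; the only point is that the same digits give different answers depending on whether the rules are followed.

```python
# Following the order of operations: multiply first, then add.
following_the_rules = 2 + 3 * 4

# Ignoring the rules and working strictly left to right: add first, then multiply.
ignoring_the_rules = (2 + 3) * 4

# Respecting versus dropping a negative sign when adding.
with_negative_sign = 10 + (-4)
without_negative_sign = 10 + 4

print(following_the_rules, ignoring_the_rules)      # 14 20
print(with_negative_sign, without_negative_sign)    # 6 14
```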

The same thing can be said about statistics; if I ignore the rules on when I can use a specific statistical formula and use it anyway, my results could be incorrect. For example, one assumption of many tests is that the dependent variable (the outcome) is normally distributed (i.e., the “bell curve” - this is why, in any stats class, the normal distribution is one of the first things you learn; it’s the underlying assumption of most of the tests you learn in those classes). If we want to use one of those tests, and our dependent variable is skewed, we may draw the wrong conclusion from our results.
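As a rough illustration of checking that assumption, the sketch below computes a simple skewness statistic for two small, entirely made-up samples of "jokes tolerated before saying enough". The function name `sample_skewness` and the data are assumptions for the example, and a value near zero only suggests symmetry; it is not a full test of normality.

```python
from statistics import mean, pstdev


def sample_skewness(values):
    """Average cubed z-score: near 0 for symmetric data, positive for a long right tail."""
    center = mean(values)
    spread = pstdev(values)  # population standard deviation
    return sum(((x - center) / spread) ** 3 for x in values) / len(values)


# Invented data: jokes tolerated by seven listeners in each of two hypothetical samples.
roughly_symmetric = [3, 4, 5, 5, 5, 6, 7]
right_skewed = [1, 1, 2, 2, 3, 4, 20]  # one listener with endless patience

print(f"symmetric sample skewness: {sample_skewness(roughly_symmetric):.2f}")
print(f"skewed sample skewness:    {sample_skewness(right_skewed):.2f}")
```

In the skewed sample, one very tolerant listener drags the mean well above what is typical for the group, which is exactly the situation where a test that assumes a bell curve can mislead.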

Of course, even if you do a study in the best, most controlled, most accurate way possible, you might still draw the wrong conclusion. Sometimes weird stuff happens: even with random assignment, we might have some weird fluke where all the people with good senses of humor end up in one group. Or I might do the study on my family on a really good day, when they’re willing to hear way more dirty jokes than they would on any other day, meaning my results are limited not just to my particular family, but to my family on a very special kind of day. This is why we keep studying a topic, even if many others have already studied it. And we can’t just limit ourselves to one type of research, such as lab studies with lots of control and random assignment to groups. If we study a topic in many different ways (lab studies, observational studies, interviews) and find generally the same results in all of them, we can be even more certain our conclusions are accurate, and that we’ve gotten close to finding that elusive concept of truth. And recognize that things can go wrong. It’s not the end of the world; just keep studying and have a good sense of humor.
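Just how likely is that kind of fluke? Here is a small sketch that works it out exactly by counting the possible splits, under assumptions I've invented for the example: 20 participants, two groups of 10, and 4 people with good senses of humor. The function name `prob_all_in_one_group` is hypothetical.

```python
from math import comb


def prob_all_in_one_group(total, group_size, special):
    """Chance that every 'special' person lands in the same group when `total`
    people are split at random into one group of `group_size` and the rest."""
    other_size = total - group_size
    # Splits where the first group contains all of the special people.
    all_in_first = comb(total - special, group_size - special) if special <= group_size else 0
    # Splits where the second group contains all of the special people.
    all_in_second = comb(total - special, other_size - special) if special <= other_size else 0
    return (all_in_first + all_in_second) / comb(total, group_size)


# Assumed scenario: 20 participants, 10 per group, 4 with a good sense of humor.
fluke = prob_all_in_one_group(total=20, group_size=10, special=4)
print(f"Chance all four end up together: {fluke:.3f}")
```

Under those made-up numbers the fluke is uncommon but far from impossible, which is one more reason a single study shouldn't be the last word.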

Thoughtfully yours,

Thursday, January 12, 2012

Dr. Pepper Ten: The Product is “Not for Women”, But the Commercials Are

No doubt, you’ve seen and heard about Dr. Pepper’s new soda, Dr. Pepper Ten, a low-calorie beverage that, unlike diet sodas, uses real sugar. And you’ve probably heard their commercials that feature manly men talking about action movies, duct tape, and bacon.

Mmmm, bacon…

Sorry, where was I? Oh, yes, the commercials. The testosterone-infused advertising is a response to research showing that men are not interested in drinking diet sodas because they are perceived as being “girly” (find out more here). This soda was also developed to be a low-calorie option that didn’t taste like diet soda, because many people have issues with the taste of artificially sweetened beverages.

Word. There are few flavors in this world I dislike as much as artificial sweetener.

So in order to cater to men who want a diet beverage they can feel comfortable drinking with the guys, Dr. Pepper created Ten and created ads (likely spending millions of dollars on said ad campaign) that focus on men.

But they don't.

Listen to the ads. They’re always addressing women, without any statements toward men. Rather than saying, “Hey guys, want a beverage that recognizes your desire to be calorie conscious without all the estrogen? Try Dr. Pepper Ten.”, they start out the ads with, “Ladies…” and go on to explain to women why this beverage isn’t for them.

“Hey, ladies! This soda? Not for you…
 Wait, where are you going? I wasn’t finished explaining why this soda isn’t for you.”

Perhaps the aim is to remind guys of their days building clubhouses with their friends and putting up the “No girls allowed” sign (rather than a “Boys only” sign, which would have made a lot more sense). It’s also possible that the goal is to get women interested in trying the soda, because of the way people respond to being told not to do something. Specifically, they may be trying to elicit psychological reactance.

Humans are motivated to believe they have free will, as in control over their actions (whether you actually have free will – well, that’s something philosophers have been arguing about forever, so we won’t even go there right now). When someone tells you not to do something, your free will is threatened, and so you will behave in a way to reaffirm your sense of free will; the best way to do that is to do the thing you were just told not to do.

Parents are very familiar with this concept.

And I’ll admit, one thing that really drives me nuts is being told I am not allowed to do something or am even incapable of doing something (especially things that are learned) by virtue of my genitalia. Because apparently, the ability to change my oil, troubleshoot my computer, and hammer a nail is tied to the Y chromosome. “No point in teaching a woman to do any of those things. She’d never be able to learn it. So I’m going to avoid teaching her those things just to prove my point.” <sarcasm>Wow, your logic is infallible.</sarcasm… for now>

There’s a reason that social scientists insist on using the term “gender” in research. It’s not that we have an aversion to the word “sex”; it’s that we recognize “sex” is a biological term, whereas “gender” is a social term. Yes, because I am a woman, I have been shaped to behave in certain ways and believe certain things (and this perspective is also why I’m writing this blog entry and focusing on these issues). At the same time, I have my own unique set of traits, abilities, beliefs, and attitudes that were shaped by a variety of factors, not just the fact that I am a woman. The same is true for everyone; we were all shaped to be the way we are by our unique experiences, and throwing us all into one big category doesn’t make us all the same. Just like calling a calorie “manly” doesn’t make it so.

My point is that perhaps they’re posting the “No Girls Allowed” sign while secretly hoping the girls will come around. And if that were my only reaction to Dr. Pepper Ten, I might just say, “What the heck, I’ll try it.”

After all, torque is a rather fascinating word.

There’s more to it than that, of course. Not only is Dr. Pepper Ten dragging out every gender stereotype possible, which has some documented effects on women’s performance in certain domains (see previous post), this issue of diet soda and gender has many more ramifications.

One of the reasons diet soda is so popular with women is our society’s focus on women’s bodies and the stigma associated with female overweight and obesity.

What stigma are men concerned about? Apparently, being seen drinking diet soda in public.

Forgive me if I’m not feeling too sympathetic, guys.

In all seriousness, I know that body image is also a serious concern for men, and have known more than one man who developed an eating disorder in response to pressures to look a certain way. Even so, women are constantly bombarded with messages to be thin, not just through the media, but in the fashion world overall. Clothing is often designed with thinner women in mind, and simply sized up to fit larger women; of course, the styles that look good on thinner women often differ from styles that look good on larger women, so this “sizing up” doesn’t necessarily allow women in larger sizes to look, and more importantly feel, good. And the messages come from our peers, too, even other women, who are often the worst offenders in making women feel bad about how they look.

I’d like to take a moment to thank those people who go out of their way to make me feel fat. At the very least, you’ve proven to me that being thin doesn’t make you happy or a good person.

And honestly, research has shown that no one really likes the word “diet”. In fact, some weight management programs are exploring new titles, like “wellness-focused”, and finding that people still have positive weight loss outcomes without needing to include words like “diet” and “weight”. Dr. Pepper Ten could probably still be a successful beverage because it doesn’t use words like diet, instead focusing on being a lower-calorie alternative that (presumably) tastes like a non-diet drink.

But at the end of the day, what Dr. Pepper Ten’s advertising makes me think of – besides, “Come on, aren’t we all smarter than this?” – is the Monty Python “Lumberjack” song, where the manly lumberjack suddenly discusses how much he enjoys wearing high heels and a bra. Yeah, the Dr. Pepper Ten commercials are just like that except, you know, not nearly as funny.

Men: I’m interested in hearing what you think about Ten, and what you think about an advertising campaign that is supposed to be all about you without actually addressing you directly. Does their need to preface words like “calories” with “manly” make any difference? Or do you find the commercials as idiotic and irritating as I do?

Thoughtfully yours,

Monday, January 2, 2012

An Open Letter to Calphalon: The Importance of Operational Definitions

Dear Calphalon,

I've been using your products ever since I received a set of Calphalon pots and pans as a wedding present, about a year and a half ago. Though there are many things about your products to love - attractive, interchangeable lids to fit every sauce pan and skillet, oven safe - the "non-stick" aspect is laughable. Not only do I have to use obscene amounts of olive oil to prevent my food from enacting a death grip on your product, even then, the food is practically pulled apart as I try to pry it off.

Now, I was about to just post a snarky message on Facebook about the lack of non-stick on a non-stick product and leave it at that, but realized that there must be a logical explanation for the problems I'm having with your product. And, as I thought about it scientifically, I realized what we have here is a difference in operational definitions.

"Operational definitions?", you say, "What are those?" Allow me to explain.

Operational definitions are definitions that allow a concept to be measured or manipulated. In research, especially social science research, we often try to study variables that are elusive, like love, intelligence, and aggression. We can't simply hold a ruler up to someone and say, "Their love score is 18." We have to define how we will determine a person's "love score", or IQ, or whatever we're studying, whether that be through a standardized measure, observation of behavior, or some other way. In fact, if you look around, operational definitions are everywhere, because we regularly measure things, even outside of research, that must first be defined.

For example, everyone who lives in the state of Illinois can tell you the operational definition of "intoxicated".

Look familiar?
The signs are posted on highways throughout the state, so we know that the operational definition of intoxicated in Illinois is a blood alcohol level of .08 or higher (and currently, that's the legal limit in all US states, though in the past, there have been some differences in how states have defined intoxicated).

There are other operational definitions floating around out there. For example, the Seinfeld episode in which the gang debated whether soup was a meal involved a discussion of what is (and is not) a meal. And many people have debated what makes something a "date" - for example, what activities should be involved, who should pay, time of day, and so on. Ever been to a social gathering and heard someone say, "This isn't a party. It's not a party unless…"? Those are operational definitions.

A good operational definition should be clear enough that anyone can walk into your study (or conversation) and, based on the established definition, correctly identify a specific case. Of course, people may disagree on what makes a good operational definition - this is why operational definitions should be discussed and established before beginning a study. And for many variables, there are any number of possible operational definitions.

For example, blood alcohol level is one way to define intoxicated, but you could also go with the ability to walk a straight line or say the alphabet backwards. Different operational definitions, however, may cause you to come to different conclusions. A person who is unable to walk a straight line but has a blood alcohol level of .02 would be classified as intoxicated by the straight-line definition but sober by the blood alcohol definition. (And by knowing the two pieces of information - unable to walk a straight line but blood alcohol level of .02 - we can come to a different conclusion: sober but uncoordinated.)
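To make that concrete, here is a minimal Python sketch in which each operational definition is written as its own rule. The class and function names are invented for illustration; the .08 threshold is the Illinois limit mentioned above, and the person being classified is the hypothetical one from the example.

```python
from dataclasses import dataclass

LEGAL_BAC_LIMIT = 0.08  # the blood alcohol threshold posted on Illinois highways


@dataclass
class Observation:
    """What we recorded about one (hypothetical) person."""
    blood_alcohol: float
    can_walk_straight_line: bool


def intoxicated_by_bac(obs):
    """Operational definition 1: blood alcohol at or above the legal limit."""
    return obs.blood_alcohol >= LEGAL_BAC_LIMIT


def intoxicated_by_walk_test(obs):
    """Operational definition 2: unable to walk a straight line."""
    return not obs.can_walk_straight_line


# The person from the example: blood alcohol of .02, but can't walk a straight line.
person = Observation(blood_alcohol=0.02, can_walk_straight_line=False)

print("Intoxicated by blood alcohol level?", intoxicated_by_bac(person))
print("Intoxicated by straight-line test? ", intoxicated_by_walk_test(person))
```

Same person, same data, two different answers; the only thing that changed is which definition we agreed to use.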

And this is where our misunderstanding comes from, Calphalon. We have different operational definitions of non-stick. Mine is probably something like this:

Non-stick = Food can be removed with no pieces being ripped off

Yours must be something like this:

Non-stick = Food can be removed with great effort and large pieces being ripped off, so that my beautiful goat cheese-stuffed chicken breast looks more like chicken and goat cheese cobbler

See the problem? So based on my definition, your pan would not be considered non-stick, but based on your definition, it would. This is the problem with using undefined words like "non-stick". Now I'm wondering about that whole "oven-safe" bit. There probably isn't any room on your packaging to offer a good operational definition of your terms, but that's all right; you're more than welcome to put that information on your website. It would be most appreciated!

Thoughtfully yours,