Wednesday, April 26, 2017

V is for Venn Diagram

You've probably all seen Venn Diagrams before. In fact, they're often used to tell some great jokes:


What you may not know is that Venn Diagrams are one way to represent the concepts behind set theory, and that set theory has some important applications to statistics. Set theory involves mathematical logic. A set is a collection of objects - it could be people, animals, concepts, anything that can be grouped together. Logic statements are used to describe how sets relate to each other.

For instance, sets can be mutually exclusive, which is also known as being disjoint. This means that an object can be in only one of the two sets, never both. The Venn diagram above is an example. People either find Venn diagram jokes funny or they don't; they can't be in both sets at once.

Sets can also have some overlap, where an object can be a member of set 1 and set 2. This is referred to as intersection. In a typical Venn diagram, it's the part where the two circles overlap.


You can use Venn diagrams to describe set difference - the members that are in one set but not the other. This is different from mutual exclusivity; mutually exclusive sets make it impossible to be a member of both, while set difference refers simply to those cases that just happen to be in one set and not the other. (A related term, symmetric difference, covers everything that is in exactly one of the two sets, whichever one it is.) In the hipster Venn diagram above, the blue section is an example of set difference.
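If it helps to see these relationships outside of a picture, here is a minimal sketch using Python's built-in `set` type. The group names and members are invented for illustration and are not taken from any of the diagrams in this post.

```python
# Hypothetical groups; every name here is made up for illustration.
finds_it_funny = {"Ana", "Ben", "Cho"}
does_not_find_it_funny = {"Dee", "Eli"}

set_1 = {"Ana", "Ben", "Dee"}
set_2 = {"Ben", "Dee", "Eli"}

# Mutually exclusive (disjoint): the two sets share no members.
are_exclusive = finds_it_funny.isdisjoint(does_not_find_it_funny)

# Intersection: members of set 1 AND set 2 (where the circles overlap).
in_both = set_1 & set_2

# Set difference: members of set 1 that are not in set 2.
only_in_1 = set_1 - set_2

# Symmetric difference: members of exactly one of the two sets.
in_exactly_one = set_1 ^ set_2

print(are_exclusive)           # True
print(sorted(in_both))         # ['Ben', 'Dee']
print(sorted(only_in_1))       # ['Ana']
print(sorted(in_exactly_one))  # ['Ana', 'Eli']
```

Note that set difference depends on the order (`set_1 - set_2` is not the same as `set_2 - set_1`), while symmetric difference does not.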

You can also have subsets - these would be circles that are fully contained within a larger circle. For instance:


Set theory and Venn diagrams are ways to describe data. For instance, I recently did a survey for my choir to help with planning our benefit. One question asked which days of the week people would prefer (Thursday, Friday, or Saturday), and they were allowed to select more than one; I used a Venn diagram to display intersection and set difference among the day of week options.
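As a rough sketch of how a "select all that apply" question turns into sets, the snippet below builds one set of respondents per day and then picks out a few regions of the resulting three-circle Venn diagram. The responses are invented for illustration (they are not the actual survey data), and the helper name `chose` is hypothetical.

```python
# Hypothetical responses: each respondent maps to the set of days selected.
responses = {
    "singer_1": {"Thursday", "Friday"},
    "singer_2": {"Friday"},
    "singer_3": {"Friday", "Saturday"},
    "singer_4": {"Thursday", "Friday", "Saturday"},
    "singer_5": {"Saturday"},
}

def chose(day):
    """Return the set of respondents who selected the given day."""
    return {person for person, days in responses.items() if day in days}

thursday = chose("Thursday")
friday = chose("Friday")
saturday = chose("Saturday")

# Intersection: people who are fine with all three days.
all_three = thursday & friday & saturday

# Set difference: people who picked Friday and nothing else.
friday_only = friday - thursday - saturday

# A mix of both: picked Thursday and Friday, but not Saturday.
thu_fri_not_sat = (thursday & friday) - saturday

print(sorted(all_three))        # ['singer_4']
print(sorted(friday_only))      # ['singer_2']
print(sorted(thu_fri_not_sat))  # ['singer_1']
```

The size of each region (for example `len(friday_only)`) is the count you would write inside that part of the diagram.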

In fact, a few years ago, I started learning an analysis technique that is based on set theory: qualitative comparative analysis or QCA. As I said before in my post about beta, power (your ability to find a significant effect if one exists) is based in part on sample size. If you don't have enough cases in your sample, you might miss an effect. But sometimes, you may be studying something that is rare and your sample size has to be small. QCA works with small sample sizes and lets you explore relationships between characteristics and an outcome. Specifically, it helps you identify necessary and/or sufficient conditions to achieve whatever outcome you're interested in.

You've probably encountered those concepts before, but many people struggle with them because they're usually not described very well. Necessary means if the condition is absent, the outcome is absent. If you want to win the lottery, you have to have a lottery ticket. You can't win without a ticket, so that would be a necessary condition. Sufficient means if the condition is present, the outcome is present. If you have the winning lottery numbers, you win the lottery.

Things can be one but not the other. Having a lottery ticket doesn't automatically mean you'll win (unfortunately), so the ticket itself is necessary but not sufficient. Being a beagle is a sufficient condition for being a dog - because all beagles are dogs - but it isn't a necessary condition, because there are other kinds of dogs.
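In set terms, both ideas come down to subsets. A condition is sufficient when every case with the condition also has the outcome (the condition set sits inside the outcome set), and necessary when every case with the outcome also has the condition (the outcome set sits inside the condition set). Here is a minimal sketch of that logic in Python. The function names and the data are hypothetical, and this is only the simple all-or-nothing version of the idea, not what a QCA package actually computes.

```python
def is_sufficient(condition, outcome):
    """True if every case with the condition also has the outcome."""
    return condition <= outcome  # condition is a subset of outcome

def is_necessary(condition, outcome):
    """True if every case with the outcome also has the condition."""
    return outcome <= condition  # outcome is a subset of condition

# Hypothetical cases, invented for illustration.
beagles = {"Rex", "Bella"}
dogs = {"Rex", "Bella", "Max"}

ticket_holders = {"Ana", "Ben", "Cho"}
lottery_winners = {"Ben"}

print(is_sufficient(beagles, dogs))  # True: all beagles are dogs
print(is_necessary(beagles, dogs))   # False: Max is a dog but not a beagle

print(is_necessary(ticket_holders, lottery_winners))   # True: no win without a ticket
print(is_sufficient(ticket_holders, lottery_winners))  # False: Ana and Cho didn't win
```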

You could probably do a simple QCA by hand, though, as with most statistics, you're better off using a computer program. I've used a library built for the R statistical package to do my QCAs.
