## Monday, April 10, 2017

### H is for Histogram

Whenever I'm working with data, I always ask it the same question: how are you distributed?

How the data are distributed can affect what sort of analysis you can do, as well as whether any corrections are needed. If I'm analyzing a continuous variable - basically any variable where it would make sense to report a mean (average) - I most likely want it to look like our old friend, the normal distribution:

That's because most of our statistical tests are based on the assumption that the data look pretty close to the normal curve. All of our statistical tests are based on probability, and we know a lot about the probability of scores falling in certain parts of the normal curve. So if our data resemble that curve, we know something about the probability of scores occurring in our sample.
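To make "we know a lot about the probability" a bit more concrete, here is a minimal sketch in Python (not the tool used for this post's figures) that uses the standard library's `statistics.NormalDist`. The helper name `prob_within` is my own, purely for illustration. It computes the share of scores expected within 1, 2, and 3 standard deviations of the mean on a normal curve.

```python
from statistics import NormalDist

# Standard normal curve: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)

def prob_within(k):
    """Probability that a score falls within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} SD: {prob_within(k):.4f}")
# within 1 SD: 0.6827
# within 2 SD: 0.9545
# within 3 SD: 0.9973
```

These are the familiar "68-95-99.7" figures, and they only describe your sample well if the sample is roughly normal in the first place.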

How do we find out if our data are normally distributed? The first step is to graph them, by frequencies of different scores. When you create a plot that has scores on the x-axis (going across the bottom of the graph) and frequencies (counts) on the y-axis (going along the side), you are creating a histogram. In fact, the normal curve is essentially an idealized, smoothed-out histogram.
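Counting frequencies is all a histogram really does, so it can be sketched in a few lines. This is a rough Python illustration with made-up scores (the `scores` list is hypothetical, not data from this post). It prints a text version turned on its side, with scores running down the page and one `#` per person.

```python
from collections import Counter

# Hypothetical quiz scores for 12 people (made up for illustration).
scores = [3, 4, 4, 5, 5, 5, 5, 6, 6, 6, 7, 8]

# Frequency of each score: how many people got a 3, a 4, and so on.
frequencies = Counter(scores)

# One row per possible score, including any score nobody got.
for score in range(min(scores), max(scores) + 1):
    print(f"{score:>2} | {'#' * frequencies[score]}")
```

Even with a dozen values you can see the pile-up in the middle and the thinner tails on either side.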

You might be asking how a histogram differs from a bar chart. Bar charts are used for categorical data (things like gender, where it makes no sense to report a mean), while histograms are used for continuous data. Bar charts also have spaces between the bars, but histograms do not (because the variable being reported is continuous, and each value on the x-axis leads into the next). Depending on the range of values for the variable, the histogram might be displayed with each possible score having its own bar, or scores might be grouped together in "slices" or "bins."

Here's a histogram I created in R, using randomly generated data for 60 people:

As you can see by the x-axis, the data are "sliced" into ranges of 5. Here's the same data, with smaller slices (ranges of 2):

The distribution is still approximately the same, though slightly less "normal." There are other (better) ways to find out if your data are close enough to the normal distribution, but it's always good to start by eyeballing the data in this way.
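If you want to play with bin widths yourself, here is a small sketch of the idea in Python. It does not reproduce the data or plots from this post: the seed, the mean of 50, and the standard deviation of 10 are all assumptions I picked for illustration, and `bin_counts` is a hypothetical helper. It generates 60 random scores and counts them in bins of width 5 and then width 2.

```python
import random
from collections import Counter

random.seed(42)  # arbitrary seed so the sketch is repeatable

# 60 simulated scores; mean 50 and SD 10 are assumptions, not this post's values.
scores = [random.gauss(50, 10) for _ in range(60)]

def bin_counts(values, width):
    """Count how many values fall in each bin [start, start + width)."""
    counts = Counter(int(v // width) * width for v in values)
    return dict(sorted(counts.items()))

for width in (5, 2):
    print(f"bin width {width}")
    # Only non-empty bins are printed; empty bins show up as gaps in the sequence.
    for start, count in bin_counts(scores, width).items():
        print(f"{start:>4}-{start + width:<4} {'#' * count}")
```

With narrower bins, each bar holds fewer people, so random bumps and dips tend to stand out more and the shape usually looks more ragged, even though the underlying data are identical.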