## Sunday, April 30, 2017

### Z is for Z-Scores

I've been showing pictures of this guy all month:

Finally, it's time to talk about the standard normal distribution. Used to represent normally distributed variables in the population, the standard normal distribution has some very specific properties. And though I'd usually teach about this distribution early on in a statistics course, the nice thing about teaching it now, at the end, is that I can reference back to previous concepts you've learned. Honestly, you're probably going to understand this guy much better after having read the previous posts.

You may notice a character that looks a little like a u at the center of the distribution above. That is the lowercase Greek letter mu (μ), which is the population mean. So if someone talks about the mu of the distribution, you know right away that they're talking about population, not sample, values. Mu is always used specifically to talk about the mean, but in going back to the post on descriptive statistics, any normal distribution will have the same mean (average), median (middle score), and mode (most frequent score). So while mu will only be used to refer to the mean, the median and the mode will be equal to mu in a normal distribution.

You may also notice another character that looks like an o. This is lowercase sigma (σ), which refers to the population standard deviation. Once again, if you hear someone talking about the sigma of a distribution, you'll know they're referring to population values. (And if they refer to sigma-squared, they're talking about population variance.)

One of the specific properties of the standard normal distribution has to do with the proportion of scores falling within specific ranges. If a variable follows a normal distribution, about 68% of scores will fall between -1σ and +1σ of the mean. For instance, on a cognitive ability test normed to have a μ of 100 and a σ of 15, we know that about 68% of people will have a score between 85 and 115. Further, about 95% will have scores between -2σ and +2σ, and about 99.7% will have scores between -3σ and +3σ.
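If you'd like to check those proportions yourself, here is a minimal sketch using `NormalDist` from Python's standard library. The helper name `proportion_within` is my own, made up for illustration.

```python
from statistics import NormalDist

# Standard normal distribution: mu = 0, sigma = 1
z_dist = NormalDist(mu=0, sigma=1)

def proportion_within(k):
    """Proportion of scores falling between -k and +k standard deviations."""
    return z_dist.cdf(k) - z_dist.cdf(-k)

for k in (1, 2, 3):
    print(f"within +/-{k} sigma: {proportion_within(k):.4f}")
```

The three printed values come out to roughly 0.68, 0.95, and 0.997.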

Usually, when we talk about variables that follow the standard normal distribution, we convert the scores to a standardized metric - in fact, to talk about the standard normal distribution at all, you need to have standardized scores. In this case, we standardize scores so that the μ of the distribution is 0 and σ is 1. When we refer to individual scores falling on this distribution, we convert them to this standardized metric by subtracting the mean from the score and dividing by the standard deviation. For instance, a cognitive ability score of 85 using the values given above would have a standardized score of -1 ((85-100)/15). A person with a score of 100 would have a standardized score of 0.
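The conversion described above is a one-liner. Here is a small sketch of it; the function name `z_score` is just my label for the formula, not anything official.

```python
def z_score(score, mu, sigma):
    """Standardize a raw score: subtract the population mean,
    then divide by the population standard deviation."""
    return (score - mu) / sigma

# Cognitive ability test example from the text: mu = 100, sigma = 15
print(z_score(85, 100, 15))   # -1.0
print(z_score(100, 100, 15))  # 0.0
```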

We call these standardized scores Z-scores. Our scores are now in standard deviation units - a Z-score tells you how many standard deviations a score is from the mean. As long as you know the population mean and standard deviation, you can convert any score to a Z-score. Why would you want to do that? Well, not only do we know what proportion of scores falls within a certain range, but we can also find the exact proportion of scores falling below (or above) any Z-score.

You can find Z-distribution tables or calculators online that would give you these proportions. For instance, this page shows a table of Z-scores between -4.09 and 0, and also has a script at the bottom of the page where you can enter Z-scores and see where they fall on the curve. For instance, if you wanted to find out someone's percentile rank (the proportion of people whose score is equal to or less than the target score), you'd leave the left bound box at -inf, and enter a Z-score into the right bound box. We compute this score quite often at work, since we're working with normed tests. We just compute a Z-score from the mean and standard deviation, and we can quickly look up the person's percentile rank. You can also use this script to look at what proportion fall between certain scores (by setting both the left and right bounds to Z-scores), or what proportion of scores will be higher than a certain Z-score (by entering a Z into the left bound box and setting the right bound box to +inf).
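If you'd rather not use a table, the same three lookups can be sketched in Python with the standard library's `NormalDist`. The function names here are hypothetical ones I chose to mirror the left bound and right bound boxes described above.

```python
from statistics import NormalDist

z_dist = NormalDist()  # defaults to mu = 0, sigma = 1

def percentile_rank(z):
    """Proportion of scores at or below z (left bound = -inf)."""
    return z_dist.cdf(z)

def proportion_between(z_low, z_high):
    """Proportion of scores between two Z-scores."""
    return z_dist.cdf(z_high) - z_dist.cdf(z_low)

def proportion_above(z):
    """Proportion of scores above z (right bound = +inf)."""
    return 1 - z_dist.cdf(z)

# A test score of 115 with mu = 100 and sigma = 15 is a Z-score of 1
z = (115 - 100) / 15
print(f"percentile rank: {percentile_rank(z):.2f}")
print(f"between -1 and +1: {proportion_between(-1, 1):.2f}")
print(f"above: {proportion_above(z):.2f}")
```

A Z-score of 1 lands at about the 84th percentile, which leaves about 16% of scores above it.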

When we run statistical tests on population data - for instance, looking at differences between groups when population values are known - we use the standard normal distribution (also called the Z-distribution, especially in this context) to get our p-value. If we aren't working with samples, we don't worry about things like standard error and correcting for sample size. We have the entire population, so there's no sampling bias, because there's no sampling.
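As a rough sketch of how a Z value turns into a p-value, here is one way to do it in Python. I'm assuming a two-tailed test, which the post doesn't specify, and `two_tailed_p` is a name I made up for illustration.

```python
from statistics import NormalDist

z_dist = NormalDist()  # standard normal (Z) distribution

def two_tailed_p(z):
    """Two-tailed p-value (an assumption; the test could be one-tailed):
    the proportion of the Z-distribution at least as far from 0 as z,
    counting both tails."""
    return 2 * (1 - z_dist.cdf(abs(z)))

print(f"{two_tailed_p(1.96):.3f}")  # 0.050
```

This is why 1.96 shows up so often: it's the Z-score that cuts off 5% of the distribution across the two tails.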

When we start working with samples, we no longer use the Z-distribution. Instead, we use the t-distribution, which is based on Z. In fact, in my post on the Law of Large Numbers, I shared this picture, which should make more sense to you now:

Z isn't just used in this context though. Z-scores are also part of how Pearson's correlation coefficient (r) is computed. For the first two steps in computing a correlation coefficient by hand (a computer does the same thing, you just don't see any of the steps), scores are converted to Z-scores and each pair of Z-scores is multiplied together. This is how we get positive or negative correlations. If in general pairs of scores are either both negative (below the mean) or both positive (above the mean), the products will mostly be positive, and so will the correlation (as one increases so does the other). But if in general pairs of scores are flipped, where one is negative (below the mean) and the other positive (above the mean), the products will mostly be negative, and the correlation will be negative.
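Here is a minimal sketch of that computation in Python. It converts each variable to Z-scores, multiplies the pairs, and then averages the products, which is the remaining step in the by-hand formula. I'm using population standard deviations and dividing by n. Using the sample versions with n - 1 throughout gives the same r. The data and function names are made up for illustration.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Convert raw scores to Z-scores using the mean and
    (population) standard deviation of the values themselves."""
    m = mean(values)
    sd = pstdev(values)
    return [(v - m) / sd for v in values]

def pearson_r(xs, ys):
    """Pearson's r as the average product of paired Z-scores."""
    zx, zy = z_scores(xs), z_scores(ys)
    products = [a * b for a, b in zip(zx, zy)]
    return mean(products)

# Hypothetical data: y rises with x in the first case, falls in the second
xs = [1, 2, 3, 4, 5]
print(pearson_r(xs, [2, 4, 6, 8, 10]))   # very close to 1
print(pearson_r(xs, [10, 8, 6, 4, 2]))   # very close to -1
```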

Thanks for joining me this month! I hope you've enjoyed yourself and maybe learned something along the way!