## Thursday, May 11, 2017

### Bonus Statistics Post #1: A Note on Notation

So I've decided to just go for it and write some additional posts on statistical topics. If you have any questions or requests, you can add them in the comments below. Statistics are everywhere and you're welcome to share articles or topics and just ask for clarification or some analysis. Honestly, ask about anything statistical.

I think everyone has something that they are really into, really fixated on, that almost no one else cares about. Not only will they geek out about it, they get really frustrated when they perceive people approaching it in the wrong way. For me that thing is statistical notation. Yeah, I know, kind of weird. I doubt there are many other people who care as much about statistical notation. And I get really frustrated about the lack of (or perhaps perceived lack of) standardized notation in statistics.

Why do I care about this? I think for the same reason I love statistics - I appreciate rules. When dealing with variables and operators and slopes, rules help me understand how they should behave. It allows me to concretize abstract concepts. Statistics is very much about rules. For many statistical tests, we call them assumptions - the rules you have to follow if you want to use that statistical test with your data. I like rules. At least when it comes to math.

In the rest of my life, the only rule is there are no rules... I'm not at all convincing, am I?

To give you an idea about the issue, in graduate school, I was a teaching assistant for an undergraduate statistics course and a student in a graduate statistics course at the same time. Both classes used a textbook by the same author. I won't name names because I'm a nobody and can't afford to throw shade at a somebody just yet. But same author, no co-authors, and basically the same topics covered, though the graduate one was a bit more in-depth.

The notation wasn't even consistent between these two books.

I didn't realize this until I was asked to lecture for the undergraduate class, and had to deal with many confused looks from students when the notation I wrote on the board was different from what was in their books.

This all could have been solved if we had freaking standardized notation. Just sayin'.

I've been thinking about this more lately, now that I'm toying with the idea of writing a book about statistics. I've always had my own philosophy about notation; I've just never articulated it before.

I think the most important component of my approach to notation is to keep in mind what we are trying to accomplish. Whenever we collect and analyze data, we are using that data to represent populations. Sometimes that data comes from whole populations, and sometimes it comes from samples. As far as I've seen, the usual approach is to use Greek letters to represent population values (what we call parameters). It's the sample values (what we call statistics) that are notationally inconsistent (even within a single author's books, apparently). But when we collect data from a sample, we are using it as a stand-in for the population. Notation for statistics should reflect that connection back to parameters. My approach, then, is to use the Latin-alphabet equivalents of the Greek letters:

For instance, population standard deviation is represented by σ (sigma), so I use s to symbolize the sample standard deviation. (Always lowercase because, as you'll learn as you dig into statistics, capital letters also have a separate meaning.)
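To make the σ versus s distinction concrete, here's a minimal Python sketch. The scores and variable names are made up purely for illustration. It uses the standard library's `statistics.pstdev`, which treats the data as a whole population (dividing by N), and `statistics.stdev`, which treats the data as a sample (dividing by n - 1).

```python
import math
import statistics

# Hypothetical scores, invented for this illustration
scores = [2, 4, 4, 4, 5, 5, 7, 9]

# Population standard deviation (sigma): treats scores as the whole population
sigma = statistics.pstdev(scores)

# Sample standard deviation (s): treats scores as a sample drawn from a population
s = statistics.stdev(scores)

print(f"sigma = {sigma:.3f}")
print(f"s     = {s:.3f}")
```

With the same numbers, s comes out a little larger than σ, because the sample formula divides by n - 1 instead of N. The matching letters are a reminder that s is our estimate of σ.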

I've always felt this way about notation, but I really started thinking about it much more when I began learning structural equation modeling (SEM). Just like statistics more generally, SEM has multiple notation systems. However, unlike statistics more generally, SEM has only a limited number of notational approaches, and they're all clearly labeled (e.g., the LISREL approach, the Mplus approach, etc.). I learned LISREL, which uses many of the Greek letters you see above, and this is the approach I prefer. That's partly because it's how I learned, and we tend to think the first way is the right way (a primacy effect). But it's also because I recognize the connection between the analyses I conduct and the Greek letters that tie those analyses back to the population values we're attempting to estimate.

I'm mostly sharing this information to explain why I approach it the way I do and also as a warning that if you decide to learn statistics, the notation you learn may be wildly different depending on whose book you read, and I, for one, think that's a travesty. And as I continue writing about statistics, I'll probably start dropping in some notation. I'll explain that notation when I get there, but this will at least help you understand why I do it the way I do.