Sunday, October 8, 2017

Statistics Sunday: Why Is It Called 'Data Science'?

In one of the Facebook groups where I share my statistics posts, a user had an excellent question: "Why is it called data science? Isn't any science that uses empirical data 'data science'?"

I thought this was a really good point. Calling this one field data science implies that other scientific fields using data are not doing so scientifically or rigorously. And even data scientists recognize that there's a fair amount of 'art' involved in data science, because there isn't always a right way to do something - there are simply ways that are more justified than others. In fact, I just started working through a book on that very subject.

What I've learned digging into this field of data science, in the hopes of one day calling myself a data scientist, is that statistics is an integral part of the field. Further, data science is a team sport - it isn't necessary (and it may even be impossible) to be an expert in all the areas of data science: statistics, programming, and domain knowledge. As someone with expertise in statistics, I'm likely better off building additional knowledge in statistical analysis used in data science, like machine learning, and building up enough coding knowledge to be able to understand my data science collaborators with expertise in programming.

But that still doesn't answer our main question: why is it called data science? I think what it comes down to is that data science involves teaching (programming) computers to do things that once had to be done by a person. Statistics as a field has been around much longer than computers (and I mean the objects called computers, not the people who were once known as computers). In fact, statistics has been around even prior to mechanical calculators. Many statistical approaches didn't really need calculators or computers. It took a while, but you could still do it by hand. All that was needed was to know the math behind it. And that is how we teach computers - as long as we know the math behind it, we can teach a computer to do just about anything.

First, we were able to teach computers to do simple statistical analyses: descriptives and basic inferential statistics. A person can do this, of course; a computer can just do it faster. We kept building up new statistical approaches and teaching computers to do those analyses for us - complex linear models, structural equation models, psychometric approaches, and so on.
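To make "knowing the math behind it" concrete, here is a minimal sketch of what teaching a computer descriptives and a basic inferential statistic might look like. The function names and the `scores` data are made up for illustration; the formulas are the standard ones for the mean, the sample standard deviation, and the one-sample t statistic, written out the same way you would work them by hand.

```python
import math
import statistics


def describe(values):
    """Mean and sample standard deviation, computed step by step."""
    n = len(values)
    mean = sum(values) / n
    # Sample variance: sum of squared deviations from the mean, divided by n - 1.
    variance = sum((x - mean) ** 2 for x in values) / (n - 1)
    return mean, math.sqrt(variance)


def one_sample_t(values, mu0):
    """t statistic for testing whether the population mean equals mu0."""
    mean, sd = describe(values)
    standard_error = sd / math.sqrt(len(values))
    return (mean - mu0) / standard_error


scores = [4, 8, 6, 5, 7]  # hypothetical data, not from any real study
mean, sd = describe(scores)
t = one_sample_t(scores, mu0=5)

# Python's built-in statistics module should agree with the by-hand version.
print(mean, sd, statistics.stdev(scores), t)
```

Nothing here is beyond what a person could do with pencil and paper; the computer just does it faster and doesn't get bored on the ten-thousandth row.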

Then, we were able to teach computers to learn from relationships between words and phrases. Whereas before we needed a person to learn the "art" of naming things, we developed the math behind it and taught it to computers. Now we have approaches like machine learning, where you can feed in information to the computer (like names of paint shades or motivational slogans) and have the computer learn how to generate that material itself. Sure, the results of these undertakings are still hilarious and a long way away from replacing people, but as we continue to develop the math behind this approach, computers will get better.
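As a toy illustration of "feed in examples and have the computer generate its own," here is a small sketch using a character-level Markov chain. To be clear, this is a much simpler stand-in that I'm assuming for illustration, not the kind of model used for the paint-shade or slogan experiments, and the list of shade names is invented. The idea is the same in spirit, though: the program counts which letters tend to follow which, then samples new names from those counts.

```python
import random
from collections import defaultdict


def train(names, order=2):
    """Record which character follows each `order`-character context."""
    model = defaultdict(list)
    for name in names:
        # "^" pads the start of a name and "$" marks its end.
        padded = "^" * order + name.lower() + "$"
        for i in range(len(padded) - order):
            context = padded[i:i + order]
            model[context].append(padded[i + order])
    return model


def generate(model, order=2, max_len=30, rng=random):
    """Sample one new name, one character at a time."""
    context = "^" * order
    out = []
    while len(out) < max_len:
        nxt = rng.choice(model[context])
        if nxt == "$":  # the model decided the name is finished
            break
        out.append(nxt)
        context = context[1:] + nxt
    return "".join(out)


# Hypothetical training data: a handful of made-up paint shade names.
shades = ["Misty Rose", "Dusty Rose", "Misty Blue",
          "Dusty Sage", "Sage Green", "Rose Gold"]

model = train(shades)
sample = generate(model, rng=random.Random(0))  # seeded so runs are repeatable
print(sample)
```

With only six training names the output will mostly recombine pieces of the originals, which is part of why small experiments like this tend to be funny rather than useful.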

Related to this concept (and big thanks to a reader for pointing this out) is the movement from working with structured data to unstructured data. Previously, we needed a person to enter or reformat data before we could work with it; that's not necessary anymore.

So we've moved from teaching computers to work with numbers to teaching them to work with words (really any unstructured data). And now, we've also taught computers to work with images. You previously needed a person to go through pictures and tag them descriptively; today, a computer can do that. And as with text, computers are only going to get better and more nuanced in their ability to work with images.

Once we know the math behind it, we can teach a computer to work with basically any kind of data. In fact, during the conference I attended, I learned about some places that are working with auditory data, to get computers to recognize (and even translate in real-time) human languages. All of these were tasks that once needed a human, because we didn't know how to teach computers to do them for us. That's what data science is about. It still might not be a great name for the field, but I can understand where that name is coming from.

What are your thoughts on data science? Is there a better name we could use to describe it? And what do you think will be the next big achievement in data science?


  1. thanks, Sara, and I agree that 'data science involves teaching computers to do things that once had to be done by a person'. This implies it should be renamed, e.g. 'machine learning' or better 'machine teaching'.