Friday, April 20, 2018

R is for R Origin Story

An important place in the history of statistics is AT&T Bell Laboratories. And one of the key parts of that story is the development of a language for statistical computing called S.

Prior to 1975 or so, statistical researchers at Bell Labs used Fortran for their statistical computing. But from 1975 to 1976, John Chambers, Rick Becker, and Allan Wilks developed a language, written in Fortran, that would allow for more interactive analysis. Though they threw around many names, including SCS (Statistical Computing System), they settled on S - this is, after all, the same institution that brought you the C programming language, so single-letter names were kind of a thing. In 1988, they created New S, switching some of the internal functions from Fortran to C.

Out of S grew S-Plus, a commercial statistical computing language, which also arrived on the scene in 1988. And in 1993, an open source version of S and S-Plus appeared: R. The syntax of New S, S-Plus, and R is very similar - in fact, much of the same code will run in all three of these languages. But because R is open source, anyone who wants it can access it and its source code, under the GNU General Public License.

BTW, these are my GNU coding buddies - their names are Gus (left) and Gary (right), and yes, I can tell the difference between them. Gary has a wider face.

R is an interpreted language with features of object-oriented and functional programming. It, of course, can be used as a big calculator, but there are much more powerful things you can do with R. Personally, I haven't even begun to scratch the surface of some of these capabilities - both on this blog and as an R user. For instance, I still struggle with some of the programming concepts like creating loops.
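
To make the "big calculator" and loop ideas concrete, here is a minimal sketch. The variable names (total, squares) are made up for illustration, and this is just one way to write it, not the only one.

```r
# R as a big calculator: type an expression, get an answer
2 + 3 * 4        # usual order of operations
sqrt(16)

# A basic for loop: add up the numbers 1 through 5
total <- 0
for (i in 1:5) {
  total <- total + i
}
print(total)

# The same sum without writing a loop at all
print(sum(1:5))

# A taste of the functional side: apply a function to each element
squares <- sapply(1:5, function(x) x^2)
print(squares)
```

The loop and sum() give the same answer here. In R, a built-in vectorized function like sum() is usually the shorter route, which is one reason you can get a long way before loops become necessary.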

R is maintained by the R Development Core Team and supported by the R Foundation, a not-for-profit organization. Under the terms of the GNU GPL, anyone who wants can access the source code, make changes, and even distribute their version - though if they distribute it, they must also make their altered source code available. The purpose is to put power back in the hands of the people using the software and language. And anyone who wants can develop R packages that extend the capabilities of R.

I've had colleagues criticize this open source nature of R, because of concerns about quality issues. Anyone can write an R package. But this is a collaborative project and anyone can access source code, meaning if there are issues, someone is very likely to find them and fix them (or weed out the bad). Not to mention - knock on wood - I've never had any major issues with bugs in R, and yet I've purchased expensive proprietary software riddled with bugs. (Some of us may remember the SPSS version 16 release, which would often refuse to save your data files, or save them and then delete them. Sometimes you wouldn't know it until you went to open your data, only to find the file missing. That was probably what made me finally dump SPSS back in 2009. And it's also why, when I went back to using proprietary statistical software during my time at VA, I opted to learn Stata on my own rather than go back to SPSS. I use SPSS once or twice a week in my current job, but I'm trying to move us toward using R whenever possible.)

What are some things you should know about working with R?
  • When writing up results for publication, always clearly state that you used R for analysis and the exact version number. I'm currently running 3.4.3, but I believe 3.4.4 recently dropped. Be sure to provide a citation in your writeup as well.
  • Always state the exact R packages you used and provide citations for those as well - if there is a paper about the package available in a scholarly journal, cite that. For example, if you used lavaan, you'd want to provide a citation to this paper.
  • Write code for everything you do in R, especially when working with a single dataset, and save it as an R script. This keeps you from losing track of how you did things that you'll very likely have to recreate at some point, and it's invaluable to be able to lift from your own code for analyses on different datasets. Part of my motivation for writing this particular blog series is so I can easily refer back to posts to remind myself how to code something specific.
  • When using code someone else has written, change things. Break stuff. See what happens if you change the name of something or the number of loops or whatever. This is how you learn what parts of the code are fixed and what you can change for your needs. And of course, try combining code from different sources. You may end up with perfectly functioning code; you may end up with Frankenstein's monster. Either way, you'll learn something.
  • If you can't determine the meaning of an error message, try Googling it verbatim. Chances are, someone has written about that particular error.
  • Know that there is a steep learning curve and you will make mistakes and/or end up frustrated. Stick with it and know that there's a huge community out there willing to help you.
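
A few of the tips above can be sketched in code. This is a rough example using functions that ship with base R; the variable names (r_version, r_citation, error_text) are just placeholders I made up, and the lavaan part only runs if you happen to have that package installed.

```r
# The exact version string to quote in your writeup
r_version <- R.version.string
print(r_version)

# The recommended reference for R itself
r_citation <- citation()
print(r_citation)

# Package version; "stats" ships with R, so this runs anywhere
print(packageVersion("stats"))

# Version and citation for lavaan, only if it is installed
if (requireNamespace("lavaan", quietly = TRUE)) {
  print(packageVersion("lavaan"))
  print(citation("lavaan"))
}

# Capture an error message verbatim, ready to paste into a search engine
error_text <- tryCatch(
  1 + "a",                                 # deliberately broken
  error = function(e) conditionMessage(e)  # keep only the message text
)
print(error_text)
```

If you want everything in one shot, sessionInfo() prints the R version along with the versions of all attached packages, which is handy to paste at the bottom of a saved script.
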
Sound off, R users: what do you think people should know about working with R?
