Sunday, May 28, 2017

Statistics Sunday: Getting Started with R

For today's post, I'm going to get you started with using R. This will include installing, importing data from an external file, and running basic descriptives (and a t-test, because we're fancy).

But first, especially for statistics newbies, you may be asking - what the heck is R?

R is an open source statistical package, as well as the name of the programming language used to run analysis (and do some other fancy-schmancy programming stuff we won't get into now - but I highly recommend David Robinson's blog, Variance Explained, to see some of the cool stuff you can do with R). R comes with many statistical and programming commands by default, part of what's called the 'base' package. You can add to R's statistical capabilities by installing different libraries. Everything, including new libraries and documentation about these libraries, is open source, making R an excellent choice for independent scholars, students, and anyone else who can't blow lots of money on software.

R will run on multiple operating systems, so whether you're using Windows, Mac OS, or a distro of Linux, you'll be able to install and run R. To install R, navigate over to the Comprehensive R Archive Network (CRAN). Links to install are available at the top of the page. I just reinstalled R on my Mac, with the newest version (at the time of this writing) called "You Stupid Darkness" (aka: 3.4.0). If and when you write up any statistical analysis you ran in R, you'll want to report which version you used (this is true anytime you use software to run analysis, not just when you use R).
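If you're ever unsure which version you're running, you can ask R directly; both of these commands are built into base R:

R.version.string   # prints something like "R version 3.4.0 (2017-04-21)"
sessionInfo()   # fuller detail: R version, operating system, and any loaded packages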

After you install R, you'll also want to install R Studio. It's an excellent resource regardless of whether you're new to R or an advanced user.


R Studio organizes itself into four quadrants:
  1. Upper left - Any R scripts or markdown (for LaTeX lovers, like myself - future post!) files are displayed here. Code you write here can be saved for future use. Add comments (starting the line with #) to include notes with your code - see the short example after this list. This is great if you (or someone else) will revisit code later, and it's also helpful to remind yourself what you did if and when you write up your results. Highlight the code you want to run and click 'Run' to send it to the console.
  2. Lower left is the console. This is where active commands go. If you run code from a script above, it will appear here along with any output. You can also type code directly here but note that you can't save that code for later use/editing.
  3. Upper right records any variables, lists, or data frames that exist in the R workspace (that is, anything you've run that creates an object). There's also a history tab that displays any code you ran during your current session.
  4. Lower right is the viewer. You can view (and navigate through) folders and files on your computer, packages (libraries) installed, any plots you've created, and help/documentation.
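To give you a sense of what a commented script looks like, here's a tiny made-up example you could paste into the script window, highlight, and Run:

# Anything after a pound sign is a comment - R skips it when running the code
scores <- c(85, 79, 92, 88)   # create a small vector of test scores
mean(scores)   # print the average of those scores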
The great thing about R Studio is that you can access many things by clicking instead of typing into the console, which is all you get if you were to directly open R instead of R Studio. For some things, you'll find typing code is faster - such as to change your working directory, or load or install libraries. In fact, when I first started using R regularly, I was installing 4-5 libraries a day, which I briefly considered (half-jokingly) using as a measure of productivity. Now that I've reinstalled R on my Mac (because I completely wiped the hard drive and reinstalled - long story), I could actually collect these data instead of just joking about doing so.

But when you have to go through multiple steps for certain things - such as viewing the help for a specific command within a specific library - you'll find R Studio makes it much easier.

R Studio will also do some auto-complete and pop-up help when you type things into the script window, which is great if you can't quite remember what a command looks like. It can also tell when you're typing the name of a dataset or variable and will pop up a list of active data and variables. Super. Helpful.

Hopefully you were able to install these two programs (and if you haven't done so yet because you've been distracted by this love letter to R Studio, er, fantastically written post, do that now). Now, open R Studio - R will automatically load too, so you don't need to open both. By default, the whole left side of the screen will be console. Create a new script (by clicking the icon that looks like a white page with a green plus and selecting R Script, or by clicking File -> New File -> R Script) and the console will move down to make room.

The first thing I always do in a new script is change the working directory. Change it to whatever folder you'll be working with for your project - which can vary depending on what data you're working with. For now, start by downloading the Caffeine study file (our fictional study about the effect of caffeine on test performance, first introduced here), save it wherever you want, then change the working directory to that folder by typing setwd("directory") into the script (replacing directory with wherever the file is saved - keep the quotes and change any \ to /). (If you really don't want to type that code, in the lower right viewer, navigate to where you saved the file, then click More -> Set As Working Directory. The code you want will appear in the console, so you can copy and paste it into the script for future use.)
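As a concrete example (with a made-up path - substitute your own), the code might look like this; getwd() just confirms where R is currently pointed:

getwd()   # shows the current working directory
setwd("C:/Users/yourname/Documents/R_projects")   # point R at the folder holding caffeine_study.txt (example path)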

Let's read this file into R to play with. The file is saved as a tab-delimited file. R base has a really easy command for importing a delimited file. You'll want to give the dataset a name so you can access it later. Here's what I typed (but you can name the first part whatever you'd like):

caffeine<-read.delim("caffeine_study.txt", header=TRUE, sep="\t")
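
Before going any further, you can check that the import worked; these two base R commands show the first few rows and the structure of the new dataframe:

head(caffeine)   # first six rows of the dataset
str(caffeine)   # each variable's name, type, and a few example values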

You've now created your first object, which is a dataframe called "caffeine." The command that follows the object name tells R that the file has variable names in the first row (header=TRUE) and that the delimiter is a tab (sep="\t").

Now, let's get fancy and run some descriptive statistics and a t-test, recreating what you saw here. But let's make it easy on ourselves by installing our first package: the psych package*. Type this (either into your script, then highlight and Run, or directly into the console): install.packages("psych"). You just installed the psych package, which, among other things, lets you run descriptive statistics very easily. So type and Run this:

library("psych") (which loads the library you need for...)
describe(caffeine) (or whatever you named your data)

You'll get output that lists the two variables in the caffeine dataset (group and score), plus descriptive statistics, including mean and standard deviation. This is for the sample overall. You can get group means like this:

describeBy(caffeine, group="group")

Now you'll get descriptives first for the control group (coded as 0) and then the experimental (coded as 1). It will still give you descriptives for the group variable, which is now actually a constant, because the describeBy function is separating by that variable. So the mean will be equal to the group code (0 or 1) and standard deviation will be 0. You should have group 0 M = 79.27 (SD = 6.4) and group 1 M = 83.2 (SD = 6.21). Now, let's run a t-test. R's base package can run a t-test: t.test(DV ~ IV, data=nameofdata). So with the caffeine dataset it would be:

t.test(score ~ group, data=caffeine)

By default, R gives you a Welch's t rather than the standard Student t. Welch's test shifts your degrees of freedom slightly to account for unequal variances between the two groups (future post!); if you're comfortable assuming equal variances, add var.equal=TRUE to the command to get the Student version. Oh, and as I said previously, these data are fake, so don't try to publish or present any results.
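Here's what that small tweak to the same command would look like (only use it if the equal-variances assumption seems reasonable for your data):

t.test(score ~ group, data=caffeine, var.equal=TRUE)   # classic Student t, which assumes equal group variances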

R can read in many different types of data, including fixed width files, and files created by different software (such as SPSS files). Look for future posts on that. And R can go both ways - not only can it read a tab-delimited file, it can write one too. For instance, if you're doing a lot of different transformations or computing new variables, you might want to save your new datafile for later use. I've also used this command to write results tables to a tab-delimited file I can then import into Excel for formatting. You'll need to reference the output by name, so if you wanted to write your descriptives to a tab-delimited file, you'd first need to save them as a named object:

desc<-describe(caffeine)

Note that above, we just typed the describe command in directly, so you'll want to rerun it with a name and the arrow (<-). (This is, in my opinion, the easiest way for a new R user, but there is a way to do all of this in one step that we can get into later.) Now, write the descriptives to a tab-delimited file:

write.table(desc, "desc.txt", row.names=FALSE, sep="\t")

Without the row.names argument, R will add numbers to each row. This might be helpful when writing data to a tab-delimited file (it basically gives you case numbers) but I tend to suppress this, mostly because I almost always give my cases some kind of ID number from the beginning.

One note for any R-savvy readers of this post - the sep argument technically isn't needed in the read.delim command, because tab ("\t") is already its default (it is needed in write.table, though, which defaults to a space). Either way, I include it to be clear what delimiter I'm using, and so you get used to specifying it. After all, you might need to use a comma delimiter or something else in the future.
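For instance, if you ever wanted a comma-delimited file instead (which Excel opens directly), here's what that might look like - both of these are base R:

write.table(desc, "desc.csv", row.names=FALSE, sep=",")   # same table, comma-delimited
write.csv(desc, "desc.csv", row.names=FALSE)   # write.csv builds the comma delimiter in for you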

Hopefully this has given you enough to get started. You can view help files for different packages by going to the packages tab in the lower right, then clicking on the package name. Scroll through the different commands available in that package and click on one to see more info about it, including sample code. I hope to post some new R tutorials soon! And let me know in the comments if you have any questions about anything.
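P.S. If you prefer typing to clicking, you can pull up the same help straight from the console (once you've installed and loaded the package in question, like psych above):

?describe   # help page for a specific command
help(package="psych")   # list everything the psych package offers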

*Check out William Revelle's page for great resources about the psych package (which he created) and R in general.
