Wednesday, April 4, 2018

D is for Data Frame

Working in R involves dealing with various objects, such as functions and a variety of data structures. One of the objects I work with the most in R is the data frame, a data table in which rows are cases and columns are variables (in the research, rather than programming, sense of the word - while I tend to fall into research speak, column is probably the better word to use in this context). Unlike some other data structures in R, the columns can be of different types - for instance, one column may contain string (text) data, another numeric, another a Boolean indicator (TRUE, FALSE), and so on. R and its various libraries come with many built-in data frames, and creating one with your own data is easy and can be accomplished in multiple ways. The data frame is then what you hand to most statistical analysis functions. Today, I'll show you some of the main methods for creating data frames from scratch or reading them in from other sources.

Creating a Data Frame From Scratch

Creating a data frame from scratch involves binding together other R objects, such as matrices, arrays, or vectors. This makes sense if you have raw data not yet entered into some other type of data object or, as I've done in some past R posts, when you're generating data. Each row name must be unique, and the same goes for each column, though the name could be as simple as a number. For instance, in my alpha and beta posts, I referred to columns in my Facebook data frame by their column number: e.g., Facebook[,3], which references column 3. (Note: You can use that same notation to refer to row numbers; just put that information within the brackets and before the comma. For instance, Facebook[3,] references row 3. Leaving the space before the comma blank includes all rows, and leaving the space after the comma blank includes all columns, in the data frame.) I could have instead referred by column name, with the dataset$variable format, but number is easier if you're referencing a range of columns.
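
Here's a minimal sketch of that bracket notation. The fb data frame and its column names (ID, RRS, SBI, SWLS) are made up for illustration; they stand in for my Facebook data, which isn't reproduced here.

```r
# Small stand-in data frame; values and column names are invented
fb <- data.frame(
  ID = c(101, 102, 103, 104),
  RRS = c(40, 52, 37, 45),
  SBI = c(120, 98, 133, 110),
  SWLS = c(22, 18, 30, 25)
)

fb[, 3]        # column 3, returned as a vector
fb[3, ]        # row 3, returned as a one-row data frame
fb[, 2:4]      # a range of columns, easiest to request by number
fb$SBI         # the same column as fb[, 3], referenced by name
fb[2, "SWLS"]  # a single cell: row 2 of the SWLS column
```

Numbers and names can be mixed freely inside the brackets, so use whichever is easier to read in context.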

You can assign whatever names you want - for instance, in my Facebook data frame, column names were pulled from the header in the tab-delimited file - but duplicate row names will give you an error, and duplicate column names are silently renamed by default (a second score column becomes score.1). By default, R uses each row's number in the data frame as its row name, but I could also set that myself; for instance, I might want to assign my unique ID variable as my row names:

Facebook<-read.delim(file="small_facebook_set.txt", header=TRUE)
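
Setting the row names is a separate step after the data are read in. Here's a minimal sketch using a small stand-in data frame; the column name ID is an assumption, so swap in whatever your unique identifier is called.

```r
# Hypothetical stand-in for the Facebook data; "ID" is an assumed column name
fb <- data.frame(
  ID = c("P01", "P02", "P03"),
  score = c(12, 15, 9)
)

rownames(fb)            # default row names are the row numbers
rownames(fb) <- fb$ID   # use the unique ID as the row names instead
fb["P02", ]             # rows can now be looked up by ID

# Duplicate row names are rejected; try() keeps the script running
dup <- try(data.frame(score = c(1, 2), row.names = c("a", "a")), silent = TRUE)
inherits(dup, "try-error")
```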

Another important thing to remember is that each column in an R data frame must have the same number of values (rows). Those values can be missing, but there still has to be something there, or else R will give you an error when you try to create a data frame.

data.frame(
  meas_id = c(1:6),
  score = c(1,2,5,4,5)
)
## Error in data.frame(meas_id = c(1:6), score = c(1, 2, 5, 4, 5)): arguments imply differing number of rows: 6, 5
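
One way around that error, sketched here with a made-up object name (scores), is to mark the missing value explicitly with NA so both columns have six entries:

```r
# NA marks the missing sixth score so both columns have six values
scores <- data.frame(
  meas_id = 1:6,
  score = c(1, 2, 5, 4, 5, NA)
)

nrow(scores)                      # number of rows in the data frame
is.na(scores$score)               # flags which scores are missing
mean(scores$score, na.rm = TRUE)  # skip the NA when summarising
```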

So let's start with the first way to create a data frame from scratch - manually type in values. This approach makes sense when the number of rows and columns is small. Let's create a data frame listing the measures I used in my Facebook study, which I'm using as an ongoing example this month:

# The object name "measures" is assumed here
measures <- data.frame(
  meas_id = c(1:6),
  name = c("Ruminative Response Scale","Savoring Beliefs Inventory",
           "Satisfaction with Life Scale","Ten-Item Personality Measure",
           "Cohen-Hoberman Inventory of Physical Symptoms",
           "Center for Epidemiologic Studies Depression Scale"),
  num_items = c(22,24,5,10,32,16),
  rev_items = c(FALSE, TRUE, FALSE, TRUE, FALSE, TRUE)
)

This code creates a data frame of 6 rows and 4 columns. I've assigned a measurement id, and given the name of the scale, the number of items, and an indicator of whether the scale has any reversed items. Also, I'd like to point out that I use equal signs with the data.frame code. This not only creates that column, it gives the column that name. Using arrows <- instead gives weird column names. If you're curious - hey, the best way I learn is through a combination of curiosity and making (sometimes purposeful) mistakes - change = to <- and see what happens.
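
If you'd rather see the difference than discover it, here's a minimal sketch. The object names (eq_df, arrow_df, meas_id2) are made up for this comparison.

```r
# With =, the argument name becomes the column name
eq_df <- data.frame(meas_id = 1:3)
names(eq_df)

# With <-, R assigns meas_id2 in the workspace as a side effect and
# builds the column name from the text of the whole expression
arrow_df <- data.frame(meas_id2 <- 1:3)
names(arrow_df)
exists("meas_id2")
```

So the arrow version still runs; it just leaves you with an awkward column name and an extra object you probably didn't mean to create.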

Each column is a different data type. I can ask R what the data type is for a specific variable with str(dataset$variable), or for all variables with str(dataset):

str(measures)
## 'data.frame': 6 obs. of  4 variables:
##  $ meas_id  : int  1 2 3 4 5 6
##  $ name     : Factor w/ 6 levels "Center for Epidemiologic Studies Depression Scale",..: 3 5 4 6 2 1
##  $ num_items: num  22 24 5 10 32 16
##  $ rev_items: logi  FALSE TRUE FALSE TRUE FALSE TRUE

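Int refers to integer, num to numeric, and logi to logical (Boolean). R has turned the measurement name column into a factor, which isn't what I wanted. (Before R 4.0.0, data.frame() converted strings to factors by default; newer versions leave them as character.) I can fix this by forcing R to make the column a character vector:

measures$name <- as.character(measures$name)
str(measures$name)
##  chr [1:6] "Ruminative Response Scale" "Savoring Beliefs Inventory" ...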
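
Another option is to stop the conversion when the data frame is created. This is a minimal sketch with a shortened, made-up data frame (scales) and a toy factor (f), not the full measures table:

```r
# stringsAsFactors = FALSE keeps text columns as character at creation time
# (this is already the default from R 4.0.0 onward)
scales <- data.frame(
  meas_id = 1:2,
  name = c("Ruminative Response Scale", "Savoring Beliefs Inventory"),
  stringsAsFactors = FALSE
)
str(scales$name)

# Converting an existing factor after the fact gives back the labels
f <- factor(c("b", "a", "b"))
as.character(f)
```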

On the other hand, if I'm generating random data, I would write the code to create my data then bind them together into a data frame, again enclosing that code with data.frame(). For examples on how I've done this, see here, here, and here.
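
A minimal sketch of that generate-then-bind pattern follows. The variable names, sample size, seed, and distribution parameters are all arbitrary choices for illustration, not values from my earlier posts.

```r
set.seed(42)  # arbitrary seed so the draws are reproducible

n <- 50
id <- 1:n
group <- sample(c("control", "treatment"), n, replace = TRUE)
score <- rnorm(n, mean = 100, sd = 15)

# Unnamed vectors passed to data.frame() keep their object names as column names
sim <- data.frame(id, group, score)
str(sim)
```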

Reading Data into R

It's more likely that you already have data in some other form and want to read it into R. I'm a big fan of tab-delimited and CSV files, both of which are easy to create, small and compact even with a lot of data, and really easy to read in. I recommend including variable names in the first row of the file, though you can add them later if they're not there. Reading in a tab-delimited file is easy with the read.delim function, which automatically generates an R data frame - just remember to name it so it will store it in the R environment:

mytabdata <- read.delim("file.txt", header=TRUE)

If you're working with a CSV file, it has its own function, read.csv:

mycsvdata <- read.csv("file.csv", header=TRUE)

If your files are not saved in the working directory, put the entire path within the file quotes - or better yet, change the working directory with setwd("path"). The defaults for these two functions are tab-separator and comma-separator, respectively, so you don't have to specify, though you could with sep="\t" and sep=",", respectively. Both also assume by default that character data are enclosed in double quotation marks.
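
To try these functions without hunting for a file, here's a minimal sketch that writes a tiny made-up data frame (demo) to temporary files and reads it back. The object names and contents are invented for the example.

```r
# Write a small tab-delimited file and a CSV to temporary locations
tab_file <- tempfile(fileext = ".txt")
csv_file <- tempfile(fileext = ".csv")
demo <- data.frame(id = 1:3, label = c("a", "b", "c"))
write.table(demo, tab_file, sep = "\t", row.names = FALSE, quote = FALSE)
write.csv(demo, csv_file, row.names = FALSE)

# The defaults handle each format...
tab_data <- read.delim(tab_file, header = TRUE)
csv_data <- read.csv(csv_file, header = TRUE)

# ...and spelling out the separator gives the same result
tab_data2 <- read.delim(tab_file, header = TRUE, sep = "\t")
```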

If you're accessing a tab-delimited or CSV file saved at a url, just put url(webaddress) where I currently have "file.txt" or "file.csv". There are also ways to read HTML tables into data frames, but I think that's another post for another day.

And if your file does not contain variable names in the first line, set header=FALSE (leaving header out isn't enough, since read.delim and read.csv default to header=TRUE). If it does, but you'd rather not use them, set header=FALSE and tell R to skip those rows by adding skip=# (number of rows to skip). Then add in column names with col.names, as a vector with each name enclosed in quote marks and separated by commas - make sure you provide as many names as there are columns or you'll get an error. For instance, if I had instead read in the measure information from above and needed to provide column names separately, I would add col.names=c("meas_id","name","num_items","rev_items") to the read.delim call.
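
Here's a minimal sketch of that situation. The file is a hypothetical one written to a temporary location, with a header row I don't want and only two of the measures, and measures_in is a made-up object name.

```r
# Hypothetical file: a header row we do not want, then two rows of data
meas_file <- tempfile(fileext = ".txt")
writeLines(c("V1\tV2\tV3\tV4",
             "1\tRuminative Response Scale\t22\tFALSE",
             "2\tSavoring Beliefs Inventory\t24\tTRUE"),
           meas_file)

# skip = 1 drops the unwanted header; col.names supplies our own names
measures_in <- read.delim(meas_file, header = FALSE, skip = 1,
                          col.names = c("meas_id", "name",
                                        "num_items", "rev_items"))
str(measures_in)
```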


You can also load a pre-existing dataset in R, and this is frequently done to show examples of what R can do. If a dataset is part of base R, you can load it with data(name). A popular example dataset in R is mtcars - a dataset taken from the 1974 Motor Trend US magazine. Sometimes, the data(name) code lists the set in your environment as a promise rather than a data frame, because R waits to load it until it is first used. You can quickly force it to load by requesting the head (column names plus first several rows):

data(mtcars)
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Applying any other function to the data will also force it to load as a data frame. But the head(data) function is a quick way to do that, and is nice if you want to view the data.

On the other hand, if a dataset is part of an R library, you need to load that library first. For instance, the metafor package I'll be using tomorrow, which is used to conduct meta-analysis, comes with many built-in datasets. The McDaniel 1994 dataset (dat.mcdaniel1994) contains studies examining the validity of employment interviews. You can pull it up in the same way as mtcars, after installing (if you haven't before) and loading the metafor package:

install.packages("metafor")
library(metafor)
data(dat.mcdaniel1994)
head(dat.mcdaniel1994)
## Installing package into '\\marge/users$/slocatelli/My Documents/R/win-library/3.4'
## (as 'lib' is unspecified)
## Loading required package: Matrix
## Loading 'metafor' package (version 2.0-0). For an overview 
## and introduction to the package please type: help(metafor).
##   study   ni   ri type struct
## 1     1  123 0.00    j      s
## 2     2   95 0.06    p      u
## 3     3   69 0.36    j      s
## 4     4 1832 0.15    j      s
## 5     5   78 0.14    j      s
## 6     6  329 0.06    j      s

To keep this post from getting obscenely long, I'll wrap up here, but there are a few other ways to read files into data frames, which I hope to blog about later. If you're working with SPSS, SAS, Stata, or other data from statistical packages, you can read them in with the Hmisc (SPSS and SAS) or foreign (all of the above, though not as powerful as Hmisc) package. If you're working directly with Excel files, and don't want to convert to tab-delimited or CSV first, I recommend the XLConnect package, which also lets you move R objects, including graphics, over to Excel.

Finally, you can read in other file types, such as fixed-width files, XML, or JSON - which have a variety of applications in large-scale testing, particularly computer-adaptive testing. I worked with fixed-width files quite a bit at HMH, since data for some of our cognitive ability tests was sent to us in fixed-width files, and a colleague there frequently worked with JSON, which was used for some of our large-scale computer-adaptive and computer-administered tests. Look for future blog posts on those (and what they are, if these are new terms for you), since there are some nuances to the code that require a bit more depth.
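
As a small preview of the fixed-width case, here's a minimal sketch using base R's read.fwf. The layout (a 3-character ID, a 2-character age, and a 3-character score) and the values are invented for illustration and have nothing to do with the HMH files.

```r
# Hypothetical fixed-width file: each field sits at a fixed character position
fwf_file <- tempfile(fileext = ".txt")
writeLines(c("001 9105",
             "00210 98",
             "00311112"),
           fwf_file)

# widths gives the number of characters in each field, in order;
# reading id as character keeps its leading zeros
fwf_data <- read.fwf(fwf_file, widths = c(3, 2, 3),
                     col.names = c("id", "age", "score"),
                     colClasses = c("character", "integer", "integer"))
str(fwf_data)
```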
