Monday, April 23, 2018

T is for tibble

For the letter D, I introduced data frames, a built-in R object type. But as I've learned more about R and, in particular, the tidyverse - most recently when I finally started reading Text Mining with R: A Tidy Approach - I learned about a more modern version of the R data frame: a tibble.

According to the tibble overview on the tidyverse website:
Tibbles are data.frames that are lazy and surly: they do less (i.e. they don't change variable names or types, and don't do partial matching) and complain more (e.g. when a variable does not exist). This forces you to confront problems earlier, typically leading to cleaner, more expressive code.
What does this mean? Well, remember when I noted that a character variable in my measures data frame had been changed to a factor? I manually changed it back to character. But had I simply created a tibble with that information, I wouldn't have had to do anything. Data frames will also do partial matching on variable names with $ - so if Facebook were a data frame and I requested Facebook$ge, it would have quietly given me the gender variable, because gender is the only variable whose name starts with "ge". (An ambiguous prefix like Facebook$R, which matches every Rum item, returns NULL instead.) If I tried that with a tibble, I'd get NULL and a warning about an unknown column, because it matches variable references literally.
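Here's a small sketch of those "lazy and surly" behaviors side by side. The two-row data set and the column name `item name` are made up for illustration, and stringsAsFactors = TRUE is set explicitly so the data frame behaves the way it did by default in versions of R before 4.0.0.

```r
library(tibble)

# Toy data, invented for this comparison.
# stringsAsFactors = TRUE mimics the data.frame() default in R < 4.0.0.
df <- data.frame(`item name` = c("a", "b"), Rum1 = 1:2,
                 stringsAsFactors = TRUE)
tb <- tibble(`item name` = c("a", "b"), Rum1 = 1:2)

# Variable names: data.frame() rewrites them to be syntactic, tibble() doesn't
names(df)   # "item.name" "Rum1"
names(tb)   # "item name" "Rum1"

# Variable types: the data frame converted the strings, the tibble left them alone
class(df$item.name)     # "factor"
class(tb$`item name`)   # "character"

# Partial matching: "R" is a unique prefix of Rum1 in the data frame
df_partial <- df$R      # 1 2
tb_partial <- tb$R      # NULL, plus a warning about an unknown column
```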

There are a few ways to create a tibble: you can build or convert one with the tibble package, or read one in from a file with the readr package. Fortunately, you don't need to worry about loading each one, because we're just going to use the tidyverse package, which contains those two and more.

install.packages("tidyverse")
## Installing package into '~/R/win-library/3.4'
## (as 'lib' is unspecified)
library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag():    dplyr, stats

First, let's create a new tibble from scratch. The syntax is almost exactly the same as it was in the data frame post.

measures<-tibble(
  meas_id = c(1:6),
  name = c("Ruminative Response Scale","Savoring Beliefs Inventory",
           "Satisfaction with Life Scale","Ten-Item Personality Measure",
           "Cohen-Hoberman Inventory of Physical Symptoms",
           "Center for Epidemiologic Studies Depression Scale"),
  num_items = c(22,24,5,10,32,16),
  rev_items = c(FALSE, TRUE, FALSE, TRUE, FALSE, TRUE)
)
measures
## # A tibble: 6 x 4
##   meas_id                                              name num_items
##     <int>                                             <chr>     <dbl>
## 1       1                         Ruminative Response Scale        22
## 2       2                        Savoring Beliefs Inventory        24
## 3       3                      Satisfaction with Life Scale         5
## 4       4                      Ten-Item Personality Measure        10
## 5       5     Cohen-Hoberman Inventory of Physical Symptoms        32
## 6       6 Center for Epidemiologic Studies Depression Scale        16
## # ... with 1 more variables: rev_items <lgl>

As you can see, the name variable is character, not factor. I didn't have to do anything. Alternatively, you could convert an existing data frame, whether it's one you created or one that came with R/an R package.

car<-as_tibble(mtcars)
car
## # A tibble: 32 x 11
##      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##  * <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1  21.0     6 160.0   110  3.90 2.620 16.46     0     1     4     4
##  2  21.0     6 160.0   110  3.90 2.875 17.02     0     1     4     4
##  3  22.8     4 108.0    93  3.85 2.320 18.61     1     1     4     1
##  4  21.4     6 258.0   110  3.08 3.215 19.44     1     0     3     1
##  5  18.7     8 360.0   175  3.15 3.440 17.02     0     0     3     2
##  6  18.1     6 225.0   105  2.76 3.460 20.22     1     0     3     1
##  7  14.3     8 360.0   245  3.21 3.570 15.84     0     0     3     4
##  8  24.4     4 146.7    62  3.69 3.190 20.00     1     0     4     2
##  9  22.8     4 140.8    95  3.92 3.150 22.90     1     0     4     2
## 10  19.2     6 167.6   123  3.92 3.440 18.30     1     0     4     4
## # ... with 22 more rows
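One thing to watch for when converting: mtcars stores the car models as row names, and tibbles don't really use row names (depending on your version of tibble, as_tibble may drop them entirely). A minimal sketch of one way to hold on to them, using rownames_to_column from the tibble package; the column name "model" and the object name cars_named are just my choices here.

```r
library(tibble)

# Move the row names into a regular column first, then convert
cars_named <- as_tibble(rownames_to_column(mtcars, var = "model"))

# The car names now live in an ordinary character column
cars_named$model[1]   # "Mazda RX4"
```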

But chances are you'll be reading in data from an external file. The readr package can handle delimited and fixed-width files. For instance, to read in the Facebook dataset I've been using, which is tab-delimited, I just need the function read_tsv.

Facebook<-read_tsv("small_facebook_set.txt",col_names=TRUE)
## Parsed with column specification:
## cols(
##   .default = col_integer()
## )
## See spec(...) for full column specifications.
Facebook
## # A tibble: 257 x 111
##       ID gender  Rum1  Rum2  Rum3  Rum4  Rum5  Rum6  Rum7  Rum8  Rum9
##    <int>  <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
##  1     1      1     3     1     3     2     3     1     2     1     1
##  2     2      1     1     1     1     1     1     1     0     0     1
##  3     3      1     4     3     3     4     3     4     2     3     3
##  4     4      0     4     0     0     2     0     0     4     0     2
##  5     5      1     2     2     2     1     2     1     1     1     1
##  6     6      0     2     4     3     4     2     3     2     2     3
##  7     7      1     1     2     3     2     0     2     3     1     2
##  8     8      0     2     1     1     2     0     2     3     3     3
##  9     9      1     4     1     4     4     3     2     2     1     1
## 10    10      1     4     2     0     3     4     2     4     1     2
## # ... with 247 more rows, and 100 more variables: Rum10 <int>,
## #   Rum11 <int>, Rum12 <int>, Rum13 <int>, Rum14 <int>, Rum15 <int>,
## #   Rum16 <int>, Rum17 <int>, Rum18 <int>, Rum19 <int>, Rum20 <int>,
## #   Rum21 <int>, Rum22 <int>, Sav1 <int>, Sav2 <int>, Sav3 <int>,
## #   Sav4 <int>, Sav5 <int>, Sav6 <int>, Sav7 <int>, Sav8 <int>,
## #   Sav9 <int>, Sav10 <int>, Sav11 <int>, Sav12 <int>, Sav13 <int>,
## #   Sav14 <int>, Sav15 <int>, Sav16 <int>, Sav17 <int>, Sav18 <int>,
## #   Sav19 <int>, Sav20 <int>, Sav21 <int>, Sav22 <int>, Sav23 <int>,
## #   Sav24 <int>, LS1 <int>, LS2 <int>, LS3 <int>, LS4 <int>, LS5 <int>,
## #   Extraverted <int>, Critical <int>, Dependable <int>, Anxious <int>,
## #   NewExperiences <int>, Reserved <int>, Sympathetic <int>,
## #   Disorganized <int>, Calm <int>, Conventional <int>, Health1 <int>,
## #   Health2 <int>, Health3 <int>, Health4 <int>, Health5 <int>,
## #   Health6 <int>, Health7 <int>, Health8 <int>, Health9 <int>,
## #   Health10 <int>, Health11 <int>, Health12 <int>, Health13 <int>,
## #   Health14 <int>, Health15 <int>, Health16 <int>, Health17 <int>,
## #   Health18 <int>, Health19 <int>, Health20 <int>, Health21 <int>,
## #   Health22 <int>, Health23 <int>, Health24 <int>, Health25 <int>,
## #   Health26 <int>, Health27 <int>, Health28 <int>, Health29 <int>,
## #   Health30 <int>, Health31 <int>, Health32 <int>, Dep1 <int>,
## #   Dep2 <int>, Dep3 <int>, Dep4 <int>, Dep5 <int>, Dep6 <int>,
## #   Dep7 <int>, Dep8 <int>, Dep9 <int>, Dep10 <int>, Dep11 <int>,
## #   Dep12 <int>, Dep13 <int>, Dep14 <int>, Dep15 <int>, Dep16 <int>
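The "Parsed with column specification" message above is readr telling you the types it guessed. If you'd rather state the types yourself, you can pass col_types. Here's a hedged sketch that doesn't need my file: it writes a tiny tab-delimited file (the first two rows and three columns of the output above) to a temporary location, so tsv_path and demo are stand-ins rather than the real dataset.

```r
library(readr)

# Write a tiny tab-delimited stand-in file to a temporary location
tsv_path <- tempfile(fileext = ".txt")
writeLines(c("ID\tgender\tRum1",
             "1\t1\t3",
             "2\t1\t1"), tsv_path)

# Spell out the column types instead of letting readr guess them
demo <- read_tsv(
  tsv_path,
  col_names = TRUE,
  col_types = cols(.default = col_integer())
)
demo
```

Supplying col_types also silences the parsing message, which is handy in scripts you run repeatedly.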

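Finally, if you're working with SAS, SPSS, or Stata files, you can read those in with the haven package and the functions read_sas, read_sav, and read_dta, respectively. haven is installed along with the tidyverse, but library(tidyverse) doesn't load it, so you'll need to call library(haven) yourself.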
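As a quick sketch of what that looks like for SPSS, the example below writes a made-up three-row data set to a temporary .sav file with haven's write_sav and reads it straight back in. The object names and values are invented, and with a real file you'd only need the read_sav line.

```r
library(tibble)
library(haven)

# Made-up data, saved as a temporary SPSS file so the example is self-contained
demo <- tibble(id = 1:3, score = c(2.5, 3.0, 4.5))
sav_path <- tempfile(fileext = ".sav")
write_sav(demo, sav_path)

# read_sav() hands back a tibble
demo_back <- read_sav(sav_path)
demo_back
```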

If for some reason you need a data frame rather than a tibble, you can convert a tibble to a data frame with as.data.frame(tibble_name). Wrapping that call in class() will confirm that the result is a plain data frame.
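A short sketch of the round trip, using the mtcars tibble from earlier (car_df is just a name I picked): as.data.frame does the converting, and class only reports what you ended up with.

```r
library(tibble)

car <- as_tibble(mtcars)

# Convert the tibble back to an ordinary data frame
car_df <- as.data.frame(car)

class(car)      # "tbl_df" "tbl" "data.frame"
class(car_df)   # "data.frame"
```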

You can learn more about tibbles here and here.
