Thursday, April 9, 2020

H is for haven

The tidyverse includes many packages meant to make importing, wrangling, analyzing, and visualizing data easier. The haven package allows you to import files from other statistical software, such as SPSS, SAS, and Stata. I learned SPSS in college and used it extensively in grad school. I ended up switching to R because SPSS was getting expensive to buy on my own, and I wanted to be able to analyze my dissertation data without traveling to campus each time. I'm thankful I had the bravery to teach myself R back in 2009, because it's become one of my strongest and most marketable skills, and I've had a lot of time and opportunity to hone that skill.

But like many people who have worked in the professional research world, I still work with and have access to files built with programs like SPSS. Fortunately, I can still access those files at home on my personal laptop, thanks to haven.

Since I have access to SPSS through work, I was able to save my 2019 reads datafile in SPSS format, to give you something to test out the haven package with. You can download that file here - you won't need SPSS to access it, because R will do that for us.

haven is part of the tidyverse, so installing the tidyverse will give you that package. However, it isn't part of the core tidyverse, meaning the library(tidyverse) command won't load it automatically, so you'll need to load it separately. I'll still load tidyverse, though, mainly because it's typically the first library I load when I start analyzing data in R, since the core functions are so frequently used in my code.

library(tidyverse)
## -- Attaching packages ------------------------------------------- tidyverse 1.3.0 --
## ✓ ggplot2 3.2.1     ✓ purrr   0.3.3
## ✓ tibble  2.1.3     ✓ dplyr   0.8.3
## ✓ tidyr   1.0.0     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.4.0
## -- Conflicts ---------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(haven)
spssreads <- read_spss("~/Downloads/Blogging A to Z/SaraReads2019_allrated.sav")

head(spssreads)
## # A tibble: 6 x 18
##   Title Pages date_started date_read  Book.ID Author AdditionalAutho…
##   <chr> <dbl> <date>       <date>       <dbl> <chr>  <chr>           
## 1 1Q84    925 2019-09-03   2019-09-10  1.04e7 Murak… "Jay Rubin, Phi…
## 2 A Di…   256 2019-08-21   2019-08-22  5.46e4 Kalfu… ""              
## 3 Alas…   323 2019-12-21   2019-12-23  3.82e4 Frank… ""              
## 4 Arte…   305 2019-04-08   2019-04-11  3.49e7 Weir,… ""              
## 5 Bird…   262 2019-02-07   2019-02-13  1.85e7 Maler… ""              
## 6 Boun…   314 2019-04-23   2019-04-26  9.44e5 Cloud… "John  Townsend"
## # … with 11 more variables: AverageRating <dbl>, OriginalPublicationYear <dbl>,
## #   read_time <dbl>, MyRating <dbl>, Gender <dbl>, Fiction <dbl>,
## #   Childrens <dbl>, Fantasy <dbl>, SciFi <dbl>, Mystery <dbl>, SelfHelp <dbl>

One interesting thing about haven, which I looked up in this GitHub post, is how it has to handle dates. SPSS stores them as the number of seconds since an origin, and haven translates between that representation and R's own date classes, which is why the date columns above show up as <date>.

Typically that origin is January 1, 1970, and in fact, any date variable you work with in R is (under the hood) stored as a count from that origin: days for Date values and seconds for date-times (so-called UNIX time or Epoch time). But the SPSS dates haven works with use a different origin: midnight at the start of October 14, 1582. What is the significance of this date? It marks the switch to the Gregorian calendar, whose first day was October 15, 1582. You didn't realize you'd be getting a history lesson with this post, did you?
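
To make the two origins concrete, here's a minimal sketch in base R. It assumes the raw SPSS value is the number of seconds since midnight at the start of October 14, 1582, which is the origin IBM documents for SPSS dates. The helper names spss_seconds_to_date() and date_to_spss_seconds() are my own illustrations, not haven functions. haven does this translation for you, so you'd only need something like this if you were handed the raw numbers.

```r
# Assumption: SPSS date values count seconds from midnight at the start of
# 14 October 1582 (the origin given in IBM's SPSS documentation).
spss_origin <- as.Date("1582-10-14")
seconds_per_day <- 86400

# Hypothetical helper: raw SPSS seconds -> R Date
spss_seconds_to_date <- function(seconds) {
  as.Date(seconds / seconds_per_day, origin = spss_origin)
}

# Hypothetical helper: R Date -> raw SPSS seconds
date_to_spss_seconds <- function(date) {
  days <- as.numeric(difftime(date, spss_origin, units = "days"))
  days * seconds_per_day
}

# R's own representation of a Date: days since 1970-01-01
unclass(as.Date("2019-09-03"))

# Round trip one of the start dates from the reads file
raw <- date_to_spss_seconds(as.Date("2019-09-03"))
raw
spss_seconds_to_date(raw)
```

The raw value is a very large number, which is a good hint that you're looking at an SPSS-style date rather than an R one.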

haven can also write data in these programs' file formats, so if you have a collaborator who wants to use SPSS or Stata to analyze the data, you can create a version in that program's native format.
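
Here's a rough sketch of what writing those files could look like, using haven's write_sav() and write_dta() functions. The little data frame is made up for illustration (it is not my reads file), and I'm writing to temporary files so nothing lands in your working directory. Swap in your own data and paths.

```r
library(haven)

# Made-up example data; the column names just echo the reads file
books <- data.frame(
  Title = c("Example Book A", "Example Book B"),
  Pages = c(250, 310),
  date_read = as.Date(c("2019-09-10", "2019-04-11")),
  stringsAsFactors = FALSE
)

# Temporary paths, used here only so the example cleans up after itself
sav_path <- tempfile(fileext = ".sav")
dta_path <- tempfile(fileext = ".dta")

write_sav(books, sav_path)  # SPSS format
write_dta(books, dta_path)  # Stata format

# Read the SPSS file back in to check that the dates survived the trip
roundtrip <- read_sav(sav_path)
roundtrip$date_read
```

Reading the file back in is a quick sanity check before you send it to a collaborator, especially for dates.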

Tomorrow, we'll talk about resources for learning more about tidyverse. And the day after that, get ready for joins!


  1. I guess the reason why haven uses this origin is the fact that SPSS ('born' in 1968) chose this date as its reference.

    1. Great point! SPSS predates the Epoch, so they had to pick some meaningful date to represent time that wasn't related to UNIX. Funny side-note: My dissertation director used SPSS for analysis of her dissertation data and still had the punch cards to show me. SPSS uses a lot of its lingo in relation to those punch cards (e.g., specifying the column length of the variable, etc.), so it was cool to see the physical representation and explanation for that lingo. I love learning about the history of statistics, analysis, computers, and so on, so thanks for sharing this fact!