Tuesday, April 24, 2018

Bonus Post: Social Network Analysis with the Battles of A Song of Ice and Fire

As promised, here's the network analysis and graph from the Battles dataset I introduced in this post. I based my code on code available in this post. First up, we'll load the Battles dataset.

library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.4.4
## -- Attaching packages ----------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 2.2.1     v purrr   0.2.4
## v tibble  1.4.2     v dplyr   0.7.4
## v tidyr   0.8.0     v stringr 1.2.0
## v readr   1.1.1     v forcats 0.2.0
## Warning: package 'tibble' was built under R version 3.4.4
## Warning: package 'tidyr' was built under R version 3.4.4
## Warning: package 'purrr' was built under R version 3.4.4
## -- Conflicts -------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
Battles<-read_csv('https://raw.githubusercontent.com/chrisalbon/war_of_the_five_kings_dataset/master/5kings_battles_v1.csv')
## Parsed with column specification:
## cols(
##   .default = col_character(),
##   year = col_integer(),
##   battle_number = col_integer(),
##   major_death = col_integer(),
##   major_capture = col_integer(),
##   attacker_size = col_integer(),
##   defender_size = col_integer(),
##   summer = col_integer()
## )
## See spec(...) for full column specifications.

Next, we'll prepare our data for the network model, which involves extracting the attacker and defender kings, and labeling them as such.

attacker_king<-Battles %>%
  distinct(attacker_king) %>%
  rename(label=attacker_king)

defender_king<-Battles %>%
  distinct(defender_king) %>%
  rename(label=defender_king)

nodes<-full_join(attacker_king,defender_king,by="label")

nodes<-nodes %>% rowid_to_column("id")

Now we can build the edge list the figure will be based on. We want the figure to contain each of the kings, with lines between them indicating that they battled each other. The thickness of each line will be determined by the number of battles involving those two kings.

per_battles<-Battles %>%
  group_by(attacker_king,defender_king) %>%
  summarise(weight=n()) %>%
  ungroup()

edges<-per_battles %>%
  left_join(nodes, by=c("attacker_king" = "label")) %>%
  rename(from=id)

edges<-edges %>%
  left_join(nodes,by=c("defender_king" = "label")) %>%
  rename(to=id)

edges<-select(edges,from,to,weight)

We have the data we need. The last step is to put it into a tbl_graph object, which ggraph uses to create the figure.

library(tidygraph)
## Warning: package 'tidygraph' was built under R version 3.4.4
## 
## Attaching package: 'tidygraph'
## The following object is masked from 'package:stats':
## 
##     filter
library(ggraph)
## Warning: package 'ggraph' was built under R version 3.4.4
battles_tidy<-tbl_graph(nodes=nodes, edges=edges, directed=TRUE)
battles_tidy %>%
  activate(edges) %>%
  arrange(desc(weight))
## # A tbl_graph: 7 nodes and 14 edges
## #
## # A directed multigraph with 1 component
## #
## # Edge Data: 14 x 3 (active)
##    from    to weight
##   <int> <int>  <int>
## 1     1     2     10
## 2     2     1      9
## 3     3     2      4
## 4     3     1      2
## 5     1     4      2
## 6     4     1      2
## # ... with 8 more rows
## #
## # Node Data: 7 x 2
##      id label                   
##   <int> <chr>                   
## 1     1 Joffrey/Tommen Baratheon
## 2     2 Robb Stark              
## 3     3 Balon/Euron Greyjoy     
## # ... with 4 more rows
ggraph(battles_tidy, layout = "graphopt") +
  geom_node_point() +
  geom_edge_link(aes(width = weight), alpha = 0.8) +
  scale_edge_width(range = c(0.2, 2)) +
  geom_node_text(aes(label = label), repel = TRUE) +
  labs(edge_width = "Battles") +
  theme_graph()

Most of the battles in the dataset were between Joffrey/Tommen Baratheon and Robb Stark. We also have a blank spot in our figure: an unlabeled node representing a handful of battles that didn't involve a king, as well as one between Joffrey/Tommen and the Brave Companions. Here's the figure again, without the NAs.
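
If you'd like to reproduce that cleaned-up version, here's a minimal sketch (not necessarily the exact code behind the original figure) that drops battles with a missing attacker or defender king and then rebuilds the node and edge tables using the same steps as above:

# Keep only battles where both kings are known
battles_known<-Battles %>%
  filter(!is.na(attacker_king), !is.na(defender_king))

# Rebuild the node list from the filtered data
attackers<-battles_known %>% distinct(attacker_king) %>% rename(label=attacker_king)
defenders<-battles_known %>% distinct(defender_king) %>% rename(label=defender_king)
nodes_known<-full_join(attackers,defenders,by="label") %>% rowid_to_column("id")

# Rebuild the weighted edge list
edges_known<-battles_known %>%
  group_by(attacker_king,defender_king) %>%
  summarise(weight=n()) %>%
  ungroup() %>%
  left_join(nodes_known,by=c("attacker_king"="label")) %>%
  rename(from=id) %>%
  left_join(nodes_known,by=c("defender_king"="label")) %>%
  rename(to=id) %>%
  select(from,to,weight)

ggraph(tbl_graph(nodes=nodes_known, edges=edges_known, directed=TRUE), layout="graphopt") +
  geom_node_point() +
  geom_edge_link(aes(width = weight), alpha = 0.8) +
  scale_edge_width(range = c(0.2, 2)) +
  geom_node_text(aes(label = label), repel = TRUE) +
  labs(edge_width = "Battles") +
  theme_graph()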

U is for (Data From) URLs

Up to now, we've been working with files saved locally on your computer. But that limits you to files that can easily be saved to your computer and, so far, to structured data. As we move from pure statistics to a data science approach, we'll increasingly be working with data stored somewhere in the cloud. Today, I'd like to introduce how to begin working with data available from URLs. There are many directions you can take, depending on the type of data you're working with and how you plan to analyze it, so we'll only be scratching the surface of what could lead to multiple more advanced topics.

URL stands for Uniform Resource Locator. It can refer to a webpage (http), file transfer protocol (ftp), email (mailto), and any number of other resources. We'll focus today on collecting data from webpages.

Let's start with the simplest way to access data at a URL - using the same syntax we've used to read delimited files saved locally. For this example, I'm using Chris Albon's War of the Five Kings dataset, which contains data on battles from George R.R. Martin's A Song of Ice and Fire series (Game of Thrones for those of you familiar with the HBO show based on the series.)

The data is in a CSV file containing 25 variables for 38 battles. Unfortunately, the dataset hasn't been updated since 2014, but it's based on the books rather than the show, and it's fine for our present example. Since it's a CSV file, we can use a CSV-reading function to read the data into an R data frame. (I selected read_csv from the tidyverse, to create a tibble.) Where we would normally put the path to a locally saved file, we use the path to the CSV file saved on GitHub.

library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.4.4
## -- Attaching packages -------------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 2.2.1     v purrr   0.2.4
## v tibble  1.4.2     v dplyr   0.7.4
## v tidyr   0.8.0     v stringr 1.2.0
## v readr   1.1.1     v forcats 0.2.0
## Warning: package 'tibble' was built under R version 3.4.4
## Warning: package 'tidyr' was built under R version 3.4.4
## Warning: package 'purrr' was built under R version 3.4.4
## -- Conflicts ----------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
Battles<-read_csv('https://raw.githubusercontent.com/chrisalbon/war_of_the_five_kings_dataset/master/5kings_battles_v1.csv')
## Parsed with column specification:
## cols(
##   .default = col_character(),
##   year = col_integer(),
##   battle_number = col_integer(),
##   major_death = col_integer(),
##   major_capture = col_integer(),
##   attacker_size = col_integer(),
##   defender_size = col_integer(),
##   summer = col_integer()
## )
## See spec(...) for full column specifications.

You should now have a data frame called Battles, containing 38 observations on 25 variables. You could run some simple descriptives on the dataset, but a more fun analysis might be to adopt a network analysis approach and map out the attacker and defender kings, to show who is attacking whom and demonstrate the insanity that is ASoIaF. Because that kind of nerdy analysis sounded fun, I went ahead and did it, basing my code on code from this post. To keep this post from getting overly long, I made that a separate post, available here.

Another way to access data from URLs is to read in an HTML table. ESPN provides tons of sports-related data on their site, often in the form of HTML tables. Let's practice pulling some of the NCAA Division I men's basketball data, specifically their team scoring per game. This table also provides a few other variables, like percentages of field goals, 3-pointers, and free throws made. One of the best ways to cleanly scrape these data is with the XML package.

library(XML)
scoring_game<-readHTMLTable('http://www.espn.com/mens-college-basketball/statistics/team/_/stat/scoring-per-game/sort/avgPoints/count/', which=1)

The which argument tells readHTMLTable which HTML table on the page to read. There's only one on this page, but without specifying it, readHTMLTable scans the entire page for tables and returns a list object containing each table as a separate data frame (even if it only finds one). Since I wanted to go straight to the data frame without digging through a list first, I specified the table to read. Now we have a data frame called scoring_game, with columns named after the column headings from the HTML table. There are a few quirks to this data frame we want to fix. First, every 10 rows, ESPN adds an additional header row, which R read in as a data row; I want to drop those, since they don't contain any data. Second, each team is ranked, but when there are ties, the rank cell is blank, which R filled in with a special character (Â). Rather than manually cleaning up those cells, I'll have R generate new ranks for this data frame, allowing for ties, and drop the RK variable.

scoring_game<-scoring_game[-c(11,22,33),]   # drop the repeated header rows
library(tidyverse)
scoring_game<-scoring_game %>% mutate(rank = min_rank(desc(PTS)))   # save the new ranks back to the data frame
scoring_game
##    RK              TEAM GP  PTS   FGM-FGA  FG%   3PM-3PA  3P%   FTM-FTA
## 1   1     UNC Asheville  1 98.0 35.0-75.0 .467 14.0-31.0 .452 14.0-18.0
## 2   2            Oregon  2 95.5 31.0-55.0 .564  9.0-19.5 .462 24.5-32.5
## 3   3           Wofford  1 94.0 34.0-63.0 .540 15.0-31.0 .484 11.0-16.0
## 4   4           Seattle  1 90.0 33.0-72.0 .458 12.0-31.0 .387 12.0-21.0
## 5   5               USC  2 89.0 34.5-70.0 .493 10.0-25.5 .392 10.0-21.0
## 6   Â        Fort Wayne  1 89.0 33.0-60.0 .550 11.0-33.0 .333 12.0-15.0
## 7   7  Central Michigan  3 87.7 30.7-63.7 .482 14.7-36.3 .404 11.7-14.3
## 8   8        Miami (OH)  1 87.0 33.0-60.0 .550 13.0-26.0 .500   8.0-9.0
## 9   9        Seton Hall  2 86.5 28.5-61.0 .467  8.5-22.5 .378 21.0-27.5
## 10 10            Xavier  2 86.0 29.0-56.5 .513  8.0-18.5 .432 20.0-31.0
## 11  Â             Rider  1 86.0 35.0-76.0 .461  6.0-22.0 .273 10.0-16.0
## 12 12     West Virginia  3 85.7 30.7-66.0 .465  7.7-21.3 .359 16.7-21.7
## 13 13 Northern Colorado  4 85.5 30.3-60.8 .498 10.0-26.0 .385 15.0-23.3
## 14 14         Villanova  6 83.8 27.5-58.0 .474 12.7-30.5 .415 16.2-19.8
## 15 15          NC State  1 83.0 28.0-61.0 .459 11.0-30.0 .367 16.0-27.0
## 16  Â             Texas  1 83.0 31.0-67.0 .463 11.0-24.0 .458 10.0-18.0
## 17  Â               BYU  1 83.0 34.0-76.0 .447  6.0-27.0 .222  9.0-15.0
## 18  Â     Virginia Tech  1 83.0 30.0-54.0 .556  9.0-18.0 .500 14.0-20.0
## 19 19         Marquette  3 82.7 27.7-56.0 .494 10.3-23.7 .437 17.0-19.0
## 20 20       North Texas  6 82.5 28.7-60.3 .475  8.7-22.8 .380 16.5-22.8
## 21  Â        Ohio State  2 82.5 28.5-69.0 .413 11.5-33.0 .348 14.0-17.5
## 22 22           Buffalo  2 82.0 30.0-64.5 .465 11.0-30.5 .361 11.0-14.5
## 23 23              Duke  4 81.5 29.3-61.0 .480  8.8-26.5 .330 14.3-19.5
## 24 24            Kansas  5 80.6 28.2-61.6 .458  9.2-23.4 .393 15.0-20.0
## 25 25       Austin Peay  2 80.5 31.5-68.0 .463  4.5-19.5 .231 13.0-21.5
## 26 26       Utah Valley  2 80.0 28.5-56.5 .504  7.5-17.0 .441 15.5-18.5
## 27 27           Clemson  3 79.7 30.0-61.7 .486  7.3-20.0 .367 12.3-17.0
## 28 28  Middle Tennessee  2 79.5 32.0-55.5 .577  7.5-18.0 .417  8.0-11.0
## 29 29        Washington  2 79.0 29.5-57.5 .513  8.5-20.0 .425 11.5-17.5
## 30 30  Western Kentucky  4 78.5 29.0-59.8 .485  5.3-16.5 .318 15.3-19.3
## 31  Â            Baylor  2 78.5 30.5-59.0 .517  6.5-16.0 .406 11.0-17.5
## 32 32    Oklahoma State  3 78.3 24.7-67.0 .368  8.3-22.7 .368 20.7-28.7
## 33 33          Oklahoma  1 78.0 29.0-69.0 .420  4.0-20.0 .200 16.0-24.0
## 34  Â          Bucknell  1 78.0 23.0-55.0 .418 11.0-20.0 .550 21.0-28.0
## 35  Â          Canisius  1 78.0 23.0-61.0 .377  7.0-29.0 .241 25.0-36.0
## 36 36               LSU  2 77.5 27.5-57.0 .482  6.0-24.0 .250 16.5-23.0
## 37 37      Saint Mary's  3 77.3 29.7-57.0 .520  9.7-20.3 .475  8.3-12.3
## 38 38          Kentucky  3 77.0 26.0-52.3 .497  3.3-11.0 .303 21.7-30.7
## 39  Â         Texas A&M  3 77.0 29.7-59.7 .497  6.3-18.3 .345 11.3-19.0
## 40  Â      South Dakota  1 77.0 26.0-70.0 .371  6.0-33.0 .182 19.0-29.0
##     FT% rank
## 1  .778    1
## 2  .754    2
## 3  .688    3
## 4  .571    4
## 5  .476    5
## 6  .800    5
## 7  .814    7
## 8  .889    8
## 9  .764    9
## 10 .645   10
## 11 .625   10
## 12 .769   12
## 13 .645   13
## 14 .815   14
## 15 .593   15
## 16 .556   15
## 17 .600   15
## 18 .700   15
## 19 .895   19
## 20 .723   20
## 21 .800   20
## 22 .759   22
## 23 .731   23
## 24 .750   24
## 25 .605   25
## 26 .838   26
## 27 .725   27
## 28 .727   28
## 29 .657   29
## 30 .792   30
## 31 .629   30
## 32 .721   32
## 33 .667   33
## 34 .750   33
## 35 .694   33
## 36 .717   36
## 37 .676   37
## 38 .707   38
## 39 .596   38
## 40 .655   38
scoring_game<-scoring_game[,-1]

This isn't the full dataset, since ESPN spreads their tables across multiple pages. So we'd need to write additional code to scrape data from all of these tables and combine them. But this should be enough to get you started with scraping data from HTML tables. (Look for future posts digging into these different topics.)
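
In the meantime, here's a rough sketch of what such a loop might look like. The /count/ offsets are an assumption about how ESPN paginates this table (40 teams per page), so check the site's next-page links before relying on them; the object names are mine.

library(XML)
base_url<-"http://www.espn.com/mens-college-basketball/statistics/team/_/stat/scoring-per-game/sort/avgPoints/count/"
offsets<-c(1, 41, 81)   # hypothetical page offsets: verify against the site
pages<-lapply(offsets, function(i) readHTMLTable(paste0(base_url, i), which=1))
all_scoring<-do.call(rbind, pages)
# Drop the repeated header rows (their RK cell should just repeat the column name)
all_scoring<-all_scoring[all_scoring$RK != "RK",]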

Last, let's lightly scratch the surface of reading in some unstructured data. First, we could read in a text file from a URL. Project Gutenberg includes tens of thousands of free books.1 We can read in the text of Mary Shelley's Frankenstein, using the readLines function.

Frankenstein<-readLines('http://www.gutenberg.org/files/84/84-0.txt')


Now we have an R object with every line from the text file. The beginning and end of the file contain standard Project Gutenberg text, which we probably want to remove before we start any analysis of this file. We can find the beginning and end with grep.

grep("START", Frankenstein)
## [1]   22 7493
grep("END", Frankenstein)
## [1] 7460 7781


We can use those row numbers to delete all of the standard text, leaving only the manuscript itself in a data frame (which will require some additional cleaning before we begin analysis).

Frankensteintext<-data.frame(Frankenstein[23:7459])   # keep only the lines between the START and END markers
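
If you want a head start on that cleaning, here's a minimal sketch (assuming the tidytext package is installed; the object names are mine) that drops blank lines and splits the text into one lowercase word per row:

library(tidytext)
frank_text<-data.frame(text=Frankenstein[23:7459], stringsAsFactors=FALSE)
frank_words<-frank_text %>%
  filter(text != "") %>%                  # drop blank lines
  unnest_tokens(word, text) %>%           # one word per row, lowercased
  anti_join(stop_words, by="word")        # remove common stop words
frank_words %>% count(word, sort=TRUE)    # most frequent words in the novel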

We can also use readLines to read in HTML files, though the results are a bit messy, and an HTML parser is needed. To demonstrate, the last example involves a classical psychology text, On the Witness Stand by Hugo Munsterberg, considered by many (including me) to be the founder of psychology and law as a subfield. His work, which is a collection of essays on the role psychologists can play in the courtroom, is available full-text online. It's a quick, fascinating read for anyone interested in applied psychology and/or law. Let's read in the first part of his book, the introduction.

Munsterberg1<-readLines("http://psychclassics.yorku.ca/Munster/Witness/introduction.htm")

This object contains his text, but also HTML code, which we want to remove. We can do this with the htmlParse function.

library(XML)
library(RCurl)
## Loading required package: bitops
## 
## Attaching package: 'RCurl'
## The following object is masked from 'package:tidyr':
## 
##     complete
Mdoc<-htmlParse(Munsterberg1, asText=TRUE)
plain.text<-xpathSApply(Mdoc, "//p", xmlValue)
Munsterberg<-data.frame(plain.text)

This separates the text between opening and closing paragraph tags into cells in a data frame. We would need to do more cleaning and work on this file to use it for analysis. Another post for another day!

1There's an R package called gutenbergr that gives you direct access to Project Gutenberg books. If you'd like to analyze some of those works, it's best to start there. As usual, this was purely for demonstration.
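
For instance, a minimal sketch (assuming gutenbergr is installed; 84 is Frankenstein's Project Gutenberg ID):

library(gutenbergr)
frankenstein_gb<-gutenberg_download(84)   # returns a tibble with gutenberg_id and text columns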

Monday, April 23, 2018

T is for tibble

For the letter D, I introduced data frames, a built-in R object type. But as I've learned more about R and, in particular, the tidyverse - most recently when I finally started reading Text Mining with R: A Tidy Approach - I learned about a more modern version of the R data frame: the tibble.

According to the tibble overview on the tidyverse website:
Tibbles are data.frames that are lazy and surly: they do less (i.e. they don't change variable names or types, and don't do partial matching) and complain more (e.g. when a variable does not exist). This forces you to confront problems earlier, typically leading to cleaner, more expressive code.
What does this mean? Well, remember when I noted that a character variable in my measures data frame had been changed to a factor? I manually changed it back to character. But had I simply created a tibble with that information, I wouldn't have had to do anything. Data frames will also do partial matching on variable names with $ - so if I requested Facebook$R and only one variable name started with R, it would silently return that variable. If I tried that with a tibble, I'd get NULL (and a warning), because a tibble matches variable references literally.
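
Here's a quick illustration of the difference (the toy column name is mine):

library(tibble)
df<-data.frame(Rumination=1:3)
df$Rum    # partial match: silently returns the Rumination column
tb<-tibble(Rumination=1:3)
tb$Rum    # no partial matching: returns NULL (recent tibble versions also warn)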

There are a few ways to create a tibble, one using the tibble package and the other using the readr package. Fortunately, you don't need to worry about that, because we're just going to use the tidyverse package, which contains those two and more.

install.packages("tidyverse")
## Installing package into '~/R/win-library/3.4'
## (as 'lib' is unspecified)
library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag():    dplyr, stats

First, let's create a new tibble from scratch. The syntax is almost exactly the same as it was in the data frame post.

measures<-tibble(
  meas_id = c(1:6),
  name = c("Ruminative Response Scale","Savoring Beliefs Inventory",
           "Satisfaction with Life Scale","Ten-Item Personality Measure",
           "Cohen-Hoberman Inventory of Physical Symptoms",
           "Center for Epidemiologic Studies Depression Scale"),
  num_items = c(22,24,5,10,32,16),
  rev_items = c(FALSE, TRUE, FALSE, TRUE, FALSE, TRUE)
)
measures
## # A tibble: 6 x 4
##   meas_id                                              name num_items
##     <int>                                             <chr>     <dbl>
## 1       1                         Ruminative Response Scale        22
## 2       2                        Savoring Beliefs Inventory        24
## 3       3                      Satisfaction with Life Scale         5
## 4       4                      Ten-Item Personality Measure        10
## 5       5     Cohen-Hoberman Inventory of Physical Symptoms        32
## 6       6 Center for Epidemiologic Studies Depression Scale        16
## # ... with 1 more variables: rev_items <lgl>

As you can see, the name variable is character, not factor. I didn't have to do anything. Alternatively, you could convert an existing data frame, whether it's one you created or one that came with R/an R package.

car<-as_tibble(mtcars)
car
## # A tibble: 32 x 11
##      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##  * <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1  21.0     6 160.0   110  3.90 2.620 16.46     0     1     4     4
##  2  21.0     6 160.0   110  3.90 2.875 17.02     0     1     4     4
##  3  22.8     4 108.0    93  3.85 2.320 18.61     1     1     4     1
##  4  21.4     6 258.0   110  3.08 3.215 19.44     1     0     3     1
##  5  18.7     8 360.0   175  3.15 3.440 17.02     0     0     3     2
##  6  18.1     6 225.0   105  2.76 3.460 20.22     1     0     3     1
##  7  14.3     8 360.0   245  3.21 3.570 15.84     0     0     3     4
##  8  24.4     4 146.7    62  3.69 3.190 20.00     1     0     4     2
##  9  22.8     4 140.8    95  3.92 3.150 22.90     1     0     4     2
## 10  19.2     6 167.6   123  3.92 3.440 18.30     1     0     4     4
## # ... with 22 more rows

But chances are you'll be reading in data from an external file. The readr package can handle delimited and fixed width files. For instance, to read in the Facebook dataset I've been using, I just need the function read_tsv.

Facebook<-read_tsv("small_facebook_set.txt",col_names=TRUE)
## Parsed with column specification:
## cols(
##   .default = col_integer()
## )
## See spec(...) for full column specifications.
Facebook
## # A tibble: 257 x 111
##       ID gender  Rum1  Rum2  Rum3  Rum4  Rum5  Rum6  Rum7  Rum8  Rum9
##    <int>  <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
##  1     1      1     3     1     3     2     3     1     2     1     1
##  2     2      1     1     1     1     1     1     1     0     0     1
##  3     3      1     4     3     3     4     3     4     2     3     3
##  4     4      0     4     0     0     2     0     0     4     0     2
##  5     5      1     2     2     2     1     2     1     1     1     1
##  6     6      0     2     4     3     4     2     3     2     2     3
##  7     7      1     1     2     3     2     0     2     3     1     2
##  8     8      0     2     1     1     2     0     2     3     3     3
##  9     9      1     4     1     4     4     3     2     2     1     1
## 10    10      1     4     2     0     3     4     2     4     1     2
## # ... with 247 more rows, and 100 more variables: Rum10 <int>,
## #   Rum11 <int>, Rum12 <int>, Rum13 <int>, Rum14 <int>, Rum15 <int>,
## #   Rum16 <int>, Rum17 <int>, Rum18 <int>, Rum19 <int>, Rum20 <int>,
## #   Rum21 <int>, Rum22 <int>, Sav1 <int>, Sav2 <int>, Sav3 <int>,
## #   Sav4 <int>, Sav5 <int>, Sav6 <int>, Sav7 <int>, Sav8 <int>,
## #   Sav9 <int>, Sav10 <int>, Sav11 <int>, Sav12 <int>, Sav13 <int>,
## #   Sav14 <int>, Sav15 <int>, Sav16 <int>, Sav17 <int>, Sav18 <int>,
## #   Sav19 <int>, Sav20 <int>, Sav21 <int>, Sav22 <int>, Sav23 <int>,
## #   Sav24 <int>, LS1 <int>, LS2 <int>, LS3 <int>, LS4 <int>, LS5 <int>,
## #   Extraverted <int>, Critical <int>, Dependable <int>, Anxious <int>,
## #   NewExperiences <int>, Reserved <int>, Sympathetic <int>,
## #   Disorganized <int>, Calm <int>, Conventional <int>, Health1 <int>,
## #   Health2 <int>, Health3 <int>, Health4 <int>, Health5 <int>,
## #   Health6 <int>, Health7 <int>, Health8 <int>, Health9 <int>,
## #   Health10 <int>, Health11 <int>, Health12 <int>, Health13 <int>,
## #   Health14 <int>, Health15 <int>, Health16 <int>, Health17 <int>,
## #   Health18 <int>, Health19 <int>, Health20 <int>, Health21 <int>,
## #   Health22 <int>, Health23 <int>, Health24 <int>, Health25 <int>,
## #   Health26 <int>, Health27 <int>, Health28 <int>, Health29 <int>,
## #   Health30 <int>, Health31 <int>, Health32 <int>, Dep1 <int>,
## #   Dep2 <int>, Dep3 <int>, Dep4 <int>, Dep5 <int>, Dep6 <int>,
## #   Dep7 <int>, Dep8 <int>, Dep9 <int>, Dep10 <int>, Dep11 <int>,
## #   Dep12 <int>, Dep13 <int>, Dep14 <int>, Dep15 <int>, Dep16 <int>

Finally, if you're working with SAS, SPSS, or Stata files, you can read those in with the tidyverse package, haven, and the functions read_sas, read_sav, and read_dta, respectively.
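
For example (the file names below are made up; swap in your own paths):

library(haven)
spss_data<-read_sav("my_study.sav")        # SPSS
stata_data<-read_dta("my_study.dta")       # Stata
sas_data<-read_sas("my_study.sas7bdat")    # SAS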

If for some reason you need a data frame rather than a tibble, you can convert a tibble to a data frame with as.data.frame(tibble_name).

You can learn more about tibbles here and here.

Sunday, April 22, 2018

Statistics Sunday: Using semPlot

Today's post will be mostly demonstration, but I'll build on some of the things I covered in yesterday's semPlot post. This month, I've blogged about two SEM models: confirmatory factor analysis and latent variable path analysis. Using the models from those posts, I'll show how to diagram them in semPlot and how to make some changes to the appearance of the plots to make them presentation ready.

Facebook<-read.delim(file="small_facebook_set.txt", header=TRUE)

First up, confirmatory factor analysis. As part of that post, I tested models with the Satisfaction with Life Scale and Ruminative Response Scale. In fact, I gave a sneak preview of semPlot with the SWLS model.

SWL_Model<-'SWL =~ LS1 + LS2 + LS3 + LS4 + LS5'
library(lavaan)
## This is lavaan 0.5-23.1097
## lavaan is BETA software! Please report any bugs.
SWL_Fit<-cfa(SWL_Model, data=Facebook)
library(semPlot)
semPaths(SWL_Fit)

This diagram is fine for quickly showing what the model looks like, but we'd want to tweak it before using it in a presentation or publication. First up, I used very abbreviated variable names, so semPlot has no trouble displaying them all in the diagram. But I might want more descriptive names for my variables, and I probably want to do that without having to rename variables or rewrite my model.

labels<-c("Ideal Life","Excellent","Satisfied","Important","Change","SWL\nScale")
semPaths(SWL_Fit,nodeLabels=labels,sizeMan=10)

I've selected a keyword (or two) for each SWLS item, and made that the variable name. For instance, item 1 text is, "In most ways, my life is close to ideal." When creating your labels object, you want to put the y-variables in the order in which they appear in the equation(s), followed by x-variables, again in the order in which they appear.

We probably want to add a title and our parameter estimates. I'll use standardized estimates, since these are a bit easier to interpret. But there are a few other changes I'd like to make. semPlot automatically fades based on size of the parameter estimates, so larger estimates are darker than smaller estimates. It also changes the width of paths based on size, including for error estimates; so larger error means a thicker, darker line. Fortunately, I can turn these features off. I can also change the size of the arrowheads. (BTW, Man refers to manifest (or observed) variables, and Lat refers to latent variables (or factors); so these size arguments change the width of observed and latent variables, respectively.)

semPaths(SWL_Fit, what="std", edge.label.cex=0.75, edge.color="black",
         nodeLabels=labels, sizeMan=10, sizeLat=10, fade=FALSE, esize=2, asize=2)
title("Diener Satisfaction with Life Scale CFA", line=3)

Feel free to play around with the numbers I've selected to see how it affects the final look.

But then, the Satisfaction with Life Scale is a short measure, and this is a small, simple model. What happens if I throw a much larger model at semPlot? Including full labels for the observed variables might overwhelm a large model, so I may simply want to use item numbers only and label only the factors.

RRS_Model<- '
  Depression =~ Rum1 + Rum2 + Rum3 + Rum4 + Rum6 + Rum8 + 
    Rum9 + Rum14 + Rum17 + Rum18 + Rum19 + Rum22
  Reflecting =~ Rum7 + Rum11 + Rum12 + Rum20 + Rum21
  Brooding =~ Rum5 + Rum10 + Rum13 + Rum15 + Rum16
'
RRS_Fit<-cfa(RRS_Model, data=Facebook)
rrslabels<-c(1:4,6,8,9,14,17:19,22,7,11,12,20,21,5,10,13,15,16,
             "Depression","Reflecting","Brooding")
RRS<-semPaths(RRS_Fit, what="par", whatLabels="hide", nodeLabels=rrslabels,
              sizeLat=12, sizeMan=4.5, edge.label.cex=0.75, edge.color="black", asize=2)

Adding estimates to this diagram would probably make it difficult to read, so personally, I would probably create a table with the actual parameter estimates and use the diagram for display purposes only. Instead, I allowed fading, so that stronger relationships are darker than weaker relationships. This is done by asking semPlot to compute the parameter estimates (what="par") but then to hide their labels (whatLabels="hide").

What about for even more complex models, like a latent variable path model? In that post, I tested a structural regression with rumination and depression.

Rum3_Dep<-'
Depression =~ Dep1 + Dep2 + Dep3 + Dep4 + Dep5 + Dep6 + Dep7 + Dep8 +
              Dep9 + Dep10 + Dep11 + Dep12 + Dep13 + Dep14 + Dep15 + Dep16
DRR =~ Rum1 + Rum2 + Rum3 + Rum4 + Rum6 + Rum8 + Rum9 + Rum14 + Rum17 + Rum18 + 
              Rum19 + Rum22
Reflecting =~ Rum7 + Rum11 + Rum12 + Rum20 + Rum21
Brooding =~ Rum5 + Rum10 + Rum13 + Rum15 + Rum16
Depression ~ DRR + Reflecting + Brooding
'
RD3<-sem(Rum3_Dep, data=Facebook)
semPaths(RD3)

For this type of model, I tend to prefer a different rotation, with the x-variables on the left and the y-variables on the right. I also want to customize my labels. Remember, y goes before x, but observed go before latent, so the order is: y-observed, x-observed, y-latent, x-latent. I may also use lisrel-style errors, which are simple arrows rather than curved double-pointed arrows, and I may change the appearance of the covariances for the exogenous latent variables, so they don't get as lost.

rrsdlabels<-c(1:16,1:4,6,8,9,14,17:19,22,7,11,12,20,21,5,10,13,15,16,
              "Depression","Dep-Related","Reflecting","Brooding")
semPaths(RD3, rotation=2, nodeLabels=rrsdlabels, sizeMan=3,
         style="lisrel", curvePivot=TRUE, edge.color="black")
title("Rumination and Depression Structural Regression Model")

If all paths are significant, it's okay not to display parameter estimates or fading. I used fading for the measurement model, but I could just as easily have done this instead: add text indicating that everything is significant. Pretend for the sake of argument that this is true for this model. We could easily add this descriptive text to the model drawing.

semPaths(RD3, rotation=2, nodeLabels=rrsdlabels, sizeMan=3, style="lisrel",
         curvePivot=TRUE, edge.color="black")
title("Rumination and Depression Structural Regression Model")
text(0,-.9,"All paths and variances significant, p<0.05")

Either approach is fine - it really depends on what information you want to communicate. Do you want to demonstrate which items best measure the underlying construct? Or do you simply want to show that all items significantly contribute to the measurement of the construct? The same goes for LVPA; do you want to show what variables are the strongest predictors or just that all variables are significant predictors? In this particular case, not all paths are significant - brooding and reflecting do not significantly predict depression. So I may want a different approach to highlight this fact.

semPaths(RD3, rotation=2, nodeLabels=rrsdlabels, sizeMan=3, style="lisrel",
         curvePivot=TRUE, edge.color="black", what="par", whatLabels="hide")
title("Rumination and Depression Structural Regression Model")

With the fading on, we see that Depression-Related Rumination is the strongest predictor of Depression. Reflecting is the weakest predictor, but both brooding and reflecting are non-significant, and the paths are very light.

Back to A to Z posts tomorrow, where I'll talk about a new data structure - tibbles!

Saturday, April 21, 2018

S is for semPlot Package

Today's post is on my favorite, "where have you been all my statistical life?" package, semPlot. When I took my first structural equation modeling course 11 years ago, for our final project, we had to use one of the models we learned in the course on a set of data and write up a report. I remember spending way too much time creating the diagram to accompany my report - if memory serves, this is the first time I ever requested an extension on a project or paper, because it took even more time than I realized it would.

Those days are done, thanks to this amazing package, which takes your fitted model and generates a diagram. There are many ways to customize the appearance, which I'll get into shortly. But first, let's grab some data to use for examples. For my Statistics Sunday post, I'll go through using semPlot with the CFA and LVPA from previous A to Z posts. So for today, I'll use some of the sample datasets that come with lavaan, my favorite SEM package.

library(lavaan)
## This is lavaan 0.5-23.1097
## lavaan is BETA software! Please report any bugs.
data(PoliticalDemocracy)

This is the Industrialization and Political Democracy dataset, which contains measures of political democracy and industrialization in developing countries:
  • y1 = Expert ratings of the freedom of the press in 1960
  • y2 = Freedom of political opposition in 1960
  • y3 = Fairness of elections in 1960
  • y4 = Effectiveness of the elected legislature in 1960
  • y5 = Expert ratings of the freedom of the press in 1965
  • y6 = Freedom of political opposition in 1965
  • y7 = Fairness of elections in 1965
  • y8 = Effectiveness of the elected legislature in 1965
  • x1 = Gross national product per capita in 1960
  • x2 = Inanimate energy consumption per capita in 1960
  • x3 = Percentage of the labor force in industry in 1960

Essentially, the x variables measure industrialization. The y variables measure political democracy at two time points. We can fit a simple measurement model using the 1960 political democracy variables.

Free.Model <- 'Fr60 =~ y1 + y2 + y3 + y4'
fit <- sem(Free.Model, data=PoliticalDemocracy)
summary(fit, standardized=TRUE, fit.measures=TRUE)
## lavaan (0.5-23.1097) converged normally after  26 iterations
## 
##   Number of observations                            75
## 
##   Estimator                                         ML
##   Minimum Function Test Statistic               10.006
##   Degrees of freedom                                 2
##   P-value (Chi-square)                           0.007
## 
## Model test baseline model:
## 
##   Minimum Function Test Statistic              159.183
##   Degrees of freedom                                 6
##   P-value                                        0.000
## 
## User model versus baseline model:
## 
##   Comparative Fit Index (CFI)                    0.948
##   Tucker-Lewis Index (TLI)                       0.843
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)               -704.138
##   Loglikelihood unrestricted model (H1)       -699.135
## 
##   Number of free parameters                          8
##   Akaike (AIC)                                1424.275
##   Bayesian (BIC)                              1442.815
##   Sample-size adjusted Bayesian (BIC)         1417.601
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          0.231
##   90 Percent Confidence Interval          0.103  0.382
##   P-value RMSEA <= 0.05                          0.014
## 
## Standardized Root Mean Square Residual:
## 
##   SRMR                                           0.046
## 
## Parameter Estimates:
## 
##   Information                                 Expected
##   Standard Errors                             Standard
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   Fr60 =~                                                               
##     y1                1.000                               2.133    0.819
##     y2                1.404    0.197    7.119    0.000    2.993    0.763
##     y3                1.089    0.167    6.529    0.000    2.322    0.712
##     y4                1.370    0.167    8.228    0.000    2.922    0.878
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##    .y1                2.239    0.512    4.371    0.000    2.239    0.330
##    .y2                6.412    1.293    4.960    0.000    6.412    0.417
##    .y3                5.229    0.990    5.281    0.000    5.229    0.492
##    .y4                2.530    0.765    3.306    0.001    2.530    0.229
##     Fr60              4.548    1.106    4.112    0.000    1.000    1.000

While the RMSEA and TLI show poor fit, CFI and SRMR show good fit. We'll ignore fit measures for now, though, and instead show how semPlot can be used to create a diagram of this model.
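
(As an aside, if you later want to pull just those indices, lavaan's fitMeasures function returns them by name:)

fitMeasures(fit, c("cfi","tli","rmsea","srmr"))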

library(semPlot)
semPaths(fit)


Now we have a basic diagram of our model. The dotted line from y1 to Fr60 indicates that this loading was fixed (in this case, at 1). The rounded arrows pointing at each of the variables reflect their variances.

We could honestly stop here, but why? Let's have some fun with semPlot. First of all, we could give it a title. By adding line=3, I keep the title from appearing right on top of the latent variable (and covering up the variance).

semPaths(fit)
title("Political Democracy in 1960", line=3)


I could also add the parameter estimates to my figure. They default to large green text, so I prefer to change it to black (edge.color="black") and a smaller font size (edge.label.cex=0.75).

semPaths(fit, what="par", edge.label.cex=0.75, edge.color="black")
title("Political Democracy in 1960", line=3)

Now we're getting somewhere, though I may bump the estimates' font size up slightly. I'm not a huge fan of this particular dataset, because the variable names are not descriptive. And semPlot seriously truncates variable names in the figure. But you can easily override both of these issues. Let's give our dataset better column names, tell semPlot not to truncate those names (with nCharNodes; setting it to 0 tells semPlot not to limit name length), and enlarge the observed-variable nodes so the longer names fit (with sizeMan, which we'll set to 10). I'll also ask for standardized estimates in this version.

colnames(PoliticalDemocracy)<-c("Press.60","PolOp.60","Elect.60","Leg.60",
                                "Press.65","PolOp.65","Elect.65","Leg.65",
                                "GNP.60","Energy.60","Labor.60")
Free.Model <- 'Fr60 =~ Press.60 + PolOp.60 + Elect.60 + Leg.60'
fit <- sem(Free.Model, data=PoliticalDemocracy)
semPaths(fit, what="std",nCharNodes=0, edge.label.cex=1, sizeMan=10, edge.color="black")
title("Political Democracy in 1960", line=3)


Other fun tricks:

semPaths offers several layout options, including tree (the default), circle, and spring. Here's what the latter two look like.

semPaths(fit, layout="circle")

semPaths(fit, layout="spring")

And we can also change orientation.

semPaths(fit, rotation=1) #This is also the default

semPaths(fit, rotation=2)

semPaths(fit, rotation=3)

semPaths(fit, rotation=4)


We can change the appearance of errors. "mx" is the default, but another option is "lisrel".

semPaths(fit, style="lisrel")


And the final trick I'll show: we can add some descriptive text to our model drawing.

semPaths(fit,rotation=2)
text(-0.5,-1.0,labels="Using the Bollen 1989\n Industrialization Political Democracy dataset\navailable in the lavaan package")

I'm just scratching the surface of what this package can do. We can change the appearance of parameter estimates, thickness of lines, and give whatever labels we want. More on that tomorrow, when I use the semPlot package to work through generating final diagrams for my CFA and LVPA models!

Friday, April 20, 2018

R is for R Origin Story

An important place in the history of statistics is AT&T Bell Laboratories. And one of the key parts of that story is the development of a language for statistical computing called S.

Prior to 1975 or so, statistical researchers at Bell Labs used Fortran for their statistical computing. But from 1975 to 1976, John Chambers, Rick Becker, and Allan Wilks developed a language written in Fortran that would allow for more interactive analysis. Though they threw around many names, including SCS (Statistical Computing System), they settled on S - this is, after all, the same institution that brought you the C programming language, so single letter names were kind of a thing. In 1988, they created New S, switching some of the internal functions from Fortran to C.

Out of S grew S-Plus, a commercial statistical computing language, which also arrived on the scene in 1988. And in 1993, an open-source implementation of the S language appeared: R, created by Ross Ihaka and Robert Gentleman. The syntax of New S, S-Plus, and R is very similar - in fact, much of the same syntax will run in all three languages. But because R is open source, anyone who wants it can access it and its source code under the GNU General Public License.

BTW, these are my GNU coding buddies - their names are Gus (left) and Gary (right), and yes, I can tell the difference between them. Gary has a wider face.
R is an interpreted language with features of object-oriented and functional programming. It, of course, can be used as a big calculator, but there are much more powerful things you can do with R. Personally, I haven't even begun to scratch the surface of some of these capabilities - both on this blog and as an R user. For instance, I still struggle with some of the programming concepts like creating loops.

R is maintained by the R Development Core Team and supported by the R Foundation, a not-for-profit organization. Because of the terms of the GNU GPL, anyone who wants to can access the source code, make changes, and even distribute their own version - though, also because of the terms of the GPL, they must make their altered source code available. The purpose is to put power back in the hands of the people using the software and language. And anyone who wants to can develop R packages that extend the capabilities of R.

I've had colleagues criticize the open source nature of R because of concerns about quality: anyone can write an R package. But this is a collaborative project, and anyone can access the source code, meaning that if there are issues, someone is very likely to find and fix them (or weed out the bad). Not to mention - knock on wood - I've never had any major issues with bugs in R, and yet I've purchased expensive proprietary software riddled with bugs. (Some of us may remember the SPSS version 16 release, which would often refuse to save, or save and then delete, your data files. Sometimes you wouldn't know it until you went to open your data, only to find the file missing. That was probably what finally made me dump SPSS back in 2009. It's also why, when I went back to using proprietary statistical software during my time at the VA, I opted to learn Stata on my own rather than go back to SPSS. I use SPSS once or twice a week in my current job, but I'm trying to move us toward using R whenever possible.)

What are some things you should know about working with R?
  • When writing up results for publication, always clearly state that you used R for analysis and the exact version number. I'm currently running 3.4.3, but I believe 3.4.4 recently dropped. Be sure to provide a citation in your writeup as well.
  • Always state the exact R packages you used and provide citations for those as well - if there is a paper about the package available in a scholarly journal, cite that. For example, if you used lavaan, you'd want to provide a citation to this paper. (The snippet after this list shows how to look up version and citation information from within R.)
  • Write code for everything you do in R, especially when working with a single dataset, and save it as an R script. This keeps you from losing track of how you did things that you'll very likely have to recreate at some point, and it's invaluable to be able to lift from your own code for analyses on different datasets. Part of my motivation of writing this particular blog series is so I can easily refer back to posts to remind myself how to code something specific.
  • When using code someone else has written, change things. Break stuff. See what happens if you change the name of something or the number of loops or whatever. This is how you learn what parts of the code are fixed and what you can change for your needs. And of course, try combining code from different sources. You may end up with perfectly functioning code; you may end up with Frankenstein's monster. Either way, you'll learn something.
  • If you can't determine the meaning of an error message, try Googling it verbatim. Chances are, someone has written about that particular error.
  • Know that there is a steep learning curve and you will make mistakes and/or end up frustrated. Stick to it and know that there's a huge community out there willing to help you.
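
A quick way to grab the version and citation information mentioned above, right from the console:

R.version.string      # the exact R version to report
citation()            # how to cite R itself
citation("lavaan")    # how to cite a specific package
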
Sound off, R users: what do you think people should know about working with R?