Tuesday, April 24, 2018

U is for (Data From) URLs

Working with URLs in R

Up to now, we've been working with files saved locally on your computer. That limits you to data that can be easily saved to your computer and, so far, to structured data. As we move from pure statistics to a data science approach, we'll increasingly be working with data stored somewhere in the cloud. Today, I'd like to introduce how to begin working with data available from URLs. There are many directions you can take, depending on the type of data you're working with and how you plan to analyze it, so we'll only be scratching the surface of what could lead to several more advanced topics.

URL stands for Uniform Resource Locator. It can refer to a webpage (http), file transfer protocol (ftp), email (mailto), and any number of other resources. We'll focus today on collecting data from webpages.

Let's start with the simplest way to access data at a URL - using the same syntax we've used to read delimited files saved locally. For this example, I'm using Chris Albon's War of the Five Kings dataset, which contains data on battles from George R.R. Martin's A Song of Ice and Fire series (Game of Thrones, for those of you familiar with the HBO show based on the series).

The data is in a CSV file, and contains data on 25 variables for 38 battles. Unfortunately, the dataset hasn't been updated since 2014, but it's based on the books rather than the show. It's fine for our present example. Since it's a CSV file, we can use a read CSV function to read the data into an R data frame. (I selected read_csv from the tidyverse, to create a tibble.) Where we would normally put the path for our locally saved file, we use the path to the CSV file saved on GitHub.

library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.4.4
## -- Attaching packages -------------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 2.2.1     v purrr   0.2.4
## v tibble  1.4.2     v dplyr   0.7.4
## v tidyr   0.8.0     v stringr 1.2.0
## v readr   1.1.1     v forcats 0.2.0
## Warning: package 'tibble' was built under R version 3.4.4
## Warning: package 'tidyr' was built under R version 3.4.4
## Warning: package 'purrr' was built under R version 3.4.4
## -- Conflicts ----------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
Battles<-read_csv('https://raw.githubusercontent.com/chrisalbon/war_of_the_five_kings_dataset/master/5kings_battles_v1.csv')
## Parsed with column specification:
## cols(
##   .default = col_character(),
##   year = col_integer(),
##   battle_number = col_integer(),
##   major_death = col_integer(),
##   major_capture = col_integer(),
##   attacker_size = col_integer(),
##   defender_size = col_integer(),
##   summer = col_integer()
## )
## See spec(...) for full column specifications.
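
As an aside, reading straight from the URL downloads the file again every time the script runs. One way around that, sketched here, is to save a local copy once and read from it afterwards. The helper name read_cached and the local file name are my own placeholders (they are not part of readr or of the code above), and I use base R's read.csv to keep the sketch dependency-free; read_csv would slot in the same way.

```r
# Hypothetical helper (not from any package): download a CSV once, then reuse the local copy.
read_cached <- function(url, dest) {
  if (!file.exists(dest)) {
    # Only go out to the URL when no local copy exists yet
    download.file(url, destfile = dest, quiet = TRUE)
  }
  read.csv(dest, stringsAsFactors = FALSE)
}

# Usage with the same URL as above; "5kings_battles_v1.csv" is an assumed local file name
# Battles <- read_cached(
#   "https://raw.githubusercontent.com/chrisalbon/war_of_the_five_kings_dataset/master/5kings_battles_v1.csv",
#   "5kings_battles_v1.csv"
# )
```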

You should now have a data frame called Battles, containing 38 observations on 25 variables. You could run some simple descriptives on the dataset, but a fun analysis might be to adopt a network analysis approach and map out the attacker and defender kings, to show who is attacking whom and demonstrate the insanity that is ASoIaF. Because that kind of nerdy analysis sounded fun, I went ahead and did it, basing my code on code from this post. To keep this post from getting overly long, I made this a separate post, available here.
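
To give a flavor of the "who is attacking whom" idea without the full network analysis, here is a minimal sketch that counts battles per attacker/defender pair. It assumes the dataset has columns named attacker_king and defender_king (check names(Battles) to confirm), and it runs on a small made-up data frame with placeholder kings so that it works without the download.

```r
library(dplyr)

# Toy stand-in for Battles; the column names are assumptions and the rows are placeholders, not real data
battles_toy <- data.frame(
  attacker_king = c("King A", "King A", "King B", "King C", "King A"),
  defender_king = c("King B", "King B", "King A", "King A", "King C"),
  stringsAsFactors = FALSE
)

# One row per attacker -> defender pair, with the number of battles between them
edges <- battles_toy %>%
  count(attacker_king, defender_king, sort = TRUE, name = "battles")

edges
```

Swapping battles_toy for Battles should give an edge list that a network package can take from there.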

Another way to access data from URLs is to read in an HTML table. ESPN provides tons of sports-related data on their site, often in the form of HTML tables. Let's practice pulling some of the NCAA Division I men's basketball data, specifically their team scoring per game. This table also provides a few other variables, like percentages of field goals, 3-pointers, and free throws made. One of the best ways to cleanly scrape these data is with the XML package.

library(XML)
scoring_game<-readHTMLTable('http://www.espn.com/mens-college-basketball/statistics/team/_/stat/scoring-per-game/sort/avgPoints/count/', which=1)

The which argument tells which HTML table to read in. There's only 1 on the page, but without specifying, readHTMLTable will read the entire page for all tables and create a list object that contains each table as a separate data frame, embedded within the list (even if it only finds 1 table). But since I wanted to go straight to the data frame without going through the list first, I specified the table to read. Now we have a data frame called scoring_game, with columns named after the column headings from the HTML table.

There are a few quirks to this data frame we want to fix. First, every 10 rows, ESPN adds an additional header row, which R read in as a data row. I want to drop those, since they don't contain any data. Second, each team is ranked, but when there are ties, the rank cell is blank, which R filled in with a special character (Â). Rather than manually cleaning up these data, I'll just have R generate new ranks for this data frame, allowing for ties, and drop the RK variable.

scoring_game<-scoring_game[-c(11,22,33),]
library(tidyverse)
scoring_game %>% mutate(rank = min_rank(desc(PTS)))
##    RK              TEAM GP  PTS   FGM-FGA  FG%   3PM-3PA  3P%   FTM-FTA
## 1   1     UNC Asheville  1 98.0 35.0-75.0 .467 14.0-31.0 .452 14.0-18.0
## 2   2            Oregon  2 95.5 31.0-55.0 .564  9.0-19.5 .462 24.5-32.5
## 3   3           Wofford  1 94.0 34.0-63.0 .540 15.0-31.0 .484 11.0-16.0
## 4   4           Seattle  1 90.0 33.0-72.0 .458 12.0-31.0 .387 12.0-21.0
## 5   5               USC  2 89.0 34.5-70.0 .493 10.0-25.5 .392 10.0-21.0
## 6   Â        Fort Wayne  1 89.0 33.0-60.0 .550 11.0-33.0 .333 12.0-15.0
## 7   7  Central Michigan  3 87.7 30.7-63.7 .482 14.7-36.3 .404 11.7-14.3
## 8   8        Miami (OH)  1 87.0 33.0-60.0 .550 13.0-26.0 .500   8.0-9.0
## 9   9        Seton Hall  2 86.5 28.5-61.0 .467  8.5-22.5 .378 21.0-27.5
## 10 10            Xavier  2 86.0 29.0-56.5 .513  8.0-18.5 .432 20.0-31.0
## 11  Â             Rider  1 86.0 35.0-76.0 .461  6.0-22.0 .273 10.0-16.0
## 12 12     West Virginia  3 85.7 30.7-66.0 .465  7.7-21.3 .359 16.7-21.7
## 13 13 Northern Colorado  4 85.5 30.3-60.8 .498 10.0-26.0 .385 15.0-23.3
## 14 14         Villanova  6 83.8 27.5-58.0 .474 12.7-30.5 .415 16.2-19.8
## 15 15          NC State  1 83.0 28.0-61.0 .459 11.0-30.0 .367 16.0-27.0
## 16  Â             Texas  1 83.0 31.0-67.0 .463 11.0-24.0 .458 10.0-18.0
## 17  Â               BYU  1 83.0 34.0-76.0 .447  6.0-27.0 .222  9.0-15.0
## 18  Â     Virginia Tech  1 83.0 30.0-54.0 .556  9.0-18.0 .500 14.0-20.0
## 19 19         Marquette  3 82.7 27.7-56.0 .494 10.3-23.7 .437 17.0-19.0
## 20 20       North Texas  6 82.5 28.7-60.3 .475  8.7-22.8 .380 16.5-22.8
## 21  Â        Ohio State  2 82.5 28.5-69.0 .413 11.5-33.0 .348 14.0-17.5
## 22 22           Buffalo  2 82.0 30.0-64.5 .465 11.0-30.5 .361 11.0-14.5
## 23 23              Duke  4 81.5 29.3-61.0 .480  8.8-26.5 .330 14.3-19.5
## 24 24            Kansas  5 80.6 28.2-61.6 .458  9.2-23.4 .393 15.0-20.0
## 25 25       Austin Peay  2 80.5 31.5-68.0 .463  4.5-19.5 .231 13.0-21.5
## 26 26       Utah Valley  2 80.0 28.5-56.5 .504  7.5-17.0 .441 15.5-18.5
## 27 27           Clemson  3 79.7 30.0-61.7 .486  7.3-20.0 .367 12.3-17.0
## 28 28  Middle Tennessee  2 79.5 32.0-55.5 .577  7.5-18.0 .417  8.0-11.0
## 29 29        Washington  2 79.0 29.5-57.5 .513  8.5-20.0 .425 11.5-17.5
## 30 30  Western Kentucky  4 78.5 29.0-59.8 .485  5.3-16.5 .318 15.3-19.3
## 31  Â            Baylor  2 78.5 30.5-59.0 .517  6.5-16.0 .406 11.0-17.5
## 32 32    Oklahoma State  3 78.3 24.7-67.0 .368  8.3-22.7 .368 20.7-28.7
## 33 33          Oklahoma  1 78.0 29.0-69.0 .420  4.0-20.0 .200 16.0-24.0
## 34  Â          Bucknell  1 78.0 23.0-55.0 .418 11.0-20.0 .550 21.0-28.0
## 35  Â          Canisius  1 78.0 23.0-61.0 .377  7.0-29.0 .241 25.0-36.0
## 36 36               LSU  2 77.5 27.5-57.0 .482  6.0-24.0 .250 16.5-23.0
## 37 37      Saint Mary's  3 77.3 29.7-57.0 .520  9.7-20.3 .475  8.3-12.3
## 38 38          Kentucky  3 77.0 26.0-52.3 .497  3.3-11.0 .303 21.7-30.7
## 39  Â         Texas A&M  3 77.0 29.7-59.7 .497  6.3-18.3 .345 11.3-19.0
## 40  Â      South Dakota  1 77.0 26.0-70.0 .371  6.0-33.0 .182 19.0-29.0
##     FT% rank
## 1  .778    1
## 2  .754    2
## 3  .688    3
## 4  .571    4
## 5  .476    5
## 6  .800    5
## 7  .814    7
## 8  .889    8
## 9  .764    9
## 10 .645   10
## 11 .625   10
## 12 .769   12
## 13 .645   13
## 14 .815   14
## 15 .593   15
## 16 .556   15
## 17 .600   15
## 18 .700   15
## 19 .895   19
## 20 .723   20
## 21 .800   20
## 22 .759   22
## 23 .731   23
## 24 .750   24
## 25 .605   25
## 26 .838   26
## 27 .725   27
## 28 .727   28
## 29 .657   29
## 30 .792   30
## 31 .629   30
## 32 .721   32
## 33 .667   33
## 34 .750   33
## 35 .694   33
## 36 .717   36
## 37 .676   37
## 38 .707   38
## 39 .596   38
## 40 .655   38
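# Save the new ranks (the mutate() call above only printed them), then drop the RK variable
scoring_game<-scoring_game %>% mutate(rank = min_rank(desc(PTS))) %>% select(-RK)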
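
The hard-coded row numbers (11, 22, 33) work for this one page, but they break if ESPN changes how often it repeats the header. A slightly more defensive sketch, shown on a small made-up table so it runs offline, drops any row whose TEAM cell is literally "TEAM", converts PTS to numbers (readHTMLTable may hand back text or factors), and then ranks. The toy values are placeholders, not real ESPN data.

```r
library(dplyr)

# Toy stand-in for the scraped table: one repeated header row and one blank rank cell
toy_scores <- data.frame(
  RK   = c("1", "2", "", "RK", "4"),
  TEAM = c("Team A", "Team B", "Team C", "TEAM", "Team D"),
  PTS  = c("98.0", "95.5", "95.5", "PTS", "90.0"),
  stringsAsFactors = FALSE
)

cleaned <- toy_scores %>%
  filter(TEAM != "TEAM") %>%                     # drop repeated header rows, wherever they fall
  mutate(PTS  = as.numeric(as.character(PTS)),   # as.character() first, in case PTS is a factor
         rank = min_rank(desc(PTS))) %>%         # tied teams share the lower rank
  select(-RK)                                    # the original rank column is no longer needed

cleaned
```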

This isn't the full dataset, since ESPN spreads their tables across multiple pages. So we'd need to write additional code to scrape data from all of these tables and combine them. But this should be enough to get you started with scraping data from HTML tables. (Look for future posts digging into these different topics.)
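
As a rough sketch of that "scrape each page and combine" step: read each page into a data frame with lapply, then stack the results with rbind. To keep the example runnable offline, the two "pages" here are tiny made-up HTML strings. With the real site you would build a vector of page URLs instead; I haven't verified how ESPN numbers its pages, so no URL pattern is shown.

```r
library(XML)

# Two made-up "pages", each holding one HTML table (placeholders, not ESPN data)
pages <- c(
  "<html><body><table><tr><th>TEAM</th><th>PTS</th></tr><tr><td>Team A</td><td>90.0</td></tr><tr><td>Team B</td><td>85.5</td></tr></table></body></html>",
  "<html><body><table><tr><th>TEAM</th><th>PTS</th></tr><tr><td>Team C</td><td>80.0</td></tr><tr><td>Team D</td><td>79.5</td></tr></table></body></html>"
)

# Parse one page and return its first table as a data frame
read_one <- function(html) {
  doc <- htmlParse(html, asText = TRUE)
  readHTMLTable(doc, which = 1, header = TRUE)
}

# Read every page, then stack the resulting data frames
all_pages <- do.call(rbind, lapply(pages, read_one))
all_pages
```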

Last, let's lightly scratch the surface of reading in some unstructured data. First, we could read in a text file from a URL. Project Gutenberg includes tens of thousands of free books.[1] We can read in the text of Mary Shelley's Frankenstein, using the readLines function.

Frankenstein<-readLines('http://www.gutenberg.org/files/84/84-0.txt')


Now we have an R object with every line from the text file. The beginning and end of the file contain standard text from Project Gutenberg, which we probably want to remove before we start any analysis of this file. We can find the beginning and end with grep.

grep("START", Frankenstein)
## [1]   22 7493
grep("END", Frankenstein)
## [1] 7460 7781


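We can use those line numbers to delete all of the standard text, leaving only the manuscript itself in a data frame (which will require some additional cleaning before we begin analysis). Lines 22 and 7460 are the lines grep matched (the start and end markers), so we keep only what falls strictly between them.

Frankensteintext<-data.frame(Frankenstein[23:7459])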
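
Searching for bare words like START and END works here, but those words also turn up elsewhere in the file (each grep above returned two hits). A slightly more targeted sketch matches the marker lines themselves and keeps only what lies between them. The exact marker wording is an assumption on my part (I'm assuming the lines begin with *** START OF and *** END OF), and the short vector is a made-up stand-in for the real file.

```r
# Made-up stand-in for the output of readLines(); the marker wording is assumed
book <- c(
  "Header boilerplate",
  "*** START OF THIS PROJECT GUTENBERG EBOOK EXAMPLE ***",
  "First line of the text.",
  "Second line of the text.",
  "*** END OF THIS PROJECT GUTENBERG EBOOK EXAMPLE ***",
  "License boilerplate that mentions the END of the file"
)

# Match only lines that begin with the marker, and take the first hit of each
start <- grep("^\\*\\*\\* ?START OF", book)[1]
end   <- grep("^\\*\\*\\* ?END OF", book)[1]

# Keep what lies strictly between the two marker lines
body_text <- book[(start + 1):(end - 1)]
body_text
```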

We can also use readLines to read in HTML files, though the results are a bit messy, and an HTML parser is needed to clean them up. To demonstrate, the last example involves a classic psychology text, On the Witness Stand by Hugo Munsterberg, whom many (including me) consider the founder of psychology and law as a subfield. The book, a collection of essays on the role psychologists can play in the courtroom, is available full-text online. It's a quick, fascinating read for anyone interested in applied psychology and/or law. Let's read in the first part of his book, the introduction.

Munsterberg1<-readLines("http://psychclassics.yorku.ca/Munster/Witness/introduction.htm")

This object contains his text, but also HTML code, which we want to remove. We can do this with the htmlParse function.

library(XML)
library(RCurl)
## Loading required package: bitops
## 
## Attaching package: 'RCurl'
## The following object is masked from 'package:tidyr':
## 
##     complete
Mdoc<-htmlParse(Munsterberg1, asText=TRUE)
plain.text<-xpathSApply(Mdoc, "//p", xmlValue)
Munsterberg<-data.frame(plain.text)

This separates the text between opening and closing paragraph tags into cells in a data frame. We would need to do more cleaning and work on this file to use it for analysis. Another post for another day!
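
As a small taste of that cleanup, the sketch here runs the same htmlParse and xpathSApply steps on a few made-up lines of HTML, then collapses runs of whitespace (including line breaks inside a paragraph) into single spaces. The HTML is a placeholder, not the actual Munsterberg page.

```r
library(XML)

# Made-up stand-in for the lines returned by readLines() on an HTML page
html_lines <- c(
  "<html><body>",
  "<h1>Title</h1>",
  "<p>First   paragraph",
  " with a line break.</p>",
  "<p>Second <i>paragraph</i>.</p>",
  "</body></html>"
)

# Join the lines into one string, parse it, and pull the text of each <p> element
doc   <- htmlParse(paste(html_lines, collapse = "\n"), asText = TRUE)
paras <- xpathSApply(doc, "//p", xmlValue)

# Collapse whitespace runs to single spaces and trim the ends
paras <- trimws(gsub("\\s+", " ", paras))
paras
```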

[1] There's an R package called gutenbergr that gives you direct access to Project Gutenberg books. If you'd like to analyze some of those works, it's best to start there. As usual, this was purely for demonstration.
