Sunday, May 6, 2018

Statistics Sunday: Tokenizing Text

I recently started working my way through Text Mining with R: A Tidy Approach by Julia Silge and David Robinson. There are many interesting ways text analysis can be used, not only for research but also for marketing and business insights. Today, I'd like to introduce one of the basic concepts necessary for conducting text analysis: tokens.

In text analysis, a token is any meaningful unit of text that can be analyzed. Frequently, the token is a word: tokenizing by word splits the text into individual words, which can then be counted to see how often each appears. But a token could also be a phrase (such as each two-word combination present in a text, called a bigram), a sentence, a paragraph, or even a whole chapter. The size of the token you choose shapes the kinds of analysis you can do. Generally, people choose smaller tokens, like words.
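As a quick illustration, here's a minimal sketch, using the tidytext package introduced later in this post, that tokenizes a made-up sentence first by word and then by bigram. The sentence and the toy object name are just for demonstration:

library(dplyr)
library(tidytext)

toy <- tibble(text = "The Martians have landed near the road")

# One row per word
toy %>%
  unnest_tokens(word, text)

# One row per two-word phrase (bigram)
toy %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)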

Let's use R to download the text of a classic book (which I did previously in this post, but today, I'll do it in an easier way) and tokenize it by word.

Any text available in the Project Gutenberg repository can be downloaded, with header and footer information stripped out, using the gutenbergr package.

install.packages("gutenbergr")
library(gutenbergr)

The package comes with a dataset, gutenberg_metadata, that lists every available text by ID. Let's use The War of the Worlds by H.G. Wells as our target book. We can find it like this:

library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 2.2.1     ✔ purrr   0.2.4
## ✔ tibble  1.4.2     ✔ dplyr   0.7.4
## ✔ tidyr   0.8.0     ✔ stringr 1.3.0
## ✔ readr   1.1.1     ✔ forcats 0.3.0
## ── Conflicts ───────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
gutenberg_metadata %>%
  filter(title == "The War of the Worlds")
## # A tibble: 3 x 8
##   gutenberg_id title   author   gutenberg_autho… language gutenberg_books…
##          <int> <chr>   <chr>               <int> <chr>    <chr>           
## 1           36 The Wa… Wells, …               30 en       Movie Books/Sci…
## 2         8976 The Wa… Wells, …               30 en       Movie Books/Sci…
## 3        26291 The Wa… Wells, …               30 en       <NA>            
## # ... with 2 more variables: rights <chr>, has_text <lgl>
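As an aside, gutenbergr also offers gutenberg_works(), a convenience function that applies your filter to the metadata while restricting the results, by default, to English-language works that have text available. A one-line sketch that should return the same book:

# Like filtering gutenberg_metadata, but limited to downloadable English works
gutenberg_works(title == "The War of the Worlds")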

The ID for The War of the Worlds is 36. Now I can use that ID to download the text of the book into a data frame with the gutenbergr function gutenberg_download.

warofworlds <- gutenberg_download(36)
## Determining mirror for Project Gutenberg from http://www.gutenberg.org/robot/harvest
## Using mirror http://aleph.gutenberg.org

Now I have a dataset with two variables: one containing the Project Gutenberg ID for the text (which is helpful if you create a dataset with multiple texts, perhaps all by the same author or within the same genre) and one containing the text itself, one line per row. To tokenize our dataset, we need the R package tidytext.
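Incidentally, gutenberg_download() also accepts a vector of IDs, so building a multi-text dataset takes a single call. A brief sketch, using the second edition ID (8976) from the metadata above; the object name is just for illustration:

# The gutenberg_id column then distinguishes the two texts
twoeditions <- gutenberg_download(c(36, 8976))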

install.packages("tidytext")
library(tidytext)

We can tokenize with the function unnest_tokens: the first argument names the new column that will hold the tokens (word), and the second names the column to look in for the text (text). By default, unnest_tokens also lowercases the tokens and strips punctuation.

tidywow <- warofworlds %>%
  unnest_tokens(word, text)

Now we have a dataset with each word from the book, one per row, in order of appearance. There are duplicates in here, because I haven't yet told R to count the words. Before I do that, I probably want to tell R to ignore extremely common words, like "the," "and," and "to." In text analysis, these are called stop words, and tidytext comes with a dataset, stop_words, that can be used to drop stop words from your text data.

tidywow <- tidywow %>%
  anti_join(stop_words)
## Joining, by = "word"
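Since stop_words is just a data frame with a word column and a lexicon column, you can also extend it with domain-specific words of your own before the anti_join. A minimal sketch; the added words and the mystops name are made up for illustration:

# Append hypothetical custom stop words to the built-in list
mystops <- stop_words %>%
  bind_rows(tibble(word = c("chapter", "book"), lexicon = "custom"))

tidywow <- tidywow %>%
  anti_join(mystops, by = "word")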

Last, we have R count up the words.

wowtokens <- tidywow %>%
  count(word, sort = TRUE)
head(wowtokens)
## # A tibble: 6 x 2
##   word         n
##   <chr>    <int>
## 1 martians   163
## 2 people     159
## 3 black      122
## 4 time       121
## 5 road       104
## 6 night      102

After removing stop words, of which there may be hundreds or thousands of occurrences in any text, the most common words are: Martians, people, black, time, road, and night.
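A natural next step is to visualize these counts. Here's a minimal sketch using ggplot2 (already attached with the tidyverse above) to chart the ten most frequent words:

# Bar chart of the ten most common words, most frequent at the top
wowtokens %>%
  top_n(10, n) %>%
  ggplot(aes(x = reorder(word, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Word count")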
