Sunday, November 25, 2018

Statistics Sunday: Introduction to Regular Expressions

In my last Statistics Sunday post, I briefly mentioned the concept of regular expressions, also known as regex (though note that in some contexts, these refer to different things - see here). A regular expression is a text string, which you ask your program to match. You can use this to look for files with a particular name or extension, or search a corpus of text for a specific word or word(s) that match a certain pattern. This concept is used in many programming languages, and R is no exception. In order to use a regular expression, you need a regular expression engine. Fortunately, R base and many R packages come with a regular expression engine, which you call up with a function. In R base, those functions are things like grep, grepl, regexpr, and so on.

Today, I want to talk about some of the syntax you use to describe different patterns of text characters, which can include letters, digits, white space, and punctuation, starting from the most simple regular expressions. Typing out a specific text string matches only that. In last week's example, I was working with files that all start with the word "exam." So if I had other files in that folder that didn't follow that naming convention, I could look specifically for those by simply typing exam into the pattern attribute of my list.files() function. These files follow exam with a date, formatted as YYYYMMDD, and a letter to delineate each file from that day (a to z). Pretty much all of our files end in a, but in the rare instances where they send two files for a day, that one would end in b. How can I write a regular expression that matches this pattern?

First, the laziest approach, using the * character (which is sometimes called a greedy operator), which tells the regular expression engine to look for any number of characters following a string. For that, I could use this pattern:

list.files(pattern = "exam*")

A less lazy approach that still uses the + greedy operator would be specify the numbers I know will be the same for all files, which would be the year, 2018. In fact, if all of my exam files from every year were in a single folder, I could use that to select just one year:

list.files(pattern = "exam2018*")

Let's say I want all files from October 2018. I could do this:

list.files(pattern = "exam201810*")

That might be as fancy as I need to get for my approach.

We can get more specific by putting information into brackets, such as [a-z] to tell the program to match something with a lowercase letter in that range or [aA] to tell the program to look for something with either lowercase or uppercase a. For example, what if I wanted to find every instance of the word "witch" in The Wizard of Oz? Sometimes, the characters refer to a specific witch (Good Witch, Bad Witch, and so on) or to the concept of being a witch (usually lowercase, such as "I am a good witch"). I could download The Wizard of Oz through Project Gutenberg (see this post for how to use the guternbergr package; the ID for The Wizard of Oz is 55), then run some text analysis to look for any instance of Witch or witch:

witch <- WoO[grep("[Ww]itch", WoO$text),]

Here's what the beginning of the witch dataset looks like:

> witch
# A tibble: 141 x 2
   gutenberg_id text     
         <int> <chr>                                                                   
1           55 "  12.  The Search for the Wicked Witch"                                 
2           55 "  23.  Glinda The Good Witch Grants Dorothy's Wish"                     
3           55 We are so grateful to you for having killed the Wicked Witch of the     
4           55 killed the Wicked Witch of the East?  Dorothy was an innocent, harmless 
5           55 "\"She was the Wicked Witch of the East, as I said,\" answered the little"
6           55 " where the Wicked Witch ruled.\""   
7           55 When they saw the Witch of the East was dead the Munchkins sent a swift   
 8           55 "messenger to me, and I came at once.  I am the Witch of the North.\""   
 9           55 "\"Oh, gracious!\" cried Dorothy.  \"Are you a real witch?\""             
10           55 "\"Yes, indeed,\" answered the little woman.  \"But I am a good witch, and"
# ... with 131 more rows

I'm working on putting together a more detailed post (or posts) about regular expressions, including more complex examples and the components of a regular expression, so check back for that soon!

No comments:

Post a Comment