Deeply Trivial: Statistics Sunday: Using Text Analysis to Become a Better Writer

Sunday, August 19, 2018

Statistics Sunday: Using Text Analysis to Become a Better Writer

We all have words we love to use, and perhaps use too much. For example, I tend to lean on the same transitional words, to the point that, before I submit a manuscript, I run a "find all" to see how many times I've used some of my favorites, e.g., additionally, though, and so on.

I'm sure we all have our own words we use way too often.

Text analysis can be used to discover patterns in writing, and for a writer, it may help reveal when we depend too much on certain words and phrases. For today's demonstration, I read in my (still in-progress) novel, a murder mystery called Killing Mr. Johnson, and ran the same type of text analysis I've been demonstrating in recent posts.

To make things easier, I copied the document into a text file, and used the read_lines and tibble functions to prepare data for my analysis.

setwd("~/Dropbox/Writing/Killing Mr. Johnson")

library(tidyverse)

KMJ_text <- read_lines('KMJ_full.txt')

KMJ <- tibble(KMJ_text) %>%
  mutate(linenumber = row_number())

I kept my line numbers, which I could use in some future analysis. For now, I'm going to tokenize my data, drop stop words, and examine my most frequently used words.

library(tidytext)
KMJ_words <- KMJ %>%
  unnest_tokens(word, KMJ_text) %>%
  anti_join(stop_words)

## Joining, by = "word"
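
To see what the tokenizing and stop-word steps do on a small scale, here is a minimal sketch on a made-up two-line "manuscript." The toy sentences and the names toy, toy_text, and toy_words are hypothetical, chosen only for illustration; the functions are the same dplyr and tidytext calls used above.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Hypothetical two-line text for illustration; not taken from the novel
toy <- tibble(toy_text = c("Emily finally opened the door.",
                           "The body was in the hallway!")) %>%
  mutate(linenumber = row_number())

toy_words <- toy %>%
  # One row per word; words are lowercased and punctuation is stripped
  unnest_tokens(word, toy_text) %>%
  # Drop common words (such as "the") found in tidytext's stop_words
  anti_join(stop_words, by = "word")

# Count what is left, most frequent first
toy_words %>%
  count(word, sort = TRUE)
```

Each remaining word keeps the linenumber of the line it came from, which is why that column can be useful later. Passing by = "word" to anti_join is optional; it only silences the "Joining" message.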

KMJ_words %>%
  count(word, sort = TRUE) %>%
  filter(n > 75) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() + xlab(NULL) + coord_flip()

Fortunately, my top 5 words are the names of the 5 main characters, with the star character at number 1: Emily is named almost 600 times in the book. It's a murder mystery, so I'm not too surprised that words like "body" and "death" are also common. But I know that, in my fiction writing, I often depend on a word type that draws a lot of disdain from authors I admire: adverbs. Not all adverbs, mind you, but specifically (pun intended) the "-ly adverbs."

ly_words <- KMJ_words %>%
  filter(str_detect(word, ".ly")) %>%
  count(word, sort = TRUE)

head(ly_words)

## # A tibble: 6 x 2
##   word         n
##   <chr>    <int>
## 1 emily      599
## 2 finally     80
## 3 quickly     60
## 4 emily’s     53
## 5 suddenly    39
## 6 quietly     38

Since my main character is named Emily, she was accidentally picked up by my string detect function: the pattern ".ly" matches any character followed by "ly", anywhere in a word. A few other top words that aren't actually -ly adverbs also pop up in the list. I'll filter those out, then take a look at what I have left.

filter_out <- c("emily", "emily's", "emily’s","family", "reply", "holy")

ly_words <- ly_words %>%
  filter(!word %in% filter_out)

ly_words %>%
  filter(n > 10) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() + xlab(NULL) + coord_flip()
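
A side note on the pattern itself: in a regular expression, "." stands for any single character, so ".ly" also matches words that merely contain "ly" somewhere in the middle. A possible refinement, which is not what produced the results in this post, is to anchor the pattern to the end of the word with "$". The word list below is hypothetical and only meant to show the difference.

```r
library(stringr)

# Hypothetical word list for illustration; not drawn from the novel
words <- c("quickly", "emily", "emily's", "flying", "family", "door")

# Unanchored: any character followed by "ly", anywhere in the word
unanchored <- words[str_detect(words, ".ly")]
unanchored
# "quickly" "emily" "emily's" "flying" "family"

# Anchored: the word must end in "ly"
anchored <- words[str_detect(words, "ly$")]
anchored
# "quickly" "emily" "family"
```

Anchoring drops possessives like "emily's" and words like "flying", but names and nouns that happen to end in "ly" still get through, so a manual exclusion list like filter_out would still be needed.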

I use "finally", "quickly", and "suddenly" far too often. "Quietly" is also up there. I think the reason so many writers hate on adverbs is that they can encourage lazy writing. You might write that someone said something quietly or softly, but is there a better word? Did they whisper? Mutter? Murmur? Hiss? Did someone "move quickly" or did they do something else - run, sprint, dash?

At the same time, sometimes adverbs are necessary. I mean, can I think of a complete sentence that includes only an adverb? Definitely. Still, it might become tedious if I keep depending on the same words over and over, and when a fiction book (or really any kind of writing) is tedious, we often give up. These results give me some things to think about as I edit.

Still have some big plans on the horizon, including some new statistics videos, a redesigned blog, and more surprises later! Thanks for reading!