Thursday, June 25, 2020

Flying Saucers and Bright Lights: A Data Visualization

UFO Sightings by Shape and Year

Early last week, I taught part 2 of a course on using R and tidyverse for my work colleagues. I wanted a fun dataset to use as an example for coding exercises throughout. There was really only one choice.

I found this great dataset through kaggle.com - UFO sightings reported to the National UFO Reporting Center (NUFORC) through 2014. This dataset gave lots of variables we could play around with, and I'd like to use it in a future session with my colleagues to talk about the process of cleaning data.

If you're interested in learning more about R and tidyverse, you can access my slides from the sessions here. (We stopped at filtering and picked up there for part 2, so everything is in one PowerPoint file.)

While working with the dataset to plan my learning sessions, I started playing around and thought it would be fun to show the various shapes of UFOs reported over time, to see if there were any shifts. Spoiler: There were. But I needed to clean the data a bit first.

setwd("~/Downloads/UFO Data")
library(tidyverse)
## -- Attaching packages ------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.1     v purrr   0.3.4
## v tibble  3.0.1     v dplyr   1.0.0
## v tidyr   1.1.0     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0
## -- Conflicts ---------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
options(scipen = 999)

UFOs <- read_csv("UFOsightings.csv", col_names = TRUE)
## Parsed with column specification:
## cols(
##   datetime = col_character(),
##   city = col_character(),
##   state = col_character(),
##   country = col_character(),
##   shape = col_character(),
##   `duration (seconds)` = col_double(),
##   `duration (hours/min)` = col_character(),
##   comments = col_character(),
##   `date posted` = col_character(),
##   latitude = col_double(),
##   longitude = col_double()
## )
## Warning: 4 parsing failures.
##   row                col               expected   actual               file
## 27823 duration (seconds) no trailing characters `        'UFOsightings.csv'
## 35693 duration (seconds) no trailing characters `        'UFOsightings.csv'
## 43783 latitude           no trailing characters q.200088 'UFOsightings.csv'
## 58592 duration (seconds) no trailing characters `        'UFOsightings.csv'
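
If you want to look more closely at parsing failures like the ones reported above, readr's problems() function returns them as a tibble. Here's a minimal sketch on a made-up three-row file (the file contents and the names tmp, demo, and parse_issues are just for illustration, not the real UFO data):

```r
library(readr)

# Hypothetical mini-file: the second data row has a non-numeric duration,
# loosely imitating the stray characters reported in the warning above.
tmp <- tempfile(fileext = ".csv")
writeLines(c("shape,duration",
             "circle,20",
             "light,abc",
             "disk,300"), tmp)

# suppressWarnings() only hides the console warning; the failures are still recorded
demo <- suppressWarnings(
  read_csv(tmp, col_types = cols(shape = col_character(),
                                 duration = col_double()))
)

# problems() returns one row per value that failed to parse
parse_issues <- problems(demo)
parse_issues

# The offending value is read in as NA rather than stopping the import
demo$duration
```

With only 4 failures across the whole file, I didn't bother fixing them for this chart, but this is where I'd start in a data cleaning session.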

There are 30 shapes represented in the data. That's a lot to show in a single figure.

UFOs %>%
  summarise(shapes = n_distinct(shape))
## # A tibble: 1 x 1
##   shapes
##    <int>
## 1     30

If we look at the different shapes in the data, we can see some overlap, as well as shapes with low counts.

UFOs %>%
  group_by(shape) %>%
  summarise(count = n())
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 30 x 2
##    shape    count
##    <chr>    <int>
##  1 changed      1
##  2 changing  1962
##  3 chevron    952
##  4 cigar     2057
##  5 circle    7608
##  6 cone       316
##  7 crescent     2
##  8 cross      233
##  9 cylinder  1283
## 10 delta        7
## # ... with 20 more rows
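
Since the default printout is alphabetical and only shows 10 rows, sorting by count makes the rare shapes easier to spot. A quick sketch using dplyr's count(), which is shorthand for group_by() plus summarise(n()). The toy tibble below is made up for illustration and does not use the real counts:

```r
library(dplyr)

# Made-up stand-in for the shape column (not the real counts)
toy <- tibble(shape = c("light", "light", "light", "circle", "circle",
                        "changed", NA))

# sort = TRUE puts the most common shapes first; name sets the count column's name
shape_counts <- toy %>%
  count(shape, sort = TRUE, name = "count")

# print(n = Inf) shows every row instead of just the first 10
print(shape_counts, n = Inf)
```

Note that blank shapes (NA) get their own row in the counts.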

For instance, "changed" appears in only one record, while "changing" appears in 1,962 records; the two should clearly be grouped together. After inspecting all the shapes, I identified the following categories that accounted for most of the different shapes:

  • changing, which includes both changed and changing
  • circles, like disks, domes, and spheres
  • triangles, like deltas, pyramids, and triangles
  • four or more sided: rectangles, diamonds, and chevrons
  • light, which counts things like flares, fireballs, and lights

I also made an "other" category for shapes with very low counts that didn't seem to fit in the categories above, like crescents, teardrops, and formations with no further specification of shape. Finally, shape was blank for some records, so I made an "unknown" category. Here's the code I used to recategorize shape.

changing <- c("changed", "changing")
circles <- c("circle", "disk", "dome", "egg", "oval","round", "sphere")
triangles <- c("cone","delta","pyramid","triangle")
fourormore <- c("chevron","cross","diamond","hexagon","rectangle")
light <- c("fireball","flare","flash","light")
other <- c("cigar","cylinder","crescent","formation","other","teardrop")
unknown <- c("unknown", 'NA')

UFOs <- UFOs %>%
  mutate(shape2 = case_when(
    shape %in% changing   ~ "changing",
    shape %in% circles    ~ "circular",
    shape %in% triangles  ~ "triangular",
    shape %in% fourormore ~ "four+-sided",
    shape %in% light      ~ "light",
    shape %in% other      ~ "other",
    TRUE                  ~ "unknown"  # everything else, including blank (NA) shapes
  ))

My biggest question mark was cigar and cylinder. They're not really circles, nor do they fall in the four or more sided category. I could have created another category called "tubes," but ultimately I just put them in "other." Using the code above as an example, you could see what happens to the chart if you put them in another category or create one of their own.
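
If you want to try the "tubes" idea, here's one way it might look. This is only a sketch: the tubes and other_small vectors, the shape3 column, and the toy tibble are hypothetical names I'm introducing for the example, and the other vectors are copied from the recoding step above.

```r
library(dplyr)

changing   <- c("changed", "changing")
circles    <- c("circle", "disk", "dome", "egg", "oval", "round", "sphere")
triangles  <- c("cone", "delta", "pyramid", "triangle")
fourormore <- c("chevron", "cross", "diamond", "hexagon", "rectangle")
light      <- c("fireball", "flare", "flash", "light")

# Hypothetical split of the original "other" group
tubes       <- c("cigar", "cylinder")
other_small <- c("crescent", "formation", "other", "teardrop")

# Toy rows standing in for the real data
toy <- tibble(shape = c("cigar", "cylinder", "teardrop", "disk", NA))

toy <- toy %>%
  mutate(shape3 = case_when(
    shape %in% changing    ~ "changing",
    shape %in% circles     ~ "circular",
    shape %in% triangles   ~ "triangular",
    shape %in% fourormore  ~ "four+-sided",
    shape %in% light       ~ "light",
    shape %in% tubes       ~ "tubes",
    shape %in% other_small ~ "other",
    TRUE                   ~ "unknown"  # blank (NA) or unlisted shapes
  ))

toy
```

Swapping shape3 in for shape2 in the rest of the code would then give "tubes" its own line on the chart.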

For the chart, I dropped the unknowns.

UFOs <- UFOs %>%
  filter(shape2 != "unknown")

Now, to plot shapes over time, I need to extract date information. The "datetime" variable is currently a character, so I have to convert that to a date. I then pulled out year, so that each point on my figure was the count of that shape observed during a given year.

library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
UFOs <- UFOs %>%
  mutate(Date2 = as.Date(datetime, format = "%m/%d/%Y"),
         Year = year(Date2))
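
To see what that conversion is doing, here's a small sketch on two made-up strings. I'm assuming the datetime values look like month/day/year followed by a time, which is what the format string implies; the specific values and the names stamps, dates, years, and bad are invented for the example.

```r
library(lubridate)

# Made-up strings in month/day/year form with a trailing time
stamps <- c("10/10/1949 20:30", "10/11/2006 24:00")

# as.Date() with an explicit format reads only the date part;
# characters after the matched format are ignored
dates <- as.Date(stamps, format = "%m/%d/%Y")
dates

# year() from lubridate pulls out the year as a number
years <- year(dates)
years

# A string that doesn't match the format becomes NA instead of raising an error
bad <- as.Date("1949-10-10", format = "%m/%d/%Y")
bad
```

Because the time portion is never parsed, odd time values don't get in the way, which is handy when all you need is the year.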

Now we have all the information we need to plot shapes over time, to see if there have been changes. We'll create a summary dataframe by Year and shape2, then create a line chart with that information.

Years <- UFOs %>%
  group_by(Year, shape2) %>%
  summarise(count = n())
## `summarise()` regrouping output by 'Year' (override with `.groups` argument)
library(scales)
## 
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
## 
##     discard
## The following object is masked from 'package:readr':
## 
##     col_factor
library(ggthemes)

Years %>%
  ggplot(aes(Year, count, color = shape2)) +
  geom_point() +
  geom_line() +
  scale_x_continuous(breaks = seq(1910,2020,10)) +
  scale_y_continuous(breaks = seq(0,3000,500), labels = comma) +
  labs(color = "Object Shape", title = "From Flying Saucers to Bright Lights:\nSightings of UFO Shapes Over Time") +
  ylab("Number of Sightings") +
  theme_economist_white() +
  scale_color_tableau() +
  theme(plot.title = element_text(hjust = 0.5))

Until the mid-90s, the most commonly seen UFO was circular. After that, light shapes became much more common. I'm wondering if this could be explained in part by UFOs in pop culture, moving from the flying saucers of earlier sci-fi to the bright lights without discernible shape in more recent sci-fi. The third most common shape is our "other" category, which suggests we might want to rethink that one. It could be that some of the shapes within that category are common enough to warrant their own category, while reserving "other" for those that don't have a good category of their own. Cigar and cylinder, for instance, have high counts and could be put in their own category. Feel free to play around with the data and see what you come up with!
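
If you'd rather check the crossover in a table than eyeball it from the chart, one option is to pull the top shape for each year with dplyr's slice_max(). The sketch below uses a toy summary with made-up counts (not taken from the NUFORC data) in the same Year, shape2, count layout as the Years dataframe; toy_years and top_by_year are names I made up for the example.

```r
library(dplyr)

# Made-up counts for illustration only; not taken from the NUFORC data
toy_years <- tibble(
  Year   = c(1990, 1990, 1995, 1995, 2000, 2000),
  shape2 = rep(c("circular", "light"), times = 3),
  count  = c(50, 20, 80, 85, 150, 400)
)

# Keep the row with the highest count within each year
top_by_year <- toy_years %>%
  group_by(Year) %>%
  slice_max(count, n = 1) %>%
  ungroup()

top_by_year
```

Running the same pipeline on Years should show the first year in which "light" overtakes "circular" (ties, if any, would show up as extra rows for that year).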

Wednesday, June 10, 2020

Space Force: A Review

I've continued to work from home during our shelter-in-place (something my boss recently told me we'll be doing for a while). During my copious downtime, I've gotten to watch a lot of things I've had on my watch-list, including the Netflix original series, Space Force.


I've made my way through season 1 of the series, and thoroughly enjoyed it. I was surprised to learn - partway through watching - that critics did not enjoy the series nearly as much as I did. I'll get to that shortly.

I loved the political satire element of the show, and that it was inspired by statements by our buffoon of a president. And I loved the periodic texts and tweets they referenced from a character they only referred to as "POTUS" (although we all know who they mean). But really, I felt the critics were expecting something very different from what the show gave us, and that is the reason for their negative reviews.

While the concept is hilarious, and Mark Naird (Steve Carell's character) is often a buffoon, the show is really a family drama framed by absurdist comedy. General Naird is a single father of a teenage daughter (Diana Silvers, from Booksmart, which I also thoroughly enjoyed), after his wife (played brilliantly by Lisa Kudrow) is imprisoned for an unmentioned crime (which earned her 40-60 years, so clearly really bad). The show deals with a variety of family issues, not just the aforementioned single parenthood, but also teenage rebellion and substance abuse, fear of abandonment, and a parent who often feels married to their job. It dealt with the concept of an open marriage in a way that was authentic, while also being heartbreaking and funny at the same time. The show made me cry just as often as it made me laugh, and I could often relate to Mark's character - his heartbreak when his wife suggested an open marriage was so real, I bawled. It poked fun at the full political spectrum, as well as at Boomers, X-ers, and Millennials alike.

I think a lot of people were expecting Michael Scott as a general. But Mark Naird, though often a goof who really didn't understand science (an important part of his job, personified by his chief scientist, played so wonderfully by John Malkovich: better casting does not exist), showed a surprising depth and understanding of people, in ways that both surprised his scientists and confirmed their conclusions. Michael Scott seemed oblivious to the people who worked for him and showed zero understanding of people skills, while Mark Naird thought first and foremost about the people, and spoke eloquently on the topic.

I especially loved the character of Captain Ali (played by Tawny Newsome) and look forward (hopefully) to learning more about her character. Of all the characters on the show, she's my favorite.

It was also a joy to see Fred Willard as Mark's elderly father; he has passed away since filming the role. He will very much be missed, and I'll be interested in seeing how they deal with the actor's death (since season 2 has not even been greenlit, let alone filmed). My only complaint was with the cheap jokes at Mark's elderly mother's expense, including at one point showing the caretaker giving her CPR while Mark's father obliviously (and jovially) spoke on the phone. Mark's mother obviously has both lung (due to her being on oxygen) and heart (due to the CPR) issues, and as the daughter of a man with similar issues, I wish a show with so much heart had been more delicate with these conditions, rather than using them for cheap laughs.

My only disappointment with Space Force (other than my complaint above) is with the critics' reaction to it. I sincerely hope there is a season 2.