Sunday, January 27, 2019

Statistics Sunday: Creating a Stacked Bar Chart for Rank Data

At work on Friday, I was trying to figure out the best way to display some rank data. What I had were rankings from 1-5 for 10 factors considered most important in a job (such as Salary, Insurance Benefits, and the Opportunity to Learn): each respondent chose and ranked their top 5 from those 10, and the remaining 5 were left unranked by that respondent. Without even thinking about the missing data issue, I computed a mean rank for each factor and called it a day. (Yes, I know that ranks are ordinal and means are for continuous data, but my goal was simply to differentiate the importance of the factors, and a mean seemed the best way to do it.) Of course, then we noticed that one of the factors had a pretty high average rank even though few people had ranked it in the top 5, because the mean only counted the people who ranked it. Oops.
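To make the problem concrete, here is a minimal sketch with made-up numbers for a top-3 ranking task. The two vectors, and the choice to score an unranked factor as 4, are assumptions for illustration only and are not my real data.

```r
# Hypothetical ranks from 10 respondents; NA means the factor was not ranked
salary      <- c(1, 1, 2, 1, 2, 1, 3, 2, NA, NA)
recognition <- c(NA, NA, NA, NA, NA, NA, NA, NA, 1, 1)

# Mean of the observed ranks only: recognition looks like the top factor,
# even though only 2 of 10 people ranked it at all
mean_salary_observed      <- mean(salary, na.rm = TRUE)
mean_recognition_observed <- mean(recognition, na.rm = TRUE)

# One possible adjustment (an assumption): score "not ranked" as 4,
# one step below the lowest real rank
salary_filled      <- ifelse(is.na(salary), 4, salary)
recognition_filled <- ifelse(is.na(recognition), 4, recognition)

mean_salary_filled      <- mean(salary_filled)
mean_recognition_filled <- mean(recognition_filled)

print(c(mean_salary_observed, mean_recognition_observed))
print(c(mean_salary_filled, mean_recognition_filled))
```

With the unranked responses left out, recognition gets the better (lower) mean; once they are scored as 4, salary comes out ahead. The score of 4 is itself arbitrary, which is part of why a chart of the full distribution seemed like a better option.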

So how could I present these results? One idea I had was a stacked bar chart, and it took a bit of data wrangling to do it. That is, the rankings were all in separate variables, but I want them all on the same chart. Basically, I needed to create a dataset with:
  • 1 variable to represent the factor being ranked
  • 1 variable to represent the ranking given (1-5, or 6 that I called "Not Ranked")
  • 1 variable to represent the number of people giving that particular rank to that particular factor

What I ultimately did was run frequencies for the factor variables, turn those frequency tables into data frames, and merge them together with rbind. I then created the chart with ggplot. Here's some code for a simplified example, which only uses 6 factors and asks people to rank the top 3.

First, let's read in our sample dataset - note that these data were generated only for this example and are not real data:

library(tidyverse)
## -- Attaching packages --------------------------------------------------------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.0.0     v purrr   0.2.4
## v tibble  1.4.2     v dplyr   0.7.4
## v tidyr   0.8.0     v stringr 1.3.1
## v readr   1.1.1     v forcats 0.3.0
## Warning: package 'ggplot2' was built under R version 3.5.1
## -- Conflicts ------------------------------------------------------------------------------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
ranks <- read_csv("C:/Users/slocatelli/Desktop/sample_ranks.csv", col_names = TRUE)
## Parsed with column specification:
## cols(
##   RespID = col_integer(),
##   Salary = col_integer(),
##   Recognition = col_integer(),
##   PTO = col_integer(),
##   Insurance = col_integer(),
##   FlexibleHours = col_integer(),
##   OptoLearn = col_integer()
## )

This dataset contains 7 variables - 1 respondent ID and 6 variables with ranks on factors considered important in a job: salary, recognition from employer, paid time off, insurance benefits, flexible scheduling, and opportunity to learn. I want to run frequencies for these variables, and turn those frequency tables into a data frame I can use in ggplot2. I'm sure there are much cleaner ways to do this (and please share in the comments!), but here's one not so pretty way:

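# Tabulate each factor's ranks, then convert the table to a data frame (columns Var1 and Freq)
salary <- as.data.frame(table(ranks$Salary))
salary$Name <- "Salary"
recognition <- as.data.frame(table(ranks$Recognition))
recognition$Name <- "Recognition by \nEmployer"
PTO <- as.data.frame(table(ranks$PTO))
PTO$Name <- "Paid Time Off"
insurance <- as.data.frame(table(ranks$Insurance))
insurance$Name <- "Insurance"
flexible <- as.data.frame(table(ranks$FlexibleHours))
flexible$Name <- "Flexible Schedule"
learn <- as.data.frame(table(ranks$OptoLearn))
learn$Name <- "Opportunity to \nLearn"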

rank_chart <- rbind(salary, recognition, PTO, insurance, flexible, learn)
rank_chart$Var1 <- as.numeric(rank_chart$Var1)
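For what it's worth, here is one shorter route to the same three-column layout, sketched in base R only. The small data frame below is made up so the snippet runs on its own (it assumes "not ranked" is coded as 4), and the column names Name and Rank are my own choices rather than the Var1 used above.

```r
# Hypothetical stand-in for the ranks data: 5 respondents, 3 of the factors,
# with 4 standing in for "Not Ranked"
ranks_small <- data.frame(
  RespID    = 1:5,
  Salary    = c(1, 1, 2, 1, 4),
  PTO       = c(2, 3, 4, 4, 1),
  Insurance = c(3, 2, 1, 2, 2)
)

# stack() turns the wide rank columns into two long columns:
# 'values' (the rank) and 'ind' (the name of the column it came from)
long <- stack(ranks_small[, -1])

# Cross-tabulate factor by rank, then flatten the table to a data frame
# with one row per factor/rank combination and its count in Freq
rank_chart_small <- as.data.frame(table(Name = long$ind, Rank = long$values))

print(rank_chart_small)
```

One difference from the rbind approach: the cross-tabulation keeps factor/rank combinations that nobody chose, with a Freq of 0, which is harmless in a stacked bar chart. Rank also comes back as a factor, not a number, so it would need converting before being used with a continuous fill scale.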

With my not-so-pretty data wrangling, the chart itself is actually pretty easy:

ggplot(rank_chart, aes(fill = Var1, y = Freq, x = Name)) +
  geom_bar(stat = "identity") +
  labs(title = "Ranking of Factors Most Important in a Job") +
  ylab("Frequency") +
  xlab("Job Factors") +
  scale_fill_continuous(name = "Ranking",
                      breaks = c(1:4),
                      labels = c("1","2","3","Not Ranked")) +
  theme_bw()

Based on this chart, we can see the top factor is Salary. Insurance is slightly more important than paid time off, but these are definitely the number 2 and 3 factors. Recognition wasn't ranked by most people, but those who did rank it considered it their #2 factor; ditto for flexible scheduling at #3. Opportunity to learn didn't make the top 3 for most respondents.
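Since the ranks are really categories, another option is to treat the rank as a labeled factor and let ggplot2 pick a discrete fill scale, which avoids mapping "Not Ranked" onto a continuous color gradient. This is only a sketch: the counts below are invented, and the column name Rank is an assumption that differs from the Var1 used in the chart code.

```r
library(ggplot2)

# Hypothetical counts for two factors; rank 4 stands in for "Not Ranked"
chart_data <- data.frame(
  Name = rep(c("Salary", "Insurance"), each = 4),
  Rank = rep(1:4, times = 2),
  Freq = c(6, 2, 1, 1, 1, 4, 3, 2)
)

# Convert the numeric rank to a factor with readable labels
chart_data$Rank <- factor(chart_data$Rank,
                          levels = 1:4,
                          labels = c("1", "2", "3", "Not Ranked"))

# geom_col() is shorthand for geom_bar(stat = "identity");
# with a factor mapped to fill, the legend shows one entry per rank
p <- ggplot(chart_data, aes(x = Name, y = Freq, fill = Rank)) +
  geom_col() +
  labs(title = "Ranking of Factors Most Important in a Job",
       x = "Job Factors", y = "Frequency", fill = "Ranking") +
  theme_bw()

print(p)
```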

Friday, January 25, 2019

Natural Graph

Via Not Awful and Boring, this reddit post discusses a really cool natural graph, measuring the amount of sunlight per day, created with a tree and a magnifying glass:

Apparently, this device is a Campbell-Stokes recorder.

Thursday, January 24, 2019

Long Time, No Write

Wow, it's been way too long since I've posted anything! Lots of life changes recently, including moving to a new place and fighting with Comcast to get internet there. I still need to set up my office, and I plan on doing lots of writing in my new dedicated space. I'm planning on more statistics posts and a few more surprises this year.

Work has also been busy. At the moment:

  • I'm working on three content validation studies, including analyzing data for two job analysis surveys and gearing up for a third
  • I've wrapped up the first phase of analysis on our salary and satisfaction survey, and have some cool analysis planned for phase 2
  • I finished a time study on our national and state exams
  • I'm awaiting feedback on the first draft of a chapter about standard setting I coauthored with some coworkers
  • I'm learning how to be a supervisor, now that I have someone working for me! That's right, I'm no longer a department of one

Once I get through some of the most pressing project work, I'm going to take some of my work time to teach myself data forensics as it applies to testing. In fact, this book has been on my to-read shelf since my annual employee evaluation back in November. Look for blog posts on that!