Deeply Trivial: Statistics Sunday: Creating a Stacked Bar Chart for Rank Data

Sunday, January 27, 2019

Statistics Sunday: Creating a Stacked Bar Chart for Rank Data

Stacked Bar Chart for Rank Data At work on Friday, I was trying to figure out the best way to display some rank data. What I had were rankings from 1-5 for 10 factors considered most important in a job (such as Salary, Insurance Benefits, and the Opportunity to Learn), meaning each respondent chose and ranked the top 5 from those 10, and the remaining 5 were unranked by that respondent. Without even thinking about the missing data issue, I computed a mean rank and called it a day. (Yes, I know that ranks are ordinal and means are for continuous data, but my goal was simply to differentiate importance of the factors and a mean seemed the best way to do it.) Of course, then we noticed one of the factors had a pretty high average rank, even though few people ranked it in the top 5. Oops.

So how could I present these results? One idea I had was a stacked bar chart, and it took a bit of data wrangling to do it. That is, the rankings were all in separate variables, but I want them all on the same chart. Basically, I needed to create a dataset with:

1 variable to represent the ranking given (1-5, or 6 that I called "Not Ranked")
1 variable to represent the number of people giving that particular rank that particular factor

What I ultimately did was run frequencies for the factor variables, turn those frequency tables into data frames, and merged them together with rbind. I then created chart with ggplot. Here's some code for a simplified example, which only uses 6 factors and asks people to rank the top 3.

First, let's read in our sample dataset - note that these data were generated only for this example and are not real data:

library(tidyverse)

## -- Attaching packages --------------------------------------------------------------------------------------------------------------------- tidyverse 1.2.1 --

## v ggplot2 3.0.0     v purrr   0.2.4
## v tibble  1.4.2     v dplyr   0.7.4
## v tidyr   0.8.0     v stringr 1.3.1
## v readr   1.1.1     v forcats 0.3.0

## Warning: package 'ggplot2' was built under R version 3.5.1

## -- Conflicts ------------------------------------------------------------------------------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

ranks <- read_csv("C:/Users/slocatelli/Desktop/sample_ranks.csv", col_names = TRUE)

## Parsed with column specification:
## cols(
##   RespID = col_integer(),
##   Salary = col_integer(),
##   Recognition = col_integer(),
##   PTO = col_integer(),
##   Insurance = col_integer(),
##   FlexibleHours = col_integer(),
##   OptoLearn = col_integer()
## )

This dataset contains 7 variables - 1 respondent ID and 6 variables with ranks on factors considered important in a job: salary, recognition from employer, paid time off, insurance benefits, flexible scheduling, and opportunity to learn. I want to run frequencies for these variables, and turn those frequency tables into a data frame I can use in ggplot2. I'm sure there are much cleaner ways to do this (and please share in the comments!), but here's one not so pretty way:

salary <- as.data.frame(table(ranks$Salary))
salary$Name <- "Salary"
recognition <- as.data.frame(table(ranks$Recognition))
recognition$Name <- "Recognition by \nEmployer"
PTO <- as.data.frame(table(ranks$PTO))
PTO$Name <- "Paid Time Off"
insurance <- as.data.frame(table(ranks$Insurance))
insurance$Name <- "Insurance"
flexible <- as.data.frame(table(ranks$FlexibleHours))
flexible$Name <- "Flexible Schedule"
learn <- as.data.frame(table(ranks$OptoLearn))
learn$Name <- "Opportunity to \nLearn"

rank_chart <- rbind(salary, recognition, PTO, insurance, flexible, learn)
rank_chart$Var1 <- as.numeric(rank_chart$Var1)

With my not-so-pretty data wrangling, the chart itself is actually pretty easy:

ggplot(rank_chart, aes(fill = Var1, y = Freq, x = Name)) +
  geom_bar(stat = "identity") +
  labs(title = "Ranking of Factors Most Important in a Job") +
  ylab("Frequency") +
  xlab("Job Factors") +
  scale_fill_continuous(name = "Ranking",
                      breaks = c(1:4),
                      labels = c("1","2","3","Not Ranked")) +
  theme_bw() +
  theme(plot.title=element_text(hjust=0.5))

Based on this chart, we can see the top factor is Salary. Insurance is slightly more important than paid time off, but these are definitely the top 2 and 3 factors. Recognition wasn't ranked by most people, but those who did considered it their #2 factor; ditto for flexible scheduling at #3. Opportunity to learn didn't make the top 3 for most respondents.

5 comments:

ChuckJanuary 28, 2019 at 8:39 AM
Nice article! Perhaps a more efficient way to proceed?

library(tidyverse)

Use tidyr::gather to move from wide to long. It does most of the work for you.

xxx <- read_csv("sample_ranks.csv", col_names = TRUE)
xxxx <- gather(xxx, JobFactor, Ranking, -RespID)
str(xxxx)

Looks like you have no trouble with ggplot2 but sometimes it's easier to use someone
elses function... :-) just added this function to a package I work on

devtools::install_github(
repo = "IndrajeetPatil/ggstatsplot", # package path on GitHub
dependencies = TRUE, # installs packages which ggstatsplot depends on
upgrade_dependencies = TRUE # updates any out of date dependencies
)

library(ggstatsplot)
ggbarstats(data = xxxx,
main = Ranking,
condition = JobFactor,
labels.legend = c("Not Ranked", "Third", "Second", "First"),
bar.proptest = FALSE,
title = "Ranking of Factors Most Important in a Job",
xlab = "Job Factors"
)

That get you pretty close. But we can make better titles...

xxxx$JobFactor <- case_when(
xxxx$JobFactor == "Recognition" ~ "Recognition \nby Employer",
xxxx$JobFactor == "PTO" ~ "Paid Time Off",
xxxx$JobFactor == "FlexibleHours" ~ "Flexible \nSchedule",
xxxx$JobFactor == "OptoLearn" ~ "Opportunity \nto Learn",
TRUE ~ as.character(xxxx$JobFactor)
)
ggbarstats(data = xxxx,
main = Ranking,
condition = JobFactor,
labels.legend = c("Not Ranked", "Third", "Second", "First"),
bar.proptest = FALSE,
title = "Ranking of Factors Most Important in a Job",
xlab = "Job Factors",
x.axis.orientation = "slant",
ggplot.component = ggplot2::labs(subtitle = NULL)
)
ReplyDelete
Replies
MarcJanuary 28, 2019 at 8:58 AM
I like the plot, the next time I'm creating some monstrous barplot with multiple facets I'll hopefully remember this.

I used the R code below to produce the data frame for the plot, whether this counts as "cleaner" I'm not sure.

# Tabulate the number of occurrences of each job factor and ranking
rank_chart <- tidyr::gather(ranks, Name, Var1, -RespID) %>%
dplyr::group_by(Name, Var1) %>%
dplyr::summarise(Freq = n()) %>%
dplyr::ungroup()

# Fix up the display of Job Factor levels
rank_chart <- dplyr::mutate(rank_chart, Name = dplyr::case_when(
Name == "Recognition" ~ "Recognition by \nEmployer",
Name == "PTO" ~ "Paid Time Off",
Name == "FlexibleHours" ~ "Flexible Schedule",
Name == "OpttoLearn" ~ "Opportunity to \nLearn",
TRUE ~ Name
))
ReplyDelete
Replies
UnknownJanuary 28, 2019 at 10:18 AM
Is there any reason that you used scale_fill_continuous? I would have replaced `fill = Var1` with `fill = factor(Var1)`, then used something like scale_brewer_continuous(type = "seq"). The continuous scale for categorical data just irks me.
ReplyDelete
Replies
MariaJanuary 30, 2019 at 7:14 AM
This comment has been removed by a blog administrator.
ReplyDelete
Replies
Bernardo LaresJanuary 31, 2019 at 1:19 PM
Hope this helps for further plots and function:

library(lares)
library(tidyverse)

ranks <- read.csv("~/Desktop/sample_ranks.csv")

plot_survey_rank <- function(answers, ignore = c(1)){
if (length(ignore) > 0) {
answers <- answers[-ignore]
}
p <- gather(answers) %>% freqs(key, value) %>%
ggplot(aes(x = key, y = n, fill = as.integer(value), group = key, label = p)) +
geom_bar(stat = "identity") + theme_minimal() +
geom_text(position = position_stack(vjust = .5), colour = "white", size = 2.7) +
coord_flip() +
labs(title = "Survey Results",
subtitle = "Answers for the n question",
x = "", y = "[%]", fill = "",
caption = paste("Obs.:", nrow(answers)))
return(p)
}

plot_survey_rank(ranks)
ReplyDelete
Replies

Add comment