library(tidyverse)  # read_csv(), mutate(), tibble(), ggplot2
library(scales)     # percent labels on the axis
library(magrittr)   # %$% exposition pipe used further down
reads2019 <- read_csv("~/Downloads/Blogging A to Z/SaraReads2019_allrated.csv", col_names = TRUE)
reads2019 <- reads2019 %>% mutate(perpage = Pages/sum(Pages))
reads2019 %>% ggplot(aes(perpage)) + geom_histogram() + scale_x_continuous(labels = percent, breaks = seq(0,.05,.005)) + xlab("Percentage of Total Pages Read") + ylab("Books")
This post also seems like a great opportunity to hop on my statistical high horse and talk about the difference between a histogram and a bar chart. Why is this important? With everything going on in the world - pandemics, political elections, etc. - I've seen lots of comments on others' intelligence, many of which show a misunderstanding of the most well-known histogram: the standard normal curve. You see, raw data, even from a huge number of people and even on a standardized test like a cognitive ability (aka IQ) test, is never as clean or pretty as it appears in a histogram.
Histograms use a process called "binning", where ranges of scores are combined to form one of the bars. The bins can be made bigger (including a larger range of scores) or smaller, and smaller bins will start showing the jagged nature of most data, even so-called normally distributed data.
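To make binning concrete, here is a minimal sketch in base R that bins the same values two ways. The `scores` vector is invented purely for illustration (it is not from the reading data), and `cut()` plus `table()` stand in for what a histogram does before it draws the bars.

```r
# Hypothetical scores, made up for this illustration only
scores <- c(71, 85, 88, 92, 95, 97, 99, 100, 101, 103, 104, 108, 111, 115, 127)

# Wide bins: each bar would cover a 20-point range
wide <- table(cut(scores, breaks = seq(60, 140, by = 20)))

# Narrow bins: each bar would cover a 5-point range
narrow <- table(cut(scores, breaks = seq(60, 140, by = 5)))

print(wide)    # 4 bins
print(narrow)  # 16 bins
```

With the wide bins the counts rise and fall in one smooth hump. With the narrow bins the same 15 values leave several bins empty and the steps uneven, which is the jaggedness described above.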
As one example, here's what my percent figure looks like as a bar chart instead of the histogram above.
reads2019 %>% ggplot(aes(perpage)) + geom_bar() + scale_x_continuous(labels = percent, breaks = seq(0,.05,.005)) + xlab("Percentage of Total Pages Read") + ylab("Books")
Now, my reading dataset is small - only 87 observations. What happens if I generate a large, random dataset?
set.seed(42)
test <- tibble(ID = c(1:10000), value = rnorm(10000))
test %>% ggplot(aes(value)) + geom_histogram()
test %$% n_distinct(value)
## [1] 10000
test %>% ggplot(aes(value)) + geom_histogram(bins = 10000)
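To put numbers on what the 10,000-bin version is doing, here is a rough sketch using base R's `hist()` with `plot = FALSE`, used only to get at the counts. The break points are my own choice and won't match ggplot2's exactly, and `value`, `fine`, and `coarse` are illustrative names.

```r
set.seed(42)
value <- rnorm(10000)

# 10,000 equal-width bins spanning the observed range
fine <- hist(value,
             breaks = seq(min(value), max(value), length.out = 10001),
             plot = FALSE)

# 30 equal-width bins over the same range, for comparison
coarse <- hist(value,
               breaks = seq(min(value), max(value), length.out = 31),
               plot = FALSE)

print(table(fine$counts))      # how many fine bins hold 0, 1, 2, ... observations
print(mean(fine$counts == 0))  # share of fine bins that are empty
print(range(coarse$counts))    # the coarse bins pool far more observations each
```

The coarse bins average over many observations apiece, which is what smooths the picture. The fine bins each hold only a few, so the sampling noise shows through.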
How about if we mimic cognitive ability scores, with a mean of 100 and a standard deviation of 15? I'll even force it to have whole numbers, so we don't have decimal places to deal with.
CogAbil <- tibble(Person = c(1:10000), Ability = rnorm(10000, mean = 100, sd = 15))
CogAbil <- CogAbil %>% mutate(Ability = round(Ability, digits = 0))
CogAbil %$% n_distinct(Ability)
## [1] 103
CogAbil %>% ggplot(aes(Ability)) + geom_histogram() + labs(title = "With 30 bins") + theme(plot.title = element_text(hjust = 0.5))
CogAbil %>% ggplot(aes(Ability)) + geom_histogram(bins = 103) + labs(title = "With 1 bin per whole-point score") + theme(plot.title = element_text(hjust = 0.5))
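One caveat worth flagging, as my reading of ggplot2's binning rather than anything from the original analysis: `bins = 103` divides the observed range into 103 equal-width bins. That only lines up with one bar per whole-point score when there are no gaps between the lowest and highest score. Here is a sketch that sets `binwidth = 1` instead. It regenerates its own sample, so the numbers won't match the draw above, and `n_scores`, `n_points`, and `p` are illustrative names.

```r
library(tibble)
library(dplyr)
library(ggplot2)

set.seed(42)
# A fresh simulated sample (not the same draw as the one in the post)
CogAbil <- tibble(Person = 1:10000,
                  Ability = round(rnorm(10000, mean = 100, sd = 15)))

# Distinct scores observed vs. whole points spanned by the range;
# these differ whenever some scores in the tails never occur
n_scores <- n_distinct(CogAbil$Ability)
n_points <- diff(range(CogAbil$Ability)) + 1
print(c(distinct = n_scores, spanned = n_points))

# binwidth = 1 with center = 0 puts one bar on each whole-point score
p <- CogAbil %>%
  ggplot(aes(Ability)) +
  geom_histogram(binwidth = 1, center = 0) +
  labs(title = "With binwidth = 1") +
  theme(plot.title = element_text(hjust = 0.5))
print(p)
```

If `distinct` and `spanned` come out equal, the two approaches should give the same picture. If not, `binwidth = 1` keeps each bar tied to a single score.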
This is not to say histograms lie - they simplify. They're meant to show the overall shape of a distribution, not to pin down what's happening at any individual score, which is how many people try to use them.