Sunday, November 5, 2017

Statistics Sunday: Random versus Pseudo-Random

One of the key concepts behind statistics is the idea of "random" - random variables, random selection, random assignment (when we start getting into experimentation and the analyses that go along with that), even random effects. But as with likelihood, this is a term that gets thrown around a lot but rarely discussed in terms of what it actually means.

Exacerbating the problem is that random is often used colloquially to mean something very different from its meaning in scientific and statistical applications. In layman's terms, we often use the word random to mean something was unexpected. Even I'm guilty of using the word "random" in this way - in fact, one of my favorite jokes to make is that something was "random, emphasis on the dom (dumb)."

But in statistics, we use random in very specific ways. When we refer to random variables, we mean something that is "free to vary." When we start talking about things like random selection, we usually mean that each case has an equal chance of being chosen, but even then, we mean that the selection process is free to vary. There is no set pattern, such as picking every third case. In either of these instances, the resulting random thing is not unexpected. We can quantify the probability of the different outcomes. But we're allowing things to vary as they will.
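To make this concrete, here's a minimal R sketch of simple random selection (the 500 case IDs and the sample size of 50 are made up for illustration). The sample() function draws cases so that each one has an equal chance of being chosen, with no fixed pattern like taking every third case:

case_ids <- 1:500                      # hypothetical sampling frame of 500 cases
chosen <- sample(case_ids, size = 50)  # 50 cases drawn without replacement, each equally likely
chosen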

There are a variety of instances of random processes, what we sometimes call stochastic processes. You may recall a previous blog post about random walks and martingales. Things like white noise and the behavior of the stock market are examples of stochastic processes. Even the simple act of flipping a coin multiple times is a sort of stochastic process. We can very easily quantify an outcome, or even a set of outcomes, but we allow each outcome to vary naturally.
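As a small illustration (the number of flips and steps here are arbitrary choices of mine), both coin flipping and a simple random walk can be simulated in a line or two of R:

flips <- rbinom(10, size = 1, prob = 0.5)              # 10 fair coin flips: 1 = heads, 0 = tails
walk <- cumsum(sample(c(-1, 1), 100, replace = TRUE))  # random walk: running sum of 100 +/-1 steps
flips
tail(walk, 5)                                          # the walk's last few positions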

Unsurprisingly, people often use computers to generate random numbers for them. Computers are great for generating large sets of random numbers. In fact, I often use R to generate random datasets for me, and I can set constraints on what I want that dataset to look like. For instance, let's say I want to generate a random dataset with two groups, experimental and control, and I want to ensure they have different means but similar standard deviations. I did something much like this when I demonstrated the t-test:

experimental <- data.frame(group = 1, score = rnorm(100, 75.0, 15.0))  # 100 scores centered at 75, SD 15
control <- data.frame(group = 2, score = rnorm(100, 50.0, 15.0))       # 100 scores centered at 50, SD 15
full <- rbind(experimental, control)                                   # stack the two groups into one dataset
library(psych)
describe(full)

This code gives me a dataset with 200 observations, 100 for each group. The experimental group is set to have a mean of approximately 75, and the control group a mean of approximately 50. The rnorm command tells R that I want random draws from a normal distribution with the mean and standard deviation I specify.
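For a quick check on that structure, you can peek at the combined data frame (your exact scores will differ on each run, since no seed has been set yet):

head(full, 3)      # first few rows: the group label and the randomly generated score
table(full$group)  # confirms 100 observations in each of the two groups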

Based on the describe command, the overall dataset has a mean score of 63.18, a standard deviation of 19.94, skewness of 0.12 and kurtosis of -0.69. A random dataset, right?

But...


That's right, computers don't actually give you random numbers. They give you pseudo-random numbers: numbers generated to mimic stochastic processes. For all intents and purposes, you can treat them as random, but technically, they aren't. Instead, when you ask the program to generate random numbers for you, it runs a deterministic algorithm that, starting from a seed value, produces a long sequence of numbers that behaves like a random one.
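If you're curious what R is doing under the hood, you can ask it which generator it's using. On a default installation, this is the Mersenne Twister, a deterministic algorithm that starts from a seed and churns out a very long sequence of numbers that behaves, statistically, like random draws:

RNGkind()  # reports the pseudo-random number generator in use; "Mersenne-Twister" by default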

But there is an upside to this. You can recreate any string of random numbers anytime you need to. You do this by setting your seed - telling the program where to start in its sequence of pseudo-random numbers. This means that I can generate some random numbers and then later recreate that exact same set. Let's test this out, shall we? First, I need to tell R to use a specific random number seed.

set.seed(35)  # fix the starting point of the pseudo-random sequence
experimental <- data.frame(group = 1, score = rnorm(100, 75.0, 15.0))
control <- data.frame(group = 2, score = rnorm(100, 50.0, 15.0))
full <- rbind(experimental, control)
library(psych)
describe(full)


This generates a dataset with the following descriptive results: mean = 66.02, SD = 19.06, median = 66.43, min = 22.54, max = 125.07, skewness = 0.09, and kurtosis = -0.22.

Now, let's copy that exact code and run it again; the only change this time is the object names, so we should be generating entirely new datasets.

set.seed(35)  # same seed as before
experimental2 <- data.frame(group = 1, score = rnorm(100, 75.0, 15.0))
control2 <- data.frame(group = 2, score = rnorm(100, 50.0, 15.0))
full2 <- rbind(experimental2, control2)
library(psych)
describe(full2)


And I get a dataset with the following descriptive results: mean = 66.02, SD = 19.06, median = 66.43, min = 22.54, max = 125.07, skewness = 0.09, and kurtosis = -0.22.

Everything is exactly the same. This is because I told R at the beginning to use the same seed as before; calling the set.seed command a second time sends the generator back to the start of that same sequence. As a result, I can recreate my "randomly" generated dataset perfectly. But because I can recreate it every time, the numbers aren't actually free to vary, so they are not truly random.
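Here's the same idea stripped down to a few lines (the object names are just for illustration): reset the seed, draw again, and the two sets of draws are bit-for-bit identical.

set.seed(35)
first_draw <- rnorm(5)
set.seed(35)                        # resetting the seed rewinds the generator
second_draw <- rnorm(5)
identical(first_draw, second_draw)  # TRUE: the "random" draws repeat exactly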

2 comments:

  1. This is a big problem in software security. Modern encryption depends on large prime numbers, and they need to be as random as possible. So in software we talk about cryptographically strong RNGs. See, e.g., https://en.wikipedia.org/wiki/Cryptographically_secure_pseudorandom_number_generator

  2. I love using http://random.org which claims to generate really random, rather than just pseudo-random, numbers (produced by a sort of Brownian motion machine, I think). The concept of randomness is fascinating philosophically, I find it very hard to get a grip on what it "means" at an ontological level: obviously random sequences are impossible to predict, but once you take away the observer, what makes things random or non-random? Is anything ever *completely* random?
