12  Bayesian statistics

In 2016, FiveThirtyEight showed this chart depicting distributions for the percent of the popular vote for each candidate:

The colored areas represent values with an 80% chance of including the actual result, according to the FiveThirtyEight model.

But what does this mean in the context of the theory we have previously covered, in which these percentages are considered fixed? Furthermore, election forecasters make probabilistic statements such “Obama has a 90% chance of winning the election.” Note that in the context of an urn model, this would be equivalent to stating that the probability \(p>0.5\) is 90%. However, the urn model \(p\) is a fixed parameter and it does not make sense to talk about probability. With Bayesian statistics, we assume \(p\) is random variable, and thus, a statement such as “90% chance of winning” is consistent with the mathematical approach. Forecasters also use models to describe variability at different levels. For example, sampling variability, pollster to pollster variability, day to day variability, and election to election variability. One of the most successful approaches used for this are hierarchical models, which can be explained in the context of Bayesian statistics.

The approach described in the previous chapters, where the parameters are considered fixed, is often referred to as frequentist.

In this chapter, we will briefly describe Bayesian statistics. We use three cases studies: 1) interpreting diagnostic tests for a rare disease, 2) predicting the performance of an athlete, and 3) estimating the probability of Hillary Clinton winning in 2016 using pre-election poll data. For an in-depth treatment of this topic, we recommend one of the following textbooks:

12.1 Bayes theorem

We start by describing Bayes theorem, using a hypothetical cystic fibrosis test as an example. Suppose a test for cystic fibrosis has an accuracy of 99%. We will use the following notation:

\[ \mbox{Pr}(+ \mid D=1)=0.99, \mbox{Pr}(- \mid D=0)=0.99 \]

with \(+\) meaning a positive test and \(D\) representing if you actually have the disease (1) or not (0).

Imagine we select a random person and they test positive. What is the probability that they have the disease? We write this as \(\mbox{Pr}(D=1 \mid +)?\) The cystic fibrosis rate is 1 in 3,900, which implies that \(\mbox{Pr}(D=1)=0.00025\). To answer this question, we will use Bayes theorem, which in general tells us that:

\[ \mbox{Pr}(A \mid B) = \frac{\mbox{Pr}(B \mid A)\mbox{Pr}(A)}{\mbox{Pr}(B)} \]

This equation, when applied to our problem, becomes:

\[ \begin{aligned} \mbox{Pr}(D=1 \mid +) & = \frac{ P(+ \mid D=1) \cdot P(D=1)} {\mbox{Pr}(+)} \\ & = \frac{\mbox{Pr}(+ \mid D=1)\cdot P(D=1)} {\mbox{Pr}(+ \mid D=1) \cdot P(D=1) + \mbox{Pr}(+ \mid D=0) \mbox{Pr}( D=0)} \end{aligned} \]

Plugging in the numbers, we get:

\[ \frac{0.99 \cdot 0.00025}{0.99 \cdot 0.00025 + 0.01 \cdot (.99975)} = 0.02 \]

According to the above, despite the test having 0.99 accuracy, the probability of having the disease given a positive test is only 0.02. This might seem counter-intuitive to some, but it is because we must factor in the very rare probability that a randomly chosen person has the disease. To illustrate this, we run a Monte Carlo simulation.

12.1.1 Bayes theorem simulation

The following simulation is meant to help you visualize Bayes theorem. We start by randomly selecting 100,000 people from a population in which the disease in question has a 1 in 4,000 prevalence.

prev <- 0.00025
N <- 100000
outcome <- sample(c("Disease","Healthy"), N, replace = TRUE, 
                  prob = c(prev, 1 - prev))

Note that there are very few people with the disease:

N_D <- sum(outcome == "Disease")
N_D
#> [1] 23
N_H <- sum(outcome == "Healthy")
N_H
#> [1] 99977

Also, there are many people without the disease, which makes it more probable that we will see some false positives given that the test is not perfect. Now, each person gets the test, which is correct 99% of the time:

accuracy <- 0.99
test <- vector("character", N)
test[outcome == "Disease"]  <- sample(c("+", "-"), N_D, replace = TRUE, 
                                    prob = c(accuracy, 1 - accuracy))
test[outcome == "Healthy"]  <- sample(c("-", "+"), N_H, replace = TRUE, 
                                    prob = c(accuracy, 1 - accuracy))

Given that there are so many more controls than cases, even with a low false positive rate, we end up with more controls than cases in the group that tested positive:

table(outcome, test)
#>          test
#> outcome       -     +
#>   Disease     0    23
#>   Healthy 99012   965

From this table, we see that the proportion of positive tests that have the disease is 23 out of 988. We can run this over and over again to see that, in fact, the probability converges to about 0.022.

12.2 Priors, posteriors and and credible intervals

In the previous chapter, we computed an estimate and margin of error for the difference in popular votes between Hillary Clinton and Donald Trump. We denoted the parameter, the the difference in popular votes, with \(\mu\). The estimate was between 2 and 3 percent, and the confidence interval did not include 0. A forecaster would use this to predict Hillary Clinton would win the popular vote. But to make a probabilistic statement about winning the election, we need to use a Bayesian approach.

We start the Bayesian approach by quantifying our knowledge before seeing any data. This is done using a probability distribution referred to as a prior. For our example, we could write:

\[ \mu \sim N(\theta, \tau) \]

We can think of \(\theta\) as our best guess for the popular vote difference had we not seen any polling data, and we can think of \(\tau\) as quantifying how certain we feel about this guess. Generally, if we have expert knowledge related to \(\mu\), we can try to quantify it with the prior distribution. In the case of election polls, experts use fundamentals, which include, for example, the state of the economy, to develop prior distributions.

The data is used to update our initial guess or prior belief. This can be done mathematically if we define the distribution for the observed data for any given \(\mu\). In our particular example, we would write down a model the average of our polls, which is the same as before:

\[ \bar{X} \mid \mu \sim N(\mu, \sigma/\sqrt{N}) \]

As before, \(\sigma\) describes randomness due to sampling and the pollster effects. In the Bayesian contexts, this is referred to as the sampling distribution. Note that we write the conditional \(\bar{X} \mid \mu\) because \(\mu\) is now considered a random variable.

We do not show the derivations here, but we can now use calculus and a version of Bayes’ Theorem to derive the distribution of \(\mu\) conditional of the data, referred to as the posterior distribution. Specifically, we can show the \(\mu \mid \bar{X}\) follows a normal distribution with expected value:

\[ \begin{aligned} \mbox{E}(\mu \mid \bar{X}) &= B \theta + (1-B) \bar{X}\\ &= \theta + (1-B)(\bar{X}-\theta)\\ \mbox{with } B &= \frac{\sigma^2/N}{\sigma^2/N+\tau^2} \end{aligned} \] and standard error :

\[ \mbox{SE}(\mu \mid \bar{X})^2 = \frac{1}{1/\sigma^2+1/\tau^2}. \]

Note that the expected value is a weighted average of our prior guess \(\theta\) and the observed data \(\bar{X}\). The weight depends on how certain we are about our prior belief, quantified by \(\tau\), and the precision \(\sigma/N\) of the summary of our observed data. This weighted average is sometimes referred to as shrinking because it shrinks estimates towards a prior value.

These quantities are useful for updating our beliefs. Specifically, we use the posterior distribution not only to compute the expected value of \(\mu\) given the observed data, but also, for any probability \(\alpha\), we can construct intervals centered at our estimate and with \(\alpha\) chance of occurring.

To compute a posterior distribution and construct a credible interval, we define a prior distribution with mean 0% and standard error 3.5%, which can be interpreted as follows: before seeing polling data, we don’t think any candidate has the advantage, and a difference of up to 7% either way is possible. We compute the posterior distribution using the equations above:

theta <- 0
tau <- 0.035
sigma_n <- results$se
sigma <- one_poll_per_pollster |>
  summarise(sigma = sd(spread)) |>
  pull(sigma)
x_bar <- results$avg
B <- sigma_n^2 / (sigma_n^2 + tau^2)

posterior_mean <- B*theta + (1 - B)*x_bar
posterior_se <- sqrt(1/(1/sigma^2 + 1/tau^2))

posterior_mean
#> [1] 0.0281
posterior_se
#> [1] 0.0199

Since we know the posterior distribution is normal, we can construct a credible interval like this:

posterior_mean + c(-1, 1) * qnorm(0.975) * posterior_se
#> [1] 0.0160 0.0401

Furthermore, we can now make the probabilistic statement that we could not make with the frequentists approach by computing the posterior probability of Hillary winning the popular vote. Specifically, \(\mbox{Pr}(\mu>0 \mid \bar{X})\) can be computed as follows:

1 - pnorm(0, posterior_mean, posterior_se)
#> [1] 1

According to the above, we are 100% sure Clinton will win the popular vote, which seems too overconfident. Additionally, it is not in agreement with FiveThirtyEight’s 81.4%. What explains this difference? There is a level of uncertainty that we are not yet describing, and we will return to that in Chapter 13.

12.3 Exercises

1. In 1999, in England, Sally Clark1 was found guilty of the murder of two of her sons. Both infants were found dead in the morning, one in 1996 and another in 1998. In both cases, she claimed the cause of death was sudden infant death syndrome (SIDS). No evidence of physical harm was found on the two infants, so the main piece of evidence against her was the testimony of Professor Sir Roy Meadow, who testified that the chances of two infants dying of SIDS was 1 in 73 million. He arrived at this figure by finding that the rate of SIDS was 1 in 8,500, and then calculating that the chance of two SIDS cases was 8,500 \(\times\) 8,500 \(\approx\) 73 million. Which of the following do you agree with?

  1. Sir Meadow assumed that the probability of the second son being affected by SIDS was independent of the first son being affected, thereby ignoring possible genetic causes. If genetics plays a role then: \(\mbox{Pr}(\mbox{second case of SIDS} \mid \mbox{first case of SIDS}) > \mbox{P}r(\mbox{first case of SIDS})\).
  2. Nothing. The multiplication rule always applies in this way: \(\mbox{Pr}(A \mbox{ and } B) =\mbox{Pr}(A)\mbox{Pr}(B)\)
  3. Sir Meadow is an expert and we should trust his calculations.
  4. Numbers don’t lie.

2. Let’s assume that there is, in fact, a genetic component to SIDS and the probability of \(\mbox{Pr}(\mbox{second case of SIDS} \mid \mbox{first case of SIDS}) = 1/100\), is much higher than 1 in 8,500. What is the probability of both of her sons dying of SIDS?

3. Many press reports stated that the expert claimed the probability of Sally Clark being innocent was 1 in 73 million. Perhaps the jury and judge also interpreted the testimony this way. This probability can be written as the probability of a mother is a son-murdering psychopath given that two of her children are found dead with no evidence of physical harm. According to Bayes’ rule, what is this?

4. Assume that the chance of a son-murdering psychopath finding a way to kill her children, without leaving evidence of physical harm, is:

\[ \mbox{Pr}(A \mid B) = 0.50 \]

with A = two of her children are found dead with no evidence of physical harm, and B = a mother is a son-murdering psychopath = 0.50. Assume that the rate of son-murdering psychopaths mothers is 1 in 1,000,000. According to Bayes’ Theorem, what is the probability of \(\mbox{Pr}(B \mid A)\) ?

5/. After Sally Clark was found guilty, the Royal Statistical Society issued a statement saying that there was “no statistical basis” for the expert’s claim. They expressed concern at the “misuse of statistics in the courts”. Eventually, Sally Clark was acquitted in June 2003. What did the expert miss?

  1. He made an arithmetic error.
  2. He made two mistakes. First, he misused the multiplication rule and did not take into account how rare it is for a mother to murder her children. After using Bayes’ rule, we found a probability closer to 0.5 than 1 in 73 million.
  3. He mixed up the numerator and denominator of Bayes’ rule.
  4. He did not use R.

6. Florida is one of the most closely watched states in U.S. elections because it has many electoral votes. In past elections, Florida was a swing state where both Republicans and Democrats won implying it could affect a close elections.

Create the following table with the polls taken during the last two weeks:

library(tidyverse)
library(dslabs)
polls <- polls_us_election_2016 |> 
  filter(state == "Florida" & enddate >= "2016-11-04" ) |> 
  mutate(spread = rawpoll_clinton/100 - rawpoll_trump/100)

Take the average spread of these polls. The CLT tells us this average is approximately normal. Calculate an average and provide an estimate of the standard error. Save your results in an object called results.

7. Now assume a Bayesian model that sets the prior distribution for Florida’s election night spread \(\mu\) to follow a normal distribution with expected value \(\theta\) and standard deviation \(\tau\). What are the interpretations of \(\theta\) and \(\tau\)?

  1. \(\theta\) and \(\tau\) are arbitrary numbers that let us make probability statements about \(\mu\).
  2. \(\theta\) and \(\tau\) summarize what we would predict for Florida before seeing any polls. Based on past elections, we would set \(\mu\) close to 0, because both Republicans and Democrats have won, and \(\tau\) at about \(0.02\), because these elections tend to be close.
  3. \(\theta\) and \(\tau\) summarize what we want to be true. We therefore set \(\theta\) at \(0.10\) and \(\tau\) at \(0.01\).
  4. The choice of prior has no effect on Bayesian analysis.

8. The CLT tells us that our estimate of the spread \(\hat{\mu}\) has normal distribution with expected value \(\mu\) and standard deviation \(\sigma\) calculated in exercise 6. Use the formulas we provided for the posterior distribution to calculate the expected value of the posterior distribution if we set \(\theta = 0\) and \(\tau = 0.01\).

9. Now compute the standard deviation of the posterior distribution.

10. Using the fact that the posterior distribution is normal, create an interval that has a 95% probability of occurring centered at the posterior expected value. Note that we call these credible intervals.

11. According to this analysis, what was the probability that Trump wins Florida?

12. Now use sapply function to change the prior variance from seq(0.005, 0.05, len = 100) and observe how the probability changes by making a plot.


  1. https://en.wikipedia.org/wiki/Sally_Clark↩︎