9  Hypothesis testing

In scientific studies, you’ll often see phrases like “the results are statistically significant”. This points to a technique called hypothesis testing, where we use p-values, a type of probability, to test our initial assumption or hypothesis.

In hypothesis testing, rather than providing an estimate of the parameter we’re studying, we provide a probability that serves as evidence supporting or contradicting a specific hypothesis. The hypothesis usually involves whether a parameter is different from a predetermined value (often 0).

Hypothesis testing is used when you can phrase your research question in terms of whether a parameter differs from this predetermined value. It’s applied in various fields, asking questions like: Does a medication extend the lives of cancer patients? Does an increase in gun sales correlate with more gun violence? Does class size affect test scores?

Take for instance the previously used example with colored beads. We might not care about the exact proportion of blue beads, but instead ask: Are there more blue beads than red ones? This could be rephrased as asking if the proportion of blue beads is more than 0.5.

The initial hypothesis that the parameter equals the predetermined value is called the “null hypothesis”. It’s popular because it allows us to focus on the data’s properties under this null scenario. Once data is collected, we estimate the parameter and calculate the p-value, which is the probability of obtaining an estimate as extreme as the one observed if the null hypothesis is true. If the p-value is small, the observed data is unlikely under the null hypothesis, which offers evidence against it.

We will see more examples of hypothesis testing in Chapter 16.

9.1 p-values

Suppose we take a random sample of \(N=100\) beads and we observe \(52\) blue ones, which gives us \(\bar{X} = 0.52\). This seems to point to the existence of more blue than red beads, since 0.52 is larger than 0.5. However, we know there is chance involved in this process: we could get 52 blue beads even when the actual proportion is \(p=0.5\). We call the assumption that \(p = 0.5\) the null hypothesis. The null hypothesis is the skeptic’s hypothesis.

We have observed a random variable \(\bar{X} = 0.52\) and the p-value is the answer to the question: how likely is it to see a value this large, when the null hypothesis is true? If the p-value is small enough we reject the null hypothesis and say that the results are statistically significant.

The p-value of 0.05 as a threshold for statistical significance is conventionally used in many areas of research. A cutoff of 0.01 is also used to define highly significant results. The choice of 0.05 is somewhat arbitrary and was popularized by the British statistician Ronald Fisher in the 1920s. We do not recommend using these cutoffs without justification, and we try to avoid the phrase “statistically significant”.

To obtain a p-value for our example we write:

\[\mbox{Pr}(\mid \bar{X} - 0.5 \mid > 0.02 ) \]

assuming \(p=0.5\). Under the null hypothesis we know that:

\[ \sqrt{N}\frac{\bar{X} - 0.5}{\sqrt{0.5(1-0.5)}} \]

is standard normal. We can therefore compute the probability above, which is the p-value:

\[\mbox{Pr}\left(\sqrt{N}\frac{\mid \bar{X} - 0.5\mid}{\sqrt{0.5(1-0.5)}} > \sqrt{N} \frac{0.02}{ \sqrt{0.5(1-0.5)}}\right)\]

N <- 100
z <- sqrt(N) * 0.02 / 0.5    ## number of standard errors 0.52 is from 0.5
1 - (pnorm(z) - pnorm(-z))   ## probability outside (-z, z): the p-value
#> [1] 0.689

In this case, there is actually a large chance, about 69%, of seeing a result as extreme as 52 blue beads under the null hypothesis.
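
We can double-check this calculation with a Monte Carlo simulation. The sketch below is our own addition: it draws many samples of size \(N=100\) under the null hypothesis and records how often the number of blue beads is at least as extreme as the observed 52:

B <- 10000
N <- 100
blue <- rbinom(B, N, 0.5)     ## blue bead counts in B simulated samples under the null
mean(abs(blue - N / 2) >= 2)  ## proportion at least as extreme as 52 blue beads

Because the count of blue beads is discrete, the simulated probability will not match the normal approximation exactly, but it confirms that an outcome like 52 blue beads is common when \(p = 0.5\).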

Keep in mind that there is a close connection between p-values and confidence intervals. If a 95% confidence interval of the spread does not include 0, we know that the p-value must be smaller than 0.05.
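
To see this connection in our example, we can compute the 95% confidence interval for the spread \(2p - 1\) based on \(\bar{X} = 0.52\). This is a quick check we add here, using the normal approximation:

N <- 100
x_hat <- 0.52
## 95% confidence interval for the spread 2p - 1
(2 * x_hat - 1) + c(-1.96, 1.96) * 2 * sqrt(x_hat * (1 - x_hat) / N)
#> [1] -0.156  0.236

The interval includes 0, consistent with the p-value of 0.69 being well above 0.05.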

To learn more about p-values, you can consult any statistics textbook. However, in general, we prefer reporting confidence intervals over p-values since confidence intervals give us an idea of the size of the estimate. If we just report the p-value, we provide no information about the practical significance of the finding in the context of the problem.

We can show mathematically that if a \((1-\alpha)\times 100\)% confidence interval does not contain the null hypothesis value, the null hypothesis is rejected with a p-value as small or smaller than \(\alpha\). So statistical significance can be determined from confidence intervals. However, unlike the confidence interval, the p-value does not provide an estimate of the magnitude of the effect. For this reason, we recommend avoiding p-values whenever you can compute a confidence interval.

9.2 Power

Pollsters are judged not by whether they provide correct confidence intervals, but rather by whether they correctly predict who will win. When we took a sample of 25 beads, the confidence interval for the spread:

N <- 25
x_hat <- 0.48
## 95% confidence interval for the spread 2p - 1
(2 * x_hat - 1) + c(-1.96, 1.96) * 2 * sqrt(x_hat * (1 - x_hat) / N)
#> [1] -0.432  0.352

includes 0. If this were a poll and we were forced to make a declaration, we would have to say it was a “toss-up”.

A problem with our poll results is that, given the sample size and the value of \(p\), we would have to sacrifice the low probability of an incorrect call to create an interval that does not include 0.

This does not mean that the election is close. It only means that we have a small sample size. In statistical textbooks this is called lack of power. In the context of polls, power is the probability of detecting spreads different from 0.

By increasing our sample size, we lower our standard error and therefore have a much better chance of detecting the direction of the spread.
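
As an illustration, the following sketch (our own addition, with a hypothetical true proportion of \(p = 0.52\)) approximates power by simulation: for each sample size, it computes the proportion of samples whose 95% confidence interval for the spread excludes 0:

p <- 0.52    ## hypothetical true proportion, assumed for this illustration
B <- 10000
Ns <- c(25, 100, 1000, 10000)
power <- sapply(Ns, function(N) {
  x_hat <- rbinom(B, N, p) / N                   ## simulated sample proportions
  se_hat <- 2 * sqrt(x_hat * (1 - x_hat) / N)    ## estimated SE of the spread
  mean(abs(2 * x_hat - 1) - 1.96 * se_hat > 0)   ## intervals that exclude 0
})
setNames(power, Ns)

For a spread this small, power is very low for \(N = 25\) but approaches 1 as \(N\) grows into the thousands, because the standard error shrinks at rate \(1/\sqrt{N}\).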

9.3 Exercises

  1. Generate a sample of size \(N=50\) from an urn model with 50% blue beads:
N <- 50 
p <- 0.5
x <- rbinom(N, 1, p)

Then compute a p-value testing if \(p=0.5\). Repeat this 10,000 times and report how often the p-value is lower than 0.05, leading us to incorrectly reject the null hypothesis. How often is it lower than 0.01?

  2. Make a histogram of the p-values you generated in exercise 1. Which of the following seems to be true?
    a. The p-values are all 0.05.
    b. The p-values are normally distributed; the CLT seems to hold.
    c. The p-values are uniformly distributed.
    d. The p-values
  3. (Advanced) Demonstrate mathematically why you see the histogram you see.

  4. Generate a sample of size \(N=50\) from an urn model with 52% blue beads:

N <- 50 
p <- 0.52
x <- rbinom(N, 1, p)

Then compute a p-value testing if \(p=0.5\). Repeat this 10,000 times and report how often the p-value is larger than 0.05, leading us to incorrectly fail to reject the null hypothesis. Note that you are computing 1 - power.

  5. Repeat the previous exercise, but for the following values:
values <- expand.grid(N = c(25, 50, 100, 500, 1000), p = seq(0.51, 0.75, 0.01))

Plot power as a function of \(N\) with a different color curve for each value of \(p\).