19 Association Tests

The statistical models studied up to now are appropriate for continuous outcomes. We have not yet discussed inference for binary, categorical, and ordinal data. To give a very specific example, we will consider a case study examining funding success rates in the Netherlands, categorized by gender.

19.1 Case study: Funding success rates

A 2014 PNAS paper¹ analyzed success rates from funding agencies in the Netherlands and concluded that their

results reveal gender bias favoring male applicants over female applicants in the prioritization of their “quality of researcher” (but not “quality of proposal”) evaluations and success rates, as well as in the language use in instructional and evaluation materials.

The main evidence supporting this conclusion is based on a comparison of the percentages. Table S1 in the paper includes the information we need. Here are the three columns showing the overall outcomes:

library(tidyverse)
library(dslabs)
research_funding_rates |>
  select(discipline, applications_total, success_rates_total) |> head()
#>           discipline applications_total success_rates_total
#> 1  Chemical sciences                122                26.2
#> 2  Physical sciences                174                20.1
#> 3            Physics                 76                26.3
#> 4         Humanities                396                16.4
#> 5 Technical sciences                251                17.1
#> 6  Interdisciplinary                183                15.8

We also have these values for each gender:

names(research_funding_rates)
#>  [1] "discipline"          "applications_total"  "applications_men"   
#>  [4] "applications_women"  "awards_total"        "awards_men"         
#>  [7] "awards_women"        "success_rates_total" "success_rates_men"  
#> [10] "success_rates_women"

We can compute the totals that were successful, and the totals that were not, as follows:

totals <- research_funding_rates |> 
  select(-discipline) |> 
  summarize_all(sum) |>
  summarize(yes_men = awards_men, 
            no_men = applications_men - awards_men, 
            yes_women = awards_women, 
            no_women = applications_women - awards_women)

We see that a larger percent of men than women received awards:

totals |> summarize(percent_men = yes_men/(yes_men + no_men),
                    percent_women = yes_women/(yes_women + no_women))
#>   percent_men percent_women
#> 1       0.177         0.149

But could this be due just to random variability? Here we learn how to perform inference for this type of data.

19.2 Lady Tasting Tea

R.A. Fisher² was one of the first to formalize hypothesis testing. The “Lady Tasting Tea” is one of the most famous examples.

The story is as follows: an acquaintance of Fisher claimed that she could tell if milk was added before or after tea was poured. Fisher was skeptical and, consequently, designed an experiment to test this claim. He gave her four pairs of cups of tea: one with milk poured first, the other after. The order was randomized. The null hypothesis here is that she is guessing. Fisher derived the distribution for the number of correct picks on the assumption that the choices were random and independent.

As an example, suppose she identified 3 out of 4 correctly. Do we believe she has a special ability? The basic question we ask is: if the tester is actually guessing, what are the chances that she gets 3 or more correct? Just as we have done before, we can compute a probability under the null hypothesis that she is guessing for all 4. Under this null hypothesis, we can think of this particular example as picking 4 balls out of an urn with 4 blue (correct answer) and 4 red (incorrect answer) balls. Remember, she knows that there are four before tea and four after.

Under the null hypothesis that she is simply guessing, each ball has the same chance of being picked. We can then use combinations to determine each probability. The probability of picking 3 is $\binom{4}{3} \binom{4}{1} / \binom{8}{4} = 16/70$. The probability of picking all 4 correct is $\binom{4}{4} \binom{4}{0} / \binom{8}{4}= 1/70$. Thus, the chance of observing a 3 or something more extreme, under the null hypothesis, is $\approx 0.24$. This is the p-value. The procedure that produced this p-value is called Fisher’s exact test and it uses the hypergeometric distribution.

According to accounts, the lady did succeed under Fisher’s test. Fisher’s point wasn’t to prove her right or wrong, but to show how to design an experiment and a statistical test to evaluate such claims.

19.3 Two-by-two tables

The data from such experiments is usually summarized by a table like this:

tab <- matrix(c(3,1,1,3),2,2)
rownames(tab) <- c("Poured Before", "Poured After")
colnames(tab) <- c("Guessed before", "Guessed after")
tab
#>               Guessed before Guessed after
#> Poured Before              3             1
#> Poured After               1             3

A two-by-two table summarizes the relationship between two binary variables by displaying the observed counts for each of the four possible combinations.

The function fisher.test can be used to carry out the inference described above. For example:

fisher.test(tab, alternative = "greater")$p.value
#> [1] 0.243

In the next chapter, we will construct a two-by-two table for the funding success data. However, that analysis will require a different test than the hypergeometric approach.

19.4 Chi-square Test

Notice that, in a sense, our funding rates example is similar to the Lady Tasting Tea. However, in the Lady Tasting Tea example, the number of blue and red beads is experimentally fixed and the number of answers given for each category is also fixed. This is because Fisher ensured there were four cups with milk poured before tea and four cups with milk poured after, and the lady knew this. Therefore, the answers would also have to include four before and four afters. In this case, the sum of the rows and the sum of the columns are fixed. This defines constraints on the possible ways we can fill the two by two table and also allows us to use the hypergeometric distribution. In general, this is not the case. Nonetheless, there is another approach, the Chi-squared test, which is described below.

Imagine we have four totals of applicants (290, 1,345, 177, 1,011), some are men and some are women, and some get funded while others do not. We saw that the success rates for men and women respectively were:

totals |> summarize(percent_men = yes_men/(yes_men + no_men),
                    percent_women = yes_women/(yes_women + no_women))
#>   percent_men percent_women
#> 1       0.177         0.149

Would we see this again if we randomly assign funding at the overall rate:

rate <- with(totals, (yes_men + yes_women))/sum(totals)
rate
#> [1] 0.165

The Chi-square test answers this question. The first step is to create the observed two-by-two data table:

o <- with(totals, data.frame(men = c(no_men, yes_men),
                             women = c(no_women, yes_women),
                             row.names = c("no", "yes")))

The general idea of the Chi-square test is to compare the observed two-by-two table to what you expect to see, which would be:

e <- with(totals, data.frame(men = (no_men + yes_men) * c(1 - rate, rate),
                             women = (no_women + yes_women) * c(1 - rate, rate),
                             row.names = c("no", "yes")))

We can see that more men than expected and fewer women than expected received funding.

cbind(o, e)
#>      men women  men women
#> no  1345  1011 1365   991
#> yes  290   177  270   197

However, under the null hypothesis these observations are random variables. The Chi-square statistic quantifies how much the observed tables deviates from the expected by:

Taking the difference between each observed and expected cell value.
Squaring this difference.
Dividing each squared difference by the expected value.
Summing all these values together to get the final statistic.

sum((o - e)^2/e)
#> [1] 4.01

The Chi-square test tells us how likely it is to see a deviation this large or larger. This test uses an asymptotic result, similar to the CLT, related to the sums of independent binary outcomes. The R function chisq.test takes a two-by-two table and returns the results from the test:

chisq_test <- chisq.test(o, correct = FALSE)

We see that the p-value is 0.045:

chisq_test$p.value
#> [1] 0.0451

By default, the chisq.test function applies a continuity correction achieved by subtracting 0.5 in the following way:

sum((abs(o - e) - 0.5)^2/e)
#> [1] 3.81

Note that it matches the default behavior:

chisq.test(o)$statistic
#> X-squared 
#>      3.81

19.5 Generalized linear models

We presented a way to perform hypothesis testing for determining if there is association between two binary outcomes. But we have not yet described how to quantify effects. Can we estimate the effect of being a woman in funding success in the Netherlands? Note that if our outcomes are binary, then the linear models presented in Chapter 18 are not appropriate because the $\beta$s and $\varepsilon$ are continuous. However, an adaptation of these methods, that is widely used in, for example, medical studies, gives us a way to estimate effects along with their standard errors.

The idea is to model a transformation of the expected value of the outcomes with a linear model. The transformation is selected so that any continuous value is possible. The mathematical equation for a model with one variable looks like this:

\[ g\{\mathrm{E}(Y_i)\} = \beta_0 + \beta_1 x_i \]

To finish describing the model, we impose a distribution on $Y$, such as binomial or Poisson. These are referred to as generalized linear models.

We illustrate this with the funding rates example. We define $Y_i$ to be 1 if person $i$ received funding and 0 otherwise, and $x_i$ to be 1 for person $i$ is a woman and 0 if a man. For this data, the expected value of $Y_i$ is the probability of funding for person $i$ $\mathrm{Pr}(Y_i=1)$. We assume the outcomes $Y_i$ are binomial, with $N=1$ and probability $p_i$. For binomial data, the most widely used transformation is the logit function $g(p) = \log \{p / (1-p)\}$, which maps numbers between 0 and 1 to any continuous number. The model looks like this:

\[ \log \frac{\mathrm{Pr}(Y_i=1)}{1-\mathrm{Pr}(Y_i=1)} = \beta_0 + \beta_1 x_i \]

The log odds ratio

To understand how $\beta_1$ can be used to quantify the effect of being a woman on success rates, first note that $\mathrm{Pr}(Y_i=1)/\left(1-\mathrm{Pr}(Y_i=1)\right) = \mathrm{Pr}(Y_i=1)/\mathrm{Pr}(Y_i=0)$ is the odds of person $i$ getting funding: the ratio of the probability of success and probability of failure. This implies that $e^{\beta_0}$ is the odds for men and $e^{\beta_0}e^{\beta_1}$ is the odds for women, which implies $e^{\beta_1}$ is the odds for women divided by the odds for men. This quantity is called the odds ratio. To see this, note that if use $p_1$ and $p_0$ to denote the probability of success for women and men, respectively, then $e^{\beta_1}$ can be rewritten as:

\[ e^{\beta_1} = \frac{p_1}{1-p_1} \, / \, \frac{p_0}{1-p_0} \]

The parameter $\beta_1$ therefore quantifies the log odds ratio.

Now how do we estimate these parameters? Although the details are not described in this book, least squares is no longer the optimal way and instead we use an approach called maximum likelihood estimation (MLE).

More advanced mathematical derivations show that a version of the central limit theorem applies, and the estimates obtained this way are approximately normal when the number of observations is large. The theory also provides a way to calculate standard errors for the estimates of the $\beta$s.

Fitting the model

To obtain the maximum likelihood estimates using R, we can use the glm function with the family argument set to binomial. This defaults to using the logit transformation. Note that we do not have the individual level data, but because our model assumes the probability of success is the same for all women and all men, then the number of success can be modeled as binomial with $N_1$ trials and probability $p_1$ for women and binomial with $N_0$ trials and probability $p_0$ for men, where $N_1$ and $N_0$ are the total number of women and men. In this case, the glm function is used as follows:

success <- with(totals, c(yes_men, yes_women))
failure <- with(totals, c(no_men, no_women))
gender <- factor(c("men", "women"))
fit <- glm(cbind(success, failure) ~ gender, family = "binomial") 
coefficients(summary(fit))
#>             Estimate Std. Error z value  Pr(>|z|)
#> (Intercept)   -1.534     0.0647   -23.7 3.83e-124
#> genderwomen   -0.208     0.1041    -2.0  4.54e-02

The estimate of the odds ratio is exp(fit$coef[2]) = 0.811982, interpreted as the odds being lowered by 20% for women compared to men. But is this due to chance? We already noted that the p-value is about 0.05, but the GLM approach also permits us to compute confidence intervals using the confint function. To show the interval for the more interpretable odds ratio, we simply exponentiate:

exp(confint(fit, 2))
#>  2.5 % 97.5 % 
#>  0.661  0.995

We have used a simple version of GLMs in which the only variable is binary. However, the method can be expanded to incorporate multiple variables, including continuous ones. In these contexts, the log odds ratio interpretation becomes more complex. Also, note that we have shown just one version of GLM appropriate for binomial data using a logit transformation. This version is often referred to as logistic regression. Nevertheless, GLM can be used with other transformation and distributions. You can learn more by consulting a GLM textbook.

Simple standard error approximation for two-by-two table odds ratio

Using glm, we can obtain estimates, standard errors, and confidence intervals for a wide range of models. To do this, we use a rather complex algorithm. In the case of two-by-two tables, we can obtain a standard error for the log odds ratio using a simple approximation.

If two-by-two tables have the following entries:

	Men	Women
Awarded	a	b
Not Awarded	c	d

In this case, the log odds ratio is simply

\[ \frac{a/c}{b/d} = \frac{ad}{bc}, \]

and the standard error for the log odds ratio is:

\[ \sqrt{1/a + 1/b + 1/c + 1/d} \]

This implies that a 95% confidence interval for the log odds ratio can be formed by:

\[ \log\left(\frac{ad}{bc}\right) \pm 1.96 \sqrt{1/a + 1/b + 1/c + 1/d} \]

We can easily compute these quantities in R using:

two_by_two <- with(totals, data.frame(men = c(no_men, yes_men),
                                      women = c(no_women, yes_women)))
or <- two_by_two[1,1]*two_by_two[2,2]/(two_by_two[2,1]*two_by_two[1,2])
se <- sqrt(sum(1/two_by_two))

To form a confidence interval for the odds ration we simply exponentiate the confidence interval for the log odds ratio:

exp(log(or) + c(-1,1) * qnorm(0.975) * se)
#> [1] 0.662 0.996

We can confirm that we obtain the same estimate as when using glm, stored in fit:

c(log(or), tidy(fit)$estimate[2])
#> [1] -0.208 -0.208
c(se, tidy(fit)$std.error[2])
#> [1] 0.104 0.104

Keep in mind that the p-values obtained with chisq.test and glm are slightly different. This is because these are both based on different approximation approaches.

19.6 Large samples, small p-values

As mentioned earlier, reporting only p-values is not always a useful way to report the results of data analysis. In scientific journals, some studies seem to overemphasize p-values. Some of these studies have large sample sizes and report impressively small p-values. Yet by looking closely at the results, we realize that the odds ratios are quite modest: barely bigger than 1. In this case, the difference may not be practically significant or scientifically significant.

Note that the relationship between odds ratio and p-value is not one-to-one; it depends on the sample size. Therefore, a very small p-value does not necessarily mean a very large odds ratio. Observe what happens to the p-value if we multiply our two-by-two table by 10, which does not change the odds ratio:

two_by_two_x_10 <- two_by_two |> mutate(men = men*10, women = women*10) 
chisq.test(two_by_two_x_10)$p.value
#> [1] 2.63e-10

Also, note that the log odds ratio is not defined if any of the cells of the two-by-two table is 0. This is because if $a$, $b$, $c$, or $d$ are 0, the $\log(\frac{ad}{bc})$ is either the log of 0 or has a 0 in the denominator. For this situation, it is common practice to avoid 0s by adding 0.5 to each cell. This is referred to as the Haldane–Anscombe correction and has been shown, both in practice and theory, to work well.

19.7 Exercises

1. A famous athlete boasts an impressive career, winning 70% of her 500 career matches. Nevertheless, this athlete is criticized because in important events, such as the Olympics, she has a losing record of 8 wins and 9 losses. Perform a Chi-square test to determine if this losing record can be simply due to chance as opposed to not performing well under pressure.

2. Why did we use the Chi-square test instead of Fisher’s exact test in the previous exercise?

It actually does not matter, since they give the exact same p-value.
Fisher’s exact and the Chi-square are different names for the same test.
Because the sum of the rows and columns of the two-by-two table are not fixed so the hypergeometric distribution is not an appropriate assumption for the null hypothesis. For this reason, Fisher’s exact test is rarely applicable with observational data.
Because the Chi-square test runs faster.

3. Compute the odds ratio of “losing under pressure” along with a confidence interval.

4. Notice that the p-value is larger than 0.05, but the 95% confidence interval does not include 1. What explains this?

We made a mistake in our code.
These are based on t-statistics so the connection between p-value and confidence intervals does not apply.
Different approximations are used for the p-value and the confidence interval calculation. If we had a larger sample size, the match would be better.
We should use the Fisher exact test to get confidence intervals.

5. Multiply the two-by-two table by 2 and see if the p-value and confidence interval are a better match.

6. FIX Use the research_funding_rates data to estimate the log odds ratio and standard errors comparing women to men for each discipline. Compute a confidence interval and report all the disciplines for which one gender appears to be favored over the other.

7. Divide the log odds ratio estimates by their respective standard errors and generate a qqplot comparing these to a standard normal. Do any of the disciplines deviate from what is expected by chance?

8. During the 2016 US presidential election, then candidate Donald J. Trump used his twitter account as a way to communicate with potential voters. Todd Vaziri hypothesized that “Every non-hyperbolic tweet is from iPhone (his staff). Every hyperbolic tweet is from Android (from him).” We will test this hypothesis using association tests. The dslabs object sentiment_counts provides a table with the counts for several sentiments from each source (Android or iPhone):

library(tidyverse)
library(dslabs)
sentiment_counts

Compute an odds ratio comparing Android to iPhone for each sentiment and add it to the table.

9. Compute a 95% confidence interval for each odds ratio.

10. Generate a plot showing the estimated odds ratios along with their confidence intervals.

11. FIX Test the null hypothesis that there is no difference between tweets from Android and iPhone and report the sentiments with p-values less than 0.05 and more likely to come from Android.

12. For each sentiment, find the words assigned to that sentiment, keep words that appear at least 25 times, compute the odd ratio for each, and show a barplot for those with odds ratio larger than 2 or smaller than 1/2.

http://www.pnas.org/content/112/40/12349.abstract↩︎
https://en.wikipedia.org/wiki/Ronald_Fisher↩︎