# 17  Association tests

The statistical models studied up to now are appropriate for continuous outcomes. We have not yet discussed inference for binary, categorical, and ordinal data. To give a very specific example, we will consider a case study examining funding success rates in the Netherlands, by gender.

## 17.1 Case study: funding success rates

A 2014 PNAS paper1 analyzed success rates from funding agencies in the Netherlands and concluded that their:

> results reveal gender bias favoring male applicants over female applicants in the prioritization of their "quality of researcher" (but not "quality of proposal") evaluations and success rates, as well as in the language use in instructional and evaluation materials.

The main evidence for this conclusion comes down to a comparison of the percentages. Table S1 in the paper includes the information we need. Here are the three columns showing the overall outcomes:

library(tidyverse)
library(dslabs)
research_funding_rates |> select(discipline, applications_total, success_rates_total)
#>           discipline applications_total success_rates_total
#> 1  Chemical sciences                122                26.2
#> 2  Physical sciences                174                20.1
#> 3            Physics                 76                26.3
#> 4         Humanities                396                16.4
#> 5 Technical sciences                251                17.1
#> 6  Interdisciplinary                183                15.8

We have these values for each gender:

names(research_funding_rates)
#>   "discipline"          "applications_total"  "applications_men"
#>   "applications_women"  "awards_total"        "awards_men"
#>   "awards_women"        "success_rates_total" "success_rates_men"
#>  "success_rates_women"

We can compute the totals that were successful and the totals that were not as follows:

totals <- research_funding_rates |>
select(-discipline) |>
summarize_all(sum) |>
summarize(yes_men = awards_men,
no_men = applications_men - awards_men,
yes_women = awards_women,
no_women = applications_women - awards_women) 

So we see that a larger percent of men than women received awards:

totals |> summarize(percent_men = yes_men/(yes_men+no_men),
percent_women = yes_women/(yes_women+no_women))
#>   percent_men percent_women
#> 1       0.177         0.149

But could this be due just to random variability? Here we learn how to perform inference for this type of data.

## 17.2 Lady Tasting Tea

R.A. Fisher2 was one of the first to formalize hypothesis testing. The "Lady Tasting Tea" is one of the most famous examples.

The story is as follows: an acquaintance of Fisher’s claimed that she could tell if milk was added before or after tea was poured. Fisher was skeptical. He designed an experiment to test this claim. He gave her four pairs of cups of tea: one with milk poured first, the other after. The order was randomized. The null hypothesis here is that she is guessing. Fisher derived the distribution for the number of correct picks on the assumption that the choices were random and independent.

As an example, suppose she picked 3 out of 4 correctly. Do we believe she has a special ability? The basic question we ask is: if the tester is actually guessing, what are the chances that she gets 3 or more correct? Just as we have done before, we can compute a probability under the null hypothesis that she is guessing. Under this null hypothesis, we can think of this particular example as picking 4 balls out of an urn with 4 blue balls (correct answers) and 4 red balls (incorrect answers). Remember, she knows that there are four cups with milk poured before the tea and four poured after, so she must pick four of each.

Under the null hypothesis that she is simply guessing, each ball has the same chance of being picked. We can then use combinations to figure out each probability. The probability of picking 3 is $$\binom{4}{3} \binom{4}{1} / \binom{8}{4} = 16/70$$. The probability of picking all 4 correct is $$\binom{4}{4} \binom{4}{0} / \binom{8}{4}= 1/70$$. Thus, the chance of observing a 3 or something more extreme, under the null hypothesis, is $$\approx 0.24$$. This is the p-value. The procedure that produced this p-value is called Fisher’s exact test and it uses the hypergeometric distribution.
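These hypergeometric probabilities can be computed directly in R with the dhyper function; a quick check of the calculation above:

```r
# Probability of guessing 3 or 4 cups correctly under the null hypothesis:
# drawing 4 balls from an urn with 4 blue (correct) and 4 red (incorrect)
sum(dhyper(3:4, m = 4, n = 4, k = 4))   # equals 17/70, about 0.243
```
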

## 17.3 Two-by-two tables

The data from the experiment is usually summarized by a table like this:

tab <- matrix(c(3,1,1,3),2,2)
rownames(tab)<-c("Poured Before","Poured After")
colnames(tab)<-c("Guessed before","Guessed after")
tab
#>               Guessed before Guessed after
#> Poured Before              3             1
#> Poured After               1             3

This is referred to as a two-by-two table. For each of the four combinations one can get with a pair of binary variables, it shows the observed counts.

The function fisher.test performs the inference calculations above:

fisher.test(tab, alternative = "greater")$p.value
#>  0.243

## 17.4 Chi-square test

Notice that, in a way, our funding rates example is similar to the Lady Tasting Tea. However, in the Lady Tasting Tea example, the number of blue and red beads is experimentally fixed and the number of answers given for each category is also fixed. This is because Fisher made sure there were four cups with milk poured before the tea and four cups with milk poured after, and the lady knew this, so her answers also had to include four befores and four afters. If this is the case, the sums of the rows and the sums of the columns are fixed. This defines constraints on the possible ways we can fill the two-by-two table, and it also permits us to use the hypergeometric distribution. In general, this is not the case. Nonetheless, there is another approach, the Chi-squared test, which is described below.

Imagine we have 290, 1,345, 177, and 1,011 applicants; some are men and some are women, and some get funded, whereas others don't. We saw that the success rates for men and women were:

totals |> summarize(percent_men = yes_men/(yes_men + no_men),
  percent_women = yes_women/(yes_women + no_women))
#>   percent_men percent_women
#> 1       0.177         0.149

respectively. Would we see this again if we randomly assigned funding at the overall rate?

rate <- with(totals, (yes_men + yes_women))/sum(totals)
rate
#>  0.165

The Chi-square test answers this question. The first step is to create the two-by-two data table:

two_by_two <- with(totals, data.frame(awarded = c("no", "yes"),
  men = c(no_men, yes_men),
  women = c(no_women, yes_women)))
two_by_two
#>   awarded  men women
#> 1      no 1345  1011
#> 2     yes  290   177

The general idea of the Chi-square test is to compare this two-by-two table to what we expect to see under the null hypothesis, which would be:

with(totals, data.frame(awarded = c("no", "yes"),
  men = (no_men + yes_men) * c(1 - rate, rate),
  women = (no_women + yes_women) * c(1 - rate, rate)))
#>   awarded  men women
#> 1      no 1365  991
#> 2     yes  270  197

We can see that more men than expected and fewer women than expected received funding. However, under the null hypothesis these observations are random variables. The Chi-square test tells us how likely it is to see a deviation this large or larger. This test uses an asymptotic result, similar to the CLT, related to sums of independent binary outcomes. The R function chisq.test takes a two-by-two table and returns the results of the test:

chisq_test <- chisq.test(two_by_two[, -1])

We see that the p-value is 0.0509:

chisq_test$p.value
#>  0.0509
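To make the role of the expected table concrete, here is a sketch that computes the Chi-square statistic by hand from the observed counts. Note that for two-by-two tables chisq.test applies Yates' continuity correction by default, which is why its p-value (0.0509) is slightly larger than the uncorrected one computed here:

```r
# Observed counts: rows are no/yes, columns are men/women
observed <- matrix(c(1345, 290, 1011, 177), nrow = 2,
                   dimnames = list(awarded = c("no", "yes"),
                                   gender = c("men", "women")))
rate <- sum(observed["yes", ]) / sum(observed)      # overall funding rate
# Expected counts: each gender funded at the overall rate
expected <- outer(c(1 - rate, rate), colSums(observed))
chi2 <- sum((observed - expected)^2 / expected)     # uncorrected statistic
pchisq(chi2, df = 1, lower.tail = FALSE)            # about 0.045
```
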

## 17.5 Generalized linear models

We presented a way to perform hypothesis testing for determining if there is an association between two binary outcomes. But we have not yet described how to quantify effects. Can we estimate the effect of being a woman on funding success in the Netherlands? Note that if our outcomes are binary, then the linear models presented in Chapter 16 are not appropriate because the $$\beta$$s and $$\varepsilon$$ are continuous. However, an adaptation of these methods, widely used in, for example, medical studies, gives us a way to estimate effects along with their standard errors.

The idea is to model a transformation of the expected value of the outcomes with a linear model. The transformation is selected so that any continuous value is possible. The mathematical equation for a model with one variable looks like this:

$g\{\mbox{E}(Y_i)\} = \beta_0 + \beta_1 x_i$

To finish describing the model, we impose a distribution on $$Y$$, such as binomial or Poisson. These are referred to as _generalized linear models_ (GLMs).

We demonstrate with the funding rates example. We define $$Y_i$$ to be 1 if person $$i$$ received funding and 0 otherwise, and $$x_i$$ to be 1 if person $$i$$ is a woman and 0 if a man. For these data, the expected value of $$Y_i$$ is the probability of funding for person $$i$$: $$\mbox{Pr}(Y_i=1)$$. We assume the outcomes $$Y_i$$ are binomial with $$N=1$$ and probability $$p_i$$. For binomial data, the most widely used transformation is the logit function $$g(p) = \log \{p / (1-p)\}$$, which takes numbers between 0 and 1 to any continuous number. The model looks like this:

$\log \frac{\mbox{Pr}(Y_i=1)}{1-\mbox{Pr}(Y_i=1)} = \beta_0 + \beta_1 x_i$
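Base R provides the logit and its inverse as qlogis and plogis; a small illustration of the transformation, using the two success rates from the case study:

```r
p <- c(0.149, 0.177)   # the observed success rates for women and men
qlogis(p)              # logit: maps (0, 1) to the whole real line
plogis(qlogis(p))      # the inverse transformation recovers p
```
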

### 17.5.1 The odds ratio

To understand how $$\beta_1$$ can be used to quantify the effect of being a woman on success rates, first note that $$\mbox{Pr}(Y_i=1)/\{1-\mbox{Pr}(Y_i=1)\} = \mbox{Pr}(Y_i=1)/\mbox{Pr}(Y_i=0)$$ is the odds of person $$i$$ getting funding: the ratio of the probability of success to the probability of failure. This implies that $$e^{\beta_0}$$ is the odds for men and $$e^{\beta_0}e^{\beta_1}$$ is the odds for women, so $$e^{\beta_1}$$ is the odds for women divided by the odds for men. This quantity is called the odds ratio. To see this, note that if we use $$p_1$$ and $$p_0$$ to denote the probability of success for women and men, respectively, then $$e^{\beta_1}$$ can be rewritten as

$e^{\beta_1} = \frac{p_1}{1-p_1} \, / \, \frac{p_0}{1-p_0}$

$$\beta_1$$ therefore quantifies the log odds ratio.
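As a check on this interpretation, we can compute the two odds and their ratio directly from the totals computed earlier in the chapter:

```r
# Counts from the funding totals: 290 of 1,635 men and 177 of 1,188 women awarded
odds_men   <- (290/1635) / (1 - 290/1635)   # equals 290/1345
odds_women <- (177/1188) / (1 - 177/1188)   # equals 177/1011
odds_women / odds_men                       # odds ratio, about 0.81
```
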

Now how do we estimate these parameters? Although the details are not described in this book, least squares is no longer an optimal way of estimating the parameters; instead we use an approach called maximum likelihood estimation (MLE). More advanced mathematical derivations show that a version of the central limit theorem applies and that the estimates obtained this way are approximately normal when the number of observations is large. The theory also provides a way to calculate standard errors for the estimates of the $$\beta$$s.

### 17.5.2 Fitting the model

To obtain the maximum likelihood estimates using R, we can use the glm function with the family argument set to binomial. This defaults to using the logit transformation. Note that we do not have the individual-level data, but because our model assumes the probability of success is the same for all women and for all men, the number of successes can be modeled as binomial with $$N_1$$ trials and probability $$p_1$$ for women, and binomial with $$N_0$$ trials and probability $$p_0$$ for men, with $$N_1$$ and $$N_0$$ the total numbers of women and men. In this case, the glm function is used like this:

success <- with(totals, c(yes_men, yes_women))
failure <- with(totals, c(no_men, no_women))
gender <- factor(c("men", "women"))
fit <- glm(cbind(success, failure) ~ gender, family = "binomial")
coefficients(summary(fit))
#>             Estimate Std. Error z value  Pr(>|z|)
#> (Intercept)   -1.534     0.0647   -23.7 3.83e-124
#> genderwomen   -0.208     0.1041    -2.0  4.54e-02
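To move the estimates to the odds scale, we exponentiate the coefficients of the fit above:

```r
exp(coefficients(fit))   # odds for men (intercept) and the odds ratio for women
#> (Intercept) genderwomen 
#>       0.216       0.812
```
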

The estimate of the odds ratio is $$e^{-0.208} \approx 0.812$$, which is interpreted as the odds being about 19% lower for women compared to men. But is this due to chance? We already noted that the p-value is about 0.05, but the GLM approach also permits us to compute confidence intervals, using the confint function. To show the interval for the more interpretable odds ratio, we simply exponentiate:

exp(confint(fit, 2))
#>  2.5 % 97.5 %
#>  0.661  0.995
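The confint function profiles the likelihood, which is why this interval differs slightly from the simpler Wald interval built from the printed estimate and standard error:

```r
# Wald 95% interval on the odds ratio scale, using the printed
# coefficient (-0.208) and standard error (0.104) from the summary
exp(-0.208 + c(-1, 1) * qnorm(0.975) * 0.104)   # about 0.66 to 1.00
```
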

We have used a simple version of GLMs in which the only variable is binary. However, the method can be expanded to use multiple variables, including continuous ones, although in those contexts the log odds ratio interpretation becomes more complex. Also note that we have shown just one version of GLM, appropriate for binomial data and using a logit transformation. This version is often referred to as logistic regression. However, GLMs can be used with other transformations and distributions. You can learn more by consulting a GLM textbook.
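As an illustration of the extension to multiple variables, here is a sketch that adds discipline as a covariate, using the per-discipline counts in research_funding_rates. The reshaping step and the variable names below are our own, not from the original analysis:

```r
library(tidyverse)
library(dslabs)
# One row per discipline-gender combination, with success and failure counts
dat <- research_funding_rates |>
  select(discipline, applications_men, awards_men,
         applications_women, awards_women) |>
  pivot_longer(-discipline, names_to = c(".value", "gender"), names_sep = "_") |>
  mutate(failures = applications - awards)
# Gender effect adjusted for discipline
fit2 <- glm(cbind(awards, failures) ~ gender + discipline,
            family = "binomial", data = dat)
```
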

### 17.5.3 Simple standard error approximation for two-by-two table odds ratio

Using glm, we can obtain estimates, standard errors, and confidence intervals for a wide range of models. To do this, it uses a rather complex algorithm. In the case of two-by-two tables, we can obtain a standard error for the log odds ratio using a simple approximation.

Suppose our two-by-two table has the following entries:

|             | Women | Men |
|-------------|-------|-----|
| Awarded     | a     | b   |
| Not awarded | c     | d   |

In this case, the odds ratio comparing women to men is simply $$\frac{a/c}{b/d} = \frac{ad}{bc}$$. We can confirm that we obtain the same estimate as when using glm:

or <- with(two_by_two, (women[2]/sum(women)) / (women[1]/sum(women)) /
  ((men[2]/sum(men)) / (men[1]/sum(men))))
c(log(or), fit$coef[2])
#>             genderwomen 
#>      -0.208      -0.208

Statistical theory tells us that when all four entries of the two-by-two table are large enough, the log odds ratio is approximately normal with standard error

$\sqrt{1/a + 1/b + 1/c + 1/d}$

This implies that a 95% confidence interval for the log odds ratio can be formed by:

$\log\left(\frac{ad}{bc}\right) \pm 1.96 \sqrt{1/a + 1/b + 1/c + 1/d}$

By exponentiating these two numbers we can construct a confidence interval for the odds ratio. Using R, we can compute this confidence interval as follows:

se <- two_by_two |> select(-awarded) |>
  summarize(se = sqrt(sum(1/men) + sum(1/women))) |>
  pull(se)
exp(log(or) + c(-1, 1) * qnorm(0.975) * se)
#>  0.662 0.996

Note that 1 is not included in the confidence interval, which implies that the p-value is smaller than 0.05. We can confirm this using:

2*(1 - pnorm(abs(log(or)), 0, se))
#>  0.0454

Note that the p-values obtained with chisq.test, glm, and this simple approximation are all slightly different. This is because they are based on different approximations.

## 17.6 Large samples, small p-values

As mentioned earlier, reporting only p-values is not an appropriate way to report the results of a data analysis. In scientific journals, for example, some studies seem to overemphasize p-values. Some of these studies have large sample sizes and report impressively small p-values. Yet when one looks closely at the results, we realize the odds ratios are quite modest: barely bigger than 1. In this case, the difference may not be practically or scientifically significant. Note that the relationship between the odds ratio and the p-value is not one-to-one; it depends on the sample size. So a very small p-value does not necessarily mean a very large odds ratio.

Notice what happens to the p-value if we multiply our two-by-two table by 10, which does not change the odds ratio:

two_by_two_x_10 <- two_by_two |> select(-awarded) |>
  mutate(men = men*10, women = women*10)
chisq.test(two_by_two_x_10)$p.value
#>  2.63e-10

:::{.callout-note title="Small count correction"}
Note that the log odds ratio is not defined if any of the cells of the two-by-two table is 0. This is because if $$a$$, $$b$$, $$c$$, or $$d$$ is 0, then $$\log\left(\frac{ad}{bc}\right)$$ is either the log of 0 or has a 0 in the denominator. For this situation, it is common practice to avoid 0s by adding 0.5 to each cell. This is referred to as the Haldane–Anscombe correction and has been shown, both in practice and in theory, to work well.
:::
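A small sketch with hypothetical counts shows both the problem and the correction:

```r
# Hypothetical two-by-two counts with a zero cell
tab0 <- matrix(c(0, 10, 5, 8), nrow = 2)
log(tab0[1, 1] * tab0[2, 2] / (tab0[1, 2] * tab0[2, 1]))
# -Inf: the log odds ratio is undefined

tab1 <- tab0 + 0.5   # Haldane-Anscombe: add 0.5 to every cell
log(tab1[1, 1] * tab1[2, 2] / (tab1[1, 2] * tab1[2, 1]))
# now finite
```
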

## 17.7 Exercises

1. A famous athlete has an impressive career, winning 70% of her 500 career matches. However, this athlete gets criticized because in important events, such as the Olympics, she has a losing record of 8 wins and 9 losses. Perform a Chi-square test to determine if this losing record can be simply due to chance as opposed to not performing well under pressure.

2. Why did we use the Chi-square test instead of Fisher’s exact test in the previous exercise?

1. It actually does not matter, since they give the exact same p-value.
2. Fisher’s exact and the Chi-square are different names for the same test.
3. Because the sum of the rows and columns of the two-by-two table are not fixed so the hypergeometric distribution is not an appropriate assumption for the null hypothesis. For this reason, Fisher’s exact test is rarely applicable with observational data.
4. Because the Chi-square test runs faster.

3. Compute the odds ratio of “losing under pressure” along with a confidence interval.

4. Notice that the p-value is larger than 0.05 but the 95% confidence interval does not include 1. What explains this?

1. We made a mistake in our code.
2. These are based on t-statistics so the connection between p-value and confidence intervals does not apply.
3. Different approximations are used for the p-value and the confidence interval calculation. If we had a larger sample size the match would be better.
4. We should use the Fisher exact test to get confidence intervals.

5. Multiply the two-by-two table by 2 and see if the p-value and confidence interval are a better match.

6. During the 2016 US presidential election, then-candidate Donald J. Trump used his Twitter account as a way to communicate with potential voters. Todd Vaziri hypothesized that "Every non-hyperbolic tweet is from iPhone (his staff). Every hyperbolic tweet is from Android (from him)." We will test this hypothesis using association tests. The following code converts the tweets to a table with the counts for several sentiments from each source (Android or iPhone):

library(tidyverse)
library(dslabs)
library(tidytext)

nrc <- get_sentiments("nrc") |> select(word, sentiment)
links <- "https://t.co/[A-Za-z\\d]+|&amp;"  # regex matching tweet links (assumed definition; not shown in this excerpt)
android_iphone <- trump_tweets |>
  extract(source, "source", "Twitter for (.*)") |>
  filter(source %in% c("Android", "iPhone") &
    created_at >= ymd("2015-06-17") &
    created_at < ymd("2016-11-08")) |>
  filter(!is_retweet) |>
  arrange(created_at) |>
  mutate(text = str_replace_all(text, links, "")) |>
  unnest_tokens(word, text) |>
  filter(!word %in% stop_words$word & !str_detect(word, "^\\d+$")) |>
  mutate(word = str_replace(word, "^'", "")) |>
  filter(!word %in% stop_words$word)

sentiment_counts <- android_iphone |>
left_join(nrc, by = "word", relationship = "many-to-many") |>
count(source, sentiment) |>
pivot_wider(names_from = "source", values_from = "n") |>
mutate(sentiment = replace_na(sentiment, replace = "none"))

Compute an odds ratio comparing Android to iPhone for each sentiment and add it to the table.

7. Compute a 95% confidence interval for each odds ratio.

8. Generate a plot showing the estimated odds ratios along with their confidence intervals.

9. Test the null hypothesis that there is no difference between tweets from Android and iPhone and report the sentiments with p-values less than 0.05 and more likely to come from Android.

10. For each sentiment, find the words assigned to that sentiment, keep words that appear at least 25 times, compute the odd ratio for each, and show a barplot for those with odds ratio larger than 2 or smaller than 1/2.

1. http://www.pnas.org/content/112/40/12349.abstract↩︎

2. https://en.wikipedia.org/wiki/Ronald_Fisher↩︎