2  Robust summaries

Note that the heights we explored in the Chapter 1 are not the original heights reported by students. A second challenge involves exploring the original reported heights, which are also included in the dslabs package in the reported_heights object. We will see that due to errors in reporting, using robust summaries are necessary to produce useful summaries.

2.1 Outliers

We previously described how boxplots show outliers, but we did not provide a precise definition. Here we discuss outliers, approaches that can help detect them, and summaries that take into account their presence.

Outliers are very common in real-world data analysis. Data recording can be complex and it is common to observe data points generated in error. For example, an old monitoring device may read out nonsensical measurements before completely failing. Human error is also a source of outliers, in particular when data entry is done manually. For example, an individual may mistakenly enter their height in centimeters instead of inches or put the decimal in the wrong place.

How do we distinguish an outlier from measurements that were too big or too small simply due to expected variability? This is not always an easy question to answer, but we try to provide some guidance. Let’s begin with a simple case.

Suppose a colleague is charged with collecting demography data for a group of males. The data report height in feet and are stored in the object:

library(dslabs)
str(outlier_example)
#>  num [1:500] 5.59 5.8 5.54 6.15 5.83 5.54 5.87 5.93 5.89 5.67 ...

Our colleague uses the fact that heights are usually well approximated by a normal distribution and summarizes the data with average and standard deviation:

mean(outlier_example)
#> [1] 6.1
sd(outlier_example)
#> [1] 7.8

and writes a report on the interesting fact that this group of males is much taller than usual. The average height is over six feet tall! Using your data analysis skills, however, you notice something else that is unexpected: the standard deviation is over 7 feet. Adding and subtracting two standard deviations, you note that 95% of this population will have heights between -9.4892954, 21.6969354 feet, which does not make sense. A quick plot reveals the problem:

boxplot(outlier_example)

There appears to be at least one value that is nonsensical, since we know that a height of 180 feet is impossible. The boxplot detects this point as an outlier.

2.2 The median

When we have an outlier like this, the average can become very large. Mathematically, we can make the average as large as we want by simply changing one number: with 500 data points, we can increase the average by any amount \(\Delta\) by adding \(\Delta \times\) 500 to a single number. The median, defined as the value for which half the values are smaller and the other half are bigger, is robust to such outliers. No matter how large we make the largest point, the median remains the same.

With this data the median is:

median(outlier_example)
#> [1] 5.74

which is about 5 feet and 9 inches.

The median is what boxplots display as a horizontal line.

2.3 The inter quartile range (IQR)

The box in boxplots is defined by the first and third quartile. These are meant to provide an idea of the variability in the data: 50% of the data is within this range. The difference between the 3rd and 1st quartile (or 75th and 25th percentiles) is referred to as the inter quartile range (IQR). As is the case with the median, this quantity will be robust to outliers as large values do not affect it. We can do some math to see that for normally distributed data, the IQR / 1.349 approximates the standard deviation of the data had an outlier not been present. We can see that this works well in our example, since we get a standard deviation estimate of:

IQR(outlier_example) / 1.349
#> [1] 0.245

which is about 3 inches.

2.4 A data-driven definition of outliers

In R, points falling outside the whiskers of the boxplot are referred to as outliers. This definition of outlier was introduced by John Tukey. The top whisker ends at the 75th percentile plus 1.5 \(\times\) IQR. Similarly the bottom whisker ends at the 25th percentile minus 1.5\(\times\) IQR. If we define the first and third quartiles as \(Q_1\) and \(Q_3\), respectively, then an outlier is anything outside the range:

\[[Q_1 - 1.5 \times (Q_3 - Q1), Q_3 + 1.5 \times (Q_3 - Q1)].\]

When the data is normally distributed, the standard units of these values are:

q3 <- qnorm(0.75)
q1 <- qnorm(0.25)
iqr <- q3 - q1
r <- c(q1 - 1.5*iqr, q3 + 1.5*iqr)
r
#> [1] -2.7  2.7

Using the pnorm function, we see that 99.3% of the data falls in this interval.

Keep in mind that this is not such an extreme event: if we have 1,000 data points that are normally distributed, we expect to see about 7 outside of this range. But these would not be outliers since we expect to see them under the typical variation.

If we want an outlier to be rarer, we can increase the 1.5 to a larger number. Tukey also used 3 and called these far out outliers. With a normal distribution, 99.9998% of the data falls in this interval. This translates into about 2 in a million chance of being outside the range. In the geom_boxplot function, this can be controlled by the outlier.size argument, which defaults to 1.5.

The 180 inches measurement is well beyond the range of the height data:

max_height <- quantile(outlier_example, 0.75) + 3*IQR(outlier_example)
max_height
#>  75% 
#> 6.91

If we take this value out, we can see that the data is in fact normally distributed as expected:

x <- outlier_example[outlier_example < max_height]
qqnorm(x)
qqline(x)

2.5 Median absolute deviation

Another way to robustly estimate the standard deviation in the presence of outliers is to use the median absolute deviation (MAD). To compute the MAD, we first compute the median, and then for each value we compute the distance between that value and the median. The MAD is defined as the median of these distances. For technical reasons not discussed here, this quantity needs to be multiplied by 1.4826 to assure it approximates the actual standard deviation. The mad function already incorporates this correction. For the height data, we get a MAD of:

mad(outlier_example)
#> [1] 0.237

which is about 3 inches.

2.6 Exercises

We are going to use the HistData package. Load the height data set and create a vector x, consisting solely of the male heights used in Galton’s data on the heights of parents and their children, part of his historic research on heredity.

library(HistData)
x <- Galton$child

1. Compute the average and median of these data.

2. Compute the median and median absolute deviation of these data.

3. Now suppose Galton made a mistake when entering the first value and forgot to use the decimal point. You can imitate this error by typing:

x_with_error <- x
x_with_error[1] <- x_with_error[1]*10

How many inches does the average grow after this mistake?

4. How many inches does the SD grow after this mistake?

5. How many inches does the median grow after this mistake?

6. How many inches does the MAD grow after this mistake?

7. How could you use exploratory data analysis to detect that an error was made?

  1. Since it is only one value out of many, we will not be able to detect this.
  2. We would see an obvious shift in the distribution.
  3. A boxplot, histogram, or qqplot would reveal a clear outlier.
  4. A scatterplot would show high levels of measurement error.

8. How much can the average accidentally grow with mistakes like this? Write a function called error_avg that takes a value k and returns the average of the vector x after the first entry changed to k. Show the results for k=10000 and k=-10000.

9. Using the murders dataset in the dslabs package. Compute the murder rate for each state. Make a boxplot comparing the murder rates for each region of the United States.

10. For the same dataset, compute the median and IQR murder rate for each region.

11. Add a column to the reported_heights with the year the height was entered. You can use the year function in the lubridate package to extract the year from reported_heights$time_stamp). Change the height column from characters to numbers using parse_number from the readr package. Some of the heights will be converted to NA because they were incorrectly entered and include characters, for example 165cm. These heights were supposed to be reported in inches, but many clearly did not. Convert any entry below 54 or above 72 to NA using the na_if function from dplyr. Once you do this, stratify by sex and year and report the percentage of incorrectly entered heights, represented by the NA.

12. The heights we have been looking at are not the original heights reported by students. The original reported heights are also included in the dslabs package in the object reported_heights. Note that the height column in this data frame is a character, and if we try to create a new column with the numeric version:

library(tidyverse)  
reported_heights <- reported_heights |>
  mutate(original_heights = height, height = as.numeric(height))

we get a warnings about NAs. Examine the rows that result in NAs and describe why this is happening.

13. Remove the entries that result in NAs when attempting to convert heights to numbers. Compute the mean, standard deviation, median, and MAD by sex. What do you notice?

14. Generate boxplots summarizing the heights for males and females and describe what you see.

15. Look at the largest 10 heights and provide a hypothesis for what you think is happening.

16. Review all the nonsensical answers by looking at the data considered to be far out by Tukey and comment on the type of errors you see.