5 Foundations of Statistical Inference
Most data we work with is affected by chance, whether it comes from a random sample, is subject to measurement error, or reflects outcomes that are inherently random. Being able to quantify the uncertainty introduced by this randomness is one of the most important responsibilities of a data analyst. Statistical inference provides a framework, along with practical tools, to accomplish this. The first step is learning how to mathematically describe random variables. In this chapter, we introduce random variables and their key properties, beginning with simple examples from games of chance.
5.1 Random variables
Statistical inference begins with the concept of a random variable, a numeric quantity whose value depends on the outcome of a random process. Random variables connect probability theory to data analysis by giving us a way to describe uncertainty mathematically. Once we define a random variable, we can study its distribution, compute its expected value and variability, and use it to make probability-based statements about future or unseen outcomes.
In this chapter, we focus on cases in which the probability distribution of the random variable is completely known. These are idealized settings, such as games of chance, where the probabilities are determined by the rules of the game rather than by data. Working with these simple, controlled examples allows us to understand the mathematical foundations of probability, expected values, standard errors, and sampling distributions.
In later chapters, we will turn to real-world data problems, where the underlying distributions are not known. In those cases, we use statistical inference to estimate or approximate these distributions from data. The probability concepts introduced here provide the theoretical foundation for those inferential methods.
To start, consider a simple discrete example. Suppose we draw one bead at random from an urn containing red and blue beads. Define the random variable as:
\[ X = \begin{cases} 1 & \text{if the bead is blue},\\ 0 & \text{if the bead is red.} \end{cases} \]
In R, we can simulate this process:
Each time we draw a bead, the value of \(X\) may change because the outcome is random. A variable like this, which takes only the values 0 and 1, is called a Bernoulli random variable. Bernoulli trials are the building blocks for many statistical models, since many outcomes, such as success/failure, yes/no, or heads/tails, can be represented in this way.
Note that not all random variables are discrete. Some can take on a continuum of values. For example, the height of a randomly selected person or the result of a physical measurement can be viewed as a continuous random variable. We can simulate such a variable using the normal distribution:
x <- rnorm(10, mean = 70, sd = 3)Here each value of x represents one realization of a random variable drawn from a normal distribution with mean 70 and standard deviation 3. If we were to repeat the simulation, we would obtain slightly different numbers each time, reflecting the inherent randomness of the process.
These two examples, a discrete Bernoulli variable and a continuous normal variable, illustrate the main types of random variables we will study in this chapter. Understanding their behavior and summarizing their distributions are the key steps that lead us toward the ideas of expected value, variability, and, ultimately, statistical inference.
5.2 Sampling models
Many data generation procedures can be effectively modeled as draws from an urn. For instance, we can model the process of polling likely voters as drawing 0s (Republicans) and 1s (Democrats) from an urn containing the 0 and 1 codes for all likely voters. In epidemiological studies, we often assume that the subjects in our study are a random sample from the population of interest. The data related to a specific outcome can be modeled as a random sample from an urn containing the outcomes for the entire population of interest. Similarly, in experimental research, we often assume that the individual organisms we are studying, for example worms, flies, or mice, are a random sample from a larger population. Randomized experiments can be modeled by draws from an urn, reflecting the way individuals are assigned into group; when getting assigned, individuals draw their group at random. Sampling models are therefore ubiquitous in data analysis. Casino games offer a plethora of real-world cases in which sampling models are used to answer specific questions. We will therefore start with these examples.
Suppose a very small casino hires you to consult on whether they should set up roulette wheels. To keep the example simple, we will assume that 1,000 people will play, and that the only game available on the roulette wheel is to bet on red or black. The casino wants you to predict how much money they will make or lose. They want a range of values and, in particular, they want to know what’s the chance of losing money. If this probability is too high, they will decide against installing roulette wheels.
We are going to define a random variable \(S_n\) that will represent the casino’s total winnings after \(n\) games. Let’s start by constructing the urn. A roulette wheel has 18 red pockets, 18 black pockets and 2 green ones. So playing a color in one game of roulette is equivalent to drawing from this urn:
The 1,000 outcomes from people playing are independent draws from this urn. If red comes up, the gambler wins, and the casino loses a dollar, resulting in the observed random variable being -$1. Otherwise, the casino wins a dollar, and the random variable is $1. To construct our the 1,000 outcomes of the random variable \(X\) we can use this code:
Note that the code above is shown for educational purposes only because we know the proportions of 1s and -1s, we can generate the draws with one line of code, without defining color:
We call this a sampling model, as it involves modeling the random behavior through the sampling of draws from an urn. The total winnings \(S_n\) is simply the sum of these 1,000 independent draws:
Thinking in terms of a sampling model is a powerful way to approach data analysis when uncertainty is involved. It provides a conceptual link between the random variables we study in probability and the datasets we encounter in practice. By imagining data as random draws from a well-defined process—an urn, a population, or an experiment—we gain clarity about what our models represent and the assumptions they depend on.
This perspective helps us interpret statistical procedures more thoughtfully. Many standard methods, such as confidence intervals, hypothesis tests, and regression models, often rely on assumptions that stem from a sampling models. Understanding these assumptions allows us to apply such methods in a principled way, recognize when they may fail, and develop models that more accurately reflect the uncertainty inherent in real data.
With this foundation, we can now turn to simple but instructive examples, like casino games, where the sampling model is fully known.
In this section, we use casino games as examples because their sampling models are completely known. The probabilities in these settings are defined by clear physical mechanisms,such as the number of red, black, and green pockets on a roulette wheel, which can be represented precisely as draws from an urn. These examples let us focus on the key ideas of probability, randomness, and sampling without the complications that arise in real-world data.
However, it is important to begin connecting these examples to practical applications. In real data analysis, the urn often represents a population, with each individual corresponding to one bead and each bead having a value associated with it, such as height, income, or health outcome. When we collect data, we are effectively drawing a random sample from this population urn.
Thinking this way helps develop statistical intuition. It reminds us that every dataset represents a sample from some underlying process, and that the assumptions behind our models and methods, such as independence and random sampling, are rooted in this conceptual framework. Building this habit early makes it easier to reason clearly about uncertainty and to apply statistical methods thoughtfully in practice.
5.3 The probability distribution of a random variable
If you rerun the code above, you see that s changes every time. This is, of course, because \(S_n\) is a random variable. The probability distribution of a random variable informs us about the probability of the observed value falling in any given interval. For example, if we want to know the probability that we lose money, we are asking the probability that \(S_n\) is in the interval \((-\infty,0)\).
Keep in mind that if we can define a cumulative distribution function \(F(a) = \mathrm{Pr}(S_n\leq a)\), then we will be able to answer any question related to the probability of events defined by our random variable \(S_n\), including the event \(S_n<0\). We call this \(F\) the random variable’s distribution function.
We can estimate the distribution function for the random variable \(S\) by using a Monte Carlo simulation to generate many realizations of the random variable. With this code, we run the experiment of having 1,000 people repeatedly play roulette, specifically \(B = 10,000\) times:
Now, we can ask the following: in our simulation, how often did we get sums less than or equal to a?
mean(s <= a)This will be a very good approximation of \(F(a)\), allowing us to easily answer the casino’s question “how likely is it that we will lose money?”. We can see it is quite low:
mean(s < 0)
#> [1] 0.0483We can visualize the distribution of \(S\) by creating a histogram showing the probability \(F(b)-F(a)\) for several intervals \((a,b]\):

We see that the distribution appears to be approximately normal. A qqplot will confirm that the normal approximation is close to a perfect approximation for this distribution.
As we have learned, if the distribution is normal, all we need to define it are the average and the standard deviation. Since we have the original values from which the distribution is generated, we can easily compute these with mean(s) and sd(s). The blue curve added to the histogram above is a normal density with this average and standard deviation.
This average and this standard deviation have special names; they are referred to as the expected value and standard error of the random variable \(S_n\). More details on these concepts will be provided in the next section.
Statistical theory offers a method to derive the distribution of random variables defined as the sum of independent random draw of numbers from an urn. Specifically, in our example above, we can demonstrate that the number of successes, \((S+n)/2\), follows a binomial distribution. We therefore do not need to run Monte Carlo simulations to determine the probability distribution of \(S\). The simulations were conducted for illustrative purposes.
We can use the function dbinom and pbinom to compute the probabilities exactly. For example, to compute \(\mathrm{Pr}(S_n < 0)\), we note that:
\[\mathrm{Pr}(S_n < 0) = \mathrm{Pr}((S_n + n)/2 < n/2) = \mathrm{Pr}((S_n + n)/2 \leq n/2 - 1)\]
and we can use the pbinom to compute \[\mathrm{Pr}(S_n \leq 0)\]:
pbinom(n/2 - 1, size = n, prob = 10/19)
#> [1] 0.0448For the details of the binomial distribution, you can consult any basic probability book or even Wikipedia1.
We do not delve into these details here. Instead, we will explore a useful theorem provided by mathematical theory, which generally applies to sums and averages of draws from any urn: the Central Limit Theorem (CLT).
5.4 Distributions versus probability distributions
Before we continue, let’s establish an important distinction and connection between the distribution of a list of numbers and a probability distribution. As explained in Chapter 1, any list of numbers \(x_1,\dots,x_n\) has a distribution. Given their usefulness as summaries when the distribution is approximately normal, we also define the average and standard deviation. These are determined with a straightforward operation involving the vector containing the list of numbers, denoted as x:
A random variable \(X\) is associated with a probability distribution, which describes the likelihood of different outcomes. This distribution is a theoretical concept, it does not depend on having a list of numbers or data in hand. As seen above, we define the distribution function as \(F(a) = \Pr(X \leq a)\), which answers the question: What is the probability that \(X\) takes a value less than or equal to \(a\)?
If \(X\) represents the result of drawing a number at random from an urn, then the list of numbers inside the urn defines the possible outcomes. The probability distribution of \(X\) corresponds to the relative frequencies of these numbers. The average and standard deviation of the numbers in the urn define the expected value and standard error of the random variable. The practical use of these definitions become aparent later in this chapter.
Another way to visualize this idea is through simulation. We can generate a large number of realizations of \(X\) using a Monte Carlo simulation, producing a long list of outcomes. The distribution of this simulated list provides an increasingly accurate approximation of the true probability distribution of \(X\). As the number of simulated draws grows, the sample average and standard deviation of the list converge to the expected value and standard error of the random variable.
5.5 Notation for random variables
In statistical textbooks, upper case letters denote random variables, and we will adhere to this convention. Lower case letters are used for observed values. You will see some notation that include both. For example, you will see events defined as \(X \leq x\). Here \(X\) is a random variable and \(x\) is an arbitrary value and not random. So, for example, \(X\) might represent the number on a die roll and \(x\) will represent an actual value we see: 1, 2, 3, 4, 5, or 6. In this case, the probability of \(X=x\) is 1/6 regardless of the observed value \(x\).
This notation may seem a bit strange because when we inquire about probability, \(X\) is not an observed quantity; it’s a random quantity that we will encounter in the future. We can discuss what we expect \(X\) to be, what values are probable, but we can’t discuss what value \(X\) is. Once we have data, we do see a realization of \(X\). Therefore, data analysts often speak of what could have been after observing what actually happened.
5.6 The expected value and standard error
We have described sampling models for draws. We will now review the mathematical theory that allows us to approximate the probability distributions for the sum of draws. Once we do this, we will be able to help the casino predict how much money they will make. The same approach we use for the sum of draws will be useful for describing the distribution of averages and proportion, which we will need to understand how polls work.
The first key concept to understand is the expected value. This quantity represents the long-run average outcome of a random variable after many repetitions of the underlying random process. Inq other words, if we could observe the random variable \(X\) over and over again, its average value would approach the expected value.
In statistics textbooks, the expected value is commonly denoted as \(\mu_X\) or as \(\mathrm{E}[X]\), both meaning “the expected value of the random variable \(X\).”
A random variable will vary around its expected value in a manner that if you take the average of many, many draws, the average will approximate the expected value. This approximation improves as you take more draws, making the expected value a useful quantity to compute.
For discrete random variable with possible outcomes \(x_1,\dots,x_n\), the expected value is defined as:
\[ \mathrm{E}[X] = \sum_{i=1}^n x_i \,\mathrm{Pr}(X = x_i) \] If \(X\) is a continuous random variable with a range of values \(a\) to \(b\) and a probability density function \(f(x)\), this sum becomes an integral:
\[ \mathrm{E}[X] = \int_a^b x f(x)\, dx \]
Note that in the case that we are picking values from an urn, and each value \(x_i\) has an equal chance \(1/n\) of being selected, the above equation is simply the average of the \(x_i\)s.
\[ \mathrm{E}[X] = \frac{1}{n}\sum_{i=1}^n x_i \]
In the urn used to model betting on red in roulette, we have 20 one-dollar bills and 18 negative one-dollar bills, so the expected value is:
\[ \mathrm{E}[X] = (20 + -18)/38 = 1/19 \]
which is about 5 cents. You might consider it a bit counterintuitive to say that \(X\) varies around 0.05 when it only takes the values 1 and -1. One way to make sense of the expected value in this context is by realizing that, if we play the game over and over, the casino wins, on average, 5 cents per game. A Monte Carlo simulation confirms this:
In general, if the urn has two possible outcomes, say \(a\) and \(b\), with proportions \(p\) and \(1-p\) respectively, the average is:
\[\mathrm{E}[X] = ap + b(1-p)\]
To confirm this, observe that if there are \(n\) beads in the urn, then we have \(np\) \(a\)s and \(n(1-p)\) \(b\)s, and because the average is the sum, \(n\times a \times p + n\times b \times (1-p)\), divided by the total \(n\), we get that the average is \(ap + b(1-p)\).
Now, the reason we define the expected value is because this mathematical definition turns out to be useful for approximating the probability distributions of sums. This, in turn, is useful for describing the distribution of averages and proportions. The first useful fact is that the expected value of the sum of the draws is the number of draws \(\times\) the average of the numbers in the urn.
Therefore, if 1,000 people play roulette, the casino expects to win, on average, about 1,000 \(\times\) $0.05 = $50.
However, this is an expected value. How different can one observation be from the expected value? The casino really needs to know this. What is the range of possibilities? If negative numbers are too likely, they will not install roulette wheels. Statistical theory once again answers this question.
The standard error (SE) gives us an idea of the size of the variation around the expected value. In statistics books, it’s common to use \(\sigma_X\) or \(\mathrm{SE}[X]\) to denote the standard error of a random variable.
Statistical textbooks often use the Greek letters \(\mu\) and \(\sigma\) as shorthand for the expected value and the standard error, respectively. The notation comes from the fact that \(\mu\) is the Greek equivalent of the letter m, the first letter of mean, which is another term for expected value. Likewise, \(\sigma\) corresponds to the letter s, the first letter of standard deviation (and by extension, standard error). This convention helps keep mathematical expressions concise and consistent across statistical formulas.
In this book, however, we prefer the \(\mathrm{E}[X]\) and \(\mathrm{SE}[X]\) notation because it makes it clearer when we express expectations and standard errors of sums or other mathematical transformations of random variables. For example, when considering the sum of two random variables, we can write \(\mathrm{E}[X + Y]\) directly.
For discrete random variable with possible outcomes \(x_1,\dots,x_n\), the standard error is defined as:
\[ \mathrm{SE}[X] = \sqrt{\sum_{i=1}^n \left(x_i - E[X]\right)^2 \,\mathrm{Pr}(X = x_i)}, \] which you can think of as the expected average distance of \(X\) from the expected value.
If \(X\) is a continuous random variable, with range of values \(a\) to \(b\) and probability density function \(f(x)\), this sum becomes integral:
\[ \mathrm{SE}[X] = \sqrt{\int_a^b \left(x-\mathrm{E}[X]\right)^2 f(x)\,\mathrm{d}x} \]
In many statistics textbooks, you will see the variance of a random variable defined as the square of its standard error:
\[ \mathrm{SE}[X] = \sqrt{\mathrm{Var}[X]}. \]
The variance is often used in mathematical derivations because it simplifies algebra by avoiding square roots. However, the standard error is more useful in practice, since it measures variability in the same units as the random variable itself. For example, if \(X\) represents height in inches, then \(\mathrm{SE}[X]\) is also measured in inches, making it easier to interpret.
Note that in the case that we are picking values from an un urn where each value \(x_i\) has an equal chance \(1/n\) of being selected, the above equation is simply the standard deviation of of the \(x_i\)s.
\[ \mathrm{SE}[X] = \sqrt{\frac{1}{n}\sum_{i=1}^n (x_i - \mathrm{E}[X])^2} \mbox{ with } \mathrm{E}[X] = \frac{1}{n}\sum_{i=1}^n x_i \] Using the definition of standard deviation, we can derive, with a bit of math, that if an urn contains two values \(a\) and \(b\) with proportions \(p\) and \((1-p)\), respectively, the standard deviation is:
\[| b - a | \sqrt{p(1-p)}.\]
So in our roulette example, the standard deviation of the values inside the urn is \(\mid 1 - (-1) \mid \sqrt{10/19 \times 9/19}\) or:
2*sqrt(90)/19
#> [1] 0.999The standard error tells us the typical difference between a random variable and its expectation. Since one draw is obviously the sum of just one draw, we can use the formula above to calculate that the random variable defined by one draw has an expected value of 0.05 and a standard error of about 1. This makes sense since we obtain either 1 or -1, with 1 slightly favored over -1.
A widely used mathematical result is that if our draws are independent, then the standard error of the sum is given by the equation:
\[ \sqrt{\mbox{number of draws}} \times \mbox{ standard deviation of the numbers in the urn} \]
Using this formula, the sum of 1,000 people playing has standard error of about $32:
As a result, when 1,000 people bet on red, the casino is expected to win $50 with a standard error of $32. It therefore seems like a safe bet to install more roulette wheels. But we still haven’t answered the question: How likely is the casino to lose money? The CLT will help in this regard.
The exact probability for the casino winnings can be computed precisely, rather than approximately, using the binomial distribution. However, here we focus on the CLT, which can be applied more broadly to sums of random variables in a way that the binomial distribution cannot.
5.7 Central Limit Theorem
The Central Limit Theorem (CLT) tells us that when the number of draws, also called the sample size, is large, the probability distribution of the sum of the independent draws is approximately normal. Given that sampling models are used for so many data generation processes, the CLT is considered one of the most important mathematical insights in history.
Previously, we discussed that if we know that the distribution of a list of numbers is approximated by the normal distribution, all we need to describe the list are the average and standard deviation. We also know that the same applies to probability distributions. If a random variable has a probability distribution that is approximated with the normal distribution, then all we need to describe the probability distribution are the expected value and standard error.
We previously ran this Monte Carlo simulation:
The Central Limit Theorem (CLT) tells us that the sum s is approximated by a normal distribution. Using the formulas above, we know that the expected value and standard error are:
The theoretical values above match those obtained with the Monte Carlo simulation:
Using the CLT, we can skip the Monte Carlo simulation and instead compute the probability of the casino losing money using this approximation:
which is also in very good agreement with our Monte Carlo result:
mean(s < 0)
#> [1] 0.04625.7.1 How large is large in the Central Limit Theorem?
The CLT works when the number of draws is large, but large is a relative term. In many circumstances, as few as 30 draws is enough to make the CLT useful. In some specific instances, as few as 10 is enough. However, these should not be considered general rules. Note, for example, that when the probability of success is very small, much larger sample sizes are needed.
To illustrate an important limitation of the Central Limit Theorem (CLT), consider a lottery. In most government-run lotteries, the chance of winning the grand prize is extremely small, often less than 1 in a million. Although thousands or even millions of people may buy tickets, the number of winners, the sum of all successes, is typically between 0 and a few individuals.
In this situation, the sum of the draws does not follow a shape that is well approximated by a normal distribution, even with such a large number of trials. The reason is that the probability of success is so small that the distribution of the total number of wins is highly skewed. In cases like this, when we have a large number of independent trials but each has a very low probability of success, the Poisson distribution provides a much better approximation than the normal.
You can explore the properties of the Poisson distribution using dpois and ppois. You can generate random variables following this distribution with rpois. However, we won’t cover the theory here. You can learn about the Poisson distribution in any probability textbook and even Wikipedia2.
In summary, the Central Limit Theorem is one of the most powerful results in probability and statistics. It explains why the normal distribution appears so often in data analysis and why many standard methods work well in practice, even when the underlying data are not normally distributed. When the number of draws is reasonably large and no single draw dominates the total, the CLT provides an accurate and remarkably robust approximation.
That said, rules of thumb such as “30 is enough” should always be treated as guidelines rather than guarantees. The CLT works under specific conditions, and when those conditions are not met, such as when probabilities are extremely small or the underlying population distribution is highly skewed, the normal approximation can fail badly. In these cases, other distributions, like the Poisson or binomial, may be more appropriate.
Developing an intuition for when the CLT applies and when it does not is an important part of becoming a thoughtful data analyst. Recognizing the assumptions behind statistical methods and questioning whether they hold in your context is what separates mechanical computation from principled analysis.
5.8 Statistical properties of averages
There are several useful mathematical results that we used above and often employ when working with data. We list them below.
1. The expected value of the sum of random variables is the sum of each random variable’s expected value. We can write it like this:
\[ \mathrm{E}[X_1+X_2+\dots+X_n] = \mathrm{E}[X_1] + \mathrm{E}[X_2]+\dots+\mathrm{E}[X_n] \]
If \(X\) represents independent draws from the urn, then they all have the same expected value. Let’s denote the expected value with \(\mu_X\) and rewrite the equation as:
\[ \mathrm{E}[X_1+X_2+\dots+X_n]= n\mu_X \]
which is another way of writing the result we show above for the sum of draws.
2. The expected value of a non-random constant times a random variable is the non-random constant times the expected value of a random variable. This is easier to explain with symbols:
\[ \mathrm{E}[aX] = a\times\mathrm{E}[X] \]
To understand why this is intuitive, consider changing units. If we change the units of a random variable, such as from dollars to cents, the expectation should change in the same way. A consequence of the above two facts is that the expected value of the average of independent draws from the same urn is the expected value of the urn, denoted as \(\mu_X\) again:
\[ \mathrm{E}[(X_1+X_2+\dots+X_n) / n]= \mathrm{E}[X_1+X_2+\dots+X_n] / n = n\mu_X/n = \mu_X \]
3. The variance of the sum of independent random variables is the sum of variances of each random variable:
\[ \mathrm{Var}[X_1+X_2+\dots+X_n] =\mathrm{Var}[X_1] + \mathrm{Var}[X_2]+\dots+\mathrm{Var}[X_n] \] This implies that the following property for the standard error of the sum of independent random variables:
\[ \mathrm{SE}[X_1+X_2+\dots+X_n] = \sqrt{\mathrm{SE}[X_1]^2 + \mathrm{SE}[X_2]^2+\dots+\mathrm{SE}[X_n]^2 } \]
Note that this particular property is not as intuitive as the previous three and more in depth explanations can be found in statistics textbooks.
4. The standard error of a non-random constant times a random variable is the non-random constant times the random variable’s standard error. As with the expectation:
\[ \mathrm{SE}[aX] = a \times \mathrm{SE}[X] \]
To see why this is intuitive, again think of units.
A consequence of 3 and 4 is that the standard error of the average of independent draws from the same urn is the standard deviation of the urn divided by the square root of \(n\) (the number of draws), call it \(\sigma_X\):
\[ \begin{aligned} \mathrm{SE}[\bar{X}] = \mathrm{SE}[(X_1+X_2+\dots+X_n) / n] &= \mathrm{SE}[X_1+X_2+\dots+X_n]/n \\ &= \sqrt{\mathrm{SE}[X_1]^2+\mathrm{SE}[X_2]^2+\dots+\mathrm{SE}[X_n]^2}/n \\ &= \sqrt{\sigma_X^2+\sigma_X^2+\dots+\sigma_X^2}/n\\ &= \sqrt{n\sigma_X^2}/n\\ &= \sigma_X / \sqrt{n} \end{aligned} \]
The given equation reveals crucial insights for practical scenarios. Specifically, it suggests that the standard error can be minimized by increasing the sample size, \(n\), and we can quantify this reduction. However, this principle holds true only when the variables \(X_1, X_2, ... X_n\) are independent. If they are not, the estimated standard error can be significantly off.
In Section 15.2, we introduce the concept of correlation, which quantifies the degree to which variables are interdependent. If the correlation coefficient among the \(X\) variables is \(\rho\), the standard error of their average is:
\[ \mathrm{SE}\left(\bar{X}\right) = \sigma_X \sqrt{\frac{1 + (n-1) \rho}{n}} \]
The key observation here is that as \(\rho\) approaches its upper limit of 1, the standard error increases. Notably, in the situation where \(\rho = 1\), the standard error, \(\mathrm{SE}(\bar{X})\), equals \(\sigma_X\), and it becomes unaffected by the sample size \(n\).
5. If \(X\) is a normally distributed random variable, then if \(a\) and \(b\) are non-random constants, \(aX + b\) is also a normally distributed random variable. All we are doing is changing the units of the random variable by multiplying by \(a\), then shifting the center by \(b\).
5.8.1 Law of large numbers
An important implication of result 4 above is that the standard error of the average becomes smaller and smaller as \(n\) grows larger. When \(n\) is very large, then the standard error is practically 0 and the average of the draws converges to the average of the urn. This is known in statistical textbooks as the law of large numbers or the law of averages.
The law of averages is sometimes misinterpreted. For example, if you toss a coin 5 times and see a head each time, you might hear someone argue that the next toss is probably a tail because of the law of averages: on average we should see 50% heads and 50% tails. A similar argument would be to say that red “is due” on the roulette wheel after seeing black come up five times in a row. Yet these events are independent so the chance of a coin landing heads is 50%, regardless of the previous 5. The same principle applies to the roulette outcome. The law of averages applies only when the number of draws is very large and not in small samples. After a million tosses, you will definitely see about 50% heads regardless of the outcome of the first five tosses. Another funny misuse of the law of averages is in sports when TV sportscasters predict a player is about to succeed because they have failed a few times in a row.
5.9 Exercises
1. In American Roulette, you can also bet on green. There are 18 reds, 18 blacks, and 2 greens (0 and 00). What are the chances the green comes out?
2. The payout for winning on green is $17 dollars. This means that if you bet a dollar and it lands on green, you get $17. Create a sampling model using sample to simulate the random variable \(X\) for your winnings. Hint: Refer to the example below for how it should look like when betting on red.
3. Compute the expected value of \(X\).
4. Compute the standard error of \(X\).
5. Now create a random variable \(S_n\) that is the sum of your winnings after betting on green 1000 times. Hint: change the argument size and replace in your answer to exercise 2. Start your code by setting the seed to 1 with set.seed(1).
6. What is the expected value of \(S_n\)?
7. What is the standard error of \(S_n\)?
8. What is the probability that you end up winning money? Hint: Use the CLT.
9. Create a Monte Carlo simulation that generates 1,000 outcomes of \(S_n\). Compute the average and standard deviation of the resulting list to confirm the results of 6 and 7. Start your code by setting the seed to 1 with set.seed(1).
10. Now check your answer to 8 using the Monte Carlo result.
11. The Monte Carlo result and the CLT approximation are close, but not that close. What could account for this?
- 1,000 simulations is not enough. If we do more, they match.
- The CLT does not work as well when the probability of success is small. In this case, it was 1/19. If we make the number of roulette plays bigger, they will match better.
- The difference is within rounding error.
- The CLT only works for averages.
12. Now create a random variable \(Y\) that is your average winnings per bet, after playing off your winnings after betting on green 1,000 times.
13. What is the expected value of \(Y\)?
14. What is the standard error of \(Y\)?
15. What is the probability that you end up with winnings per game that are positive? Hint: Use the CLT.
16. Create a Monte Carlo simulation that generates 2,500 outcomes of \(Y\). Compute the average and standard deviation of the resulting list to confirm the results of 13 and 14. Start your code by setting the seed to 1 with set.seed(1).
17. Now compare your answer to 15 using the Monte Carlo result.
18. The Monte Carlo result and the CLT approximation are now much closer. What could account for this?
- We are now computing averages instead of sums.
- 2,500 Monte Carlo simulations is not better than 1,000.
- The CLT works better when the sample size is larger. We increased from 1,000 to 2,500.
- It is not closer. The difference is within rounding error.
The following exercises are inspired by the events surrounding the financial crisis of 2007-20083. This financial crisis was in part caused by underestimating the risk of certain securities4 sold by financial institutions. Specifically, the risks of mortgage-backed securities (MBS) and collateralized debt obligations (CDO) were grossly underestimated. These assets were sold at prices that assumed most homeowners would make their monthly payments, and the probability of this not occurring was calculated as being low. A combination of factors resulted in many more defaults than were expected, which led to a price crash of these securities. As a consequence, banks lost so much money that they required government bailouts to avoid complete closure.
19. More complex versions of the sampling models we have discussed are also used by banks to determine interest rates and insurance companies to determine premiums. To understand this, suppose you run a small bank that has a history of identifying potential homeowners that can be trusted to make payments. In fact, historically, only 2% of your customers default in a given year, meaning that they don’t pay back the money that you lent them. Suppose your bank will give out $n=$1,000 loans for $180,000 this year. Also, after adding up all costs, suppose your bank loses \(l\)=$200,000 per foreclosure. For simplicity, we assume this includes all operational costs. What is the expected profit \(S_n\) for you bank under this scenario?
20. Note that the total loss defined by the final sum in the previous exercise is a random variable. Every time you run the sampling model code, you obtain a different number of people defaulting which results in a different loss. Code a sampling model for the random variable representing your banks profit \(S_n\) under scenario described in 19.
21. The previous exercise demonstrates that if you simply loan money to everybody without interest, you will end up losing money due to the 2% that defaults. Although you know 2% of your clients will probably default, you don’t know which ones, so you can’t remove them. Yet by charging everybody just a bit extra in interest, you can make up the losses incurred due to that 2%, and also cover your operating costs. What quantity \(x\) would you have to charge each borrower so that your bank’s expected profit is 0? Assume that you don’t get \(x\) from the borrowers that default. Also, note \(x\) is not the interest rate, but the total you add meaning \(x/180000\) is the interest rate.
22. Rewrite the sample model from exercise 20 and run a Monte Carlo simulation to get an idea of the distribution of your profit when you charge interest rates.
23. We don’t actually need a Monte Carlo simulation. Based on what we have learned, the CLT informs us that, since our losses are a sum of independent draws, its distribution is approximately normal. What are the expected value and standard errors of the profit \(S_n\)? Write these as functions of the probability of foreclosure \(p\), the number of loans \(n\), the loss per foreclosure \(l\), and the quantity you charge each borrower \(x\).
24. If you set \(x\) to assure your bank breaks even (expected profit is 0), what is the probability that your bank loses money?
25. Suppose that if your bank has negative profit, it has to close. Therefore, you need to increase \(x\) to minimize this risk. However, setting the interest rates too high may lead your clients to choose another bank. So, let’s say that we want our chances of losing money to be 1 in 100. What does the \(x\) quantity need to be now? Hint: We want \(\mathrm{Pr}(S_n<0) = 0.01\). Note that you can add subtract constants to both side of an inequality, and the probability does not change: \(\mathrm{Pr}(S_n<0) = \mathrm{Pr}(S_n+k<0+k)\), Similarly, with division of positive constants: \(\mathrm{Pr}(S_n+k<0+k) = \mathrm{Pr}((S_n+k)/m <k/m)\). Use this fact and the CLT to transform the left side of the inequality in \(\mathrm{Pr}(S_n<0)\) into a standard normal.
26. Our interest rate now increases. But it is still a very competitive interest rate. For the \(x\) you obtained in 25, what is expected profit per loan and the expected total profit?
27. Run run a Monte Carlo simulation to double check the theoretical approximation used in 25 and 26.
28. One of your employees points out that, since the bank is making a profit per loan, the bank should give out more loans! Why limit it to just \(n\)? You explain that finding those \(n\) clients was hard. You need a group that is predictable and that keeps the chances of defaults low. The employee then points out that even if the probability of default is higher, as long as our expected value is positive, you can minimize your chances of losses by increasing \(n\) and relying on the law of large numbers. Suppose the default probability is twice as high, or 4%, and you set the interest rate to 5%, or \(x=\)$9,000, what is your expected profit per loan?
29. How much do we have to increase \(n\) by to assure the probability of losing money is still less than 0.01?
30. Confirm the result in exercise 29 with a Monte Carlo simulation.
31. According to this equation, giving out more loans increases your expected profit and lowers the chances of losing money! Giving out more loans seems like a no-brainier. As a result, your colleague decides to leave your bank and start his own high-risk mortgage company. A few months later, your colleague’s bank has gone bankrupt. A book is written, and eventually, the movies “The Big Short” and “Margin Call” are made, recounting the mistake your friend, and many others, made. What happened?
Your colleague’s scheme was mainly based on this mathematical formula \(\mathrm{SE}\left(\bar{X}\right) = \sigma_X / \sqrt{n}\). By making \(n\) large, we minimize the standard error of our per-loan profit. However, for this rule to hold, the \(X\)s must be independent draws: one person defaulting must be independent of others defaulting.
To construct a more realistic simulation than the original one your colleague ran, let’s assume there is a global event affecting everybody with high-risk mortgages and altering their probability simultaneously. We will assume that with a 50-50 chance all the default probabilities slightly increase or decrease to somewhere between 0.03 and 0.05. However, this change occurs universally, impacting everybody at once, not just one person. As these draws are no longer independent, our equation for the standard error of the sum of random variables does not apply. Write a Monte Carlo simulation for your total profit with this model.
32. Use the simulation results to report the expected profit, the probability of losing money, and the probability of losing more than $10,000,000. Study the distribution of profit and discuss how making the wrong assumption lead to a catastrophic result.