4 Connecting Data and Probability
In this part of the book, we use games of chance, such as rolling dice, drawing beads, and spinning a roulette wheel, to illustrate the principles of probability. These examples are valuable because the probabilistic mechanism is perfectly known. We can compute exact probabilities, repeat experiments as many times as we like, and verify the results mathematically or by simulation.
In real-world data analysis, the connection between probability and observation is rarely so clear. The systems we study, such as people, hospitals, weather, economies, and biological processes, are complex. We use probability not because these systems are truly random, but because probabilistic models provide a practical way to describe and reason about uncertainty. In this chapter, we introduce some of the ways real-world problems can be connected to probabilistic models, so you can begin to recognize these connections as you learn about probability.
4.1 Probability models in the real world
Today, probability underlies far more than gambling. We use it to describe the likelihood of rain, the risk of disease, or the uncertainty in an election forecast. Yet, outside of games of chance, the meaning of probability is often less obvious. In practice, when we use probability to analyze data, we are almost always assuming that a probabilistic model applies. Sometimes this assumption is well motivated; other times, it is simply a convenient approximation.
It is important to keep this in mind: when our probability model is wrong, our conclusions may be wrong as well.
Below we go over the five main ways we connect probability to data.
Random sampling
In polling, we often assume that the people we ask are a random sample from the population of interest. For example, imagine putting the phone numbers of all eligible voters into a large bag, drawing 1,000 at random, then calling them and asking who they support. If this process were truly random, the statistical methods we will introduce would apply directly.
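To make the analogy concrete, here is a minimal simulation sketch in Python; the population size, the 52% support rate, and the use of NumPy are assumptions made only for illustration, not part of any real poll:

```python
import numpy as np

# Hypothetical population: 1 means the voter supports the candidate, 0 means they do not.
# The population size and the 52% support rate are made-up values for illustration.
rng = np.random.default_rng(seed=1)
population = rng.binomial(1, 0.52, size=1_000_000)

# Draw 1,000 voters at random without replacement, like pulling phone numbers from a bag.
sample = rng.choice(population, size=1000, replace=False)

# The sample proportion estimates the population proportion, up to random sampling error.
print(sample.mean())
```

Repeating the draw many times would show how the sample proportion varies around the population value, which is exactly the kind of variability the methods in this part of the book quantify.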
This assumption is also common in animal or laboratory studies. When a researcher orders mice from a breeding facility, the supplier’s shipment is often treated as a random sample from all possible mice in that strain. This makes it possible to apply probability models in a way very similar to our games of chance.
Random assignment
A second setting where probability connects directly to data is randomized experiments. In clinical trials, for instance, patients are assigned to a treatment or control group by chance, often by flipping a coin. This ensures that, on average, the groups are comparable, and that any difference in outcomes can be attributed to the treatment itself. The same logic underlies what is today called A/B testing, widely used in technology and business settings to compare two versions of a product, website, or algorithm. Whether in medicine or online platforms, these experiments rely on the same fundamental idea: random assignment creates conditions where probability can be used to infer causation. In these examples, as in the urn or coin examples, the link between probability and data is clear.
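As a rough sketch of this idea, the simulation below randomly assigns hypothetical participants to treatment or control by a coin flip; the sample size, the treatment effect of 0.2, and the noise level are all made-up values:

```python
import numpy as np

rng = np.random.default_rng(seed=2)
n = 1000  # hypothetical number of participants

# Random assignment: each participant joins the treatment group with probability 1/2.
treatment = rng.binomial(1, 0.5, size=n)

# Hypothetical outcomes: a small treatment benefit plus individual random variation.
outcome = 0.2 * treatment + rng.normal(0, 1, size=n)

# Because assignment was random, the difference in group means estimates the causal effect.
effect_estimate = outcome[treatment == 1].mean() - outcome[treatment == 0].mean()
print(effect_estimate)
```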
When randomness is assumed
In many fields, however, we cannot truly randomize. Social scientists cannot randomly assign income levels or education. Epidemiologists cannot assign smoking habits or exposure to pollutants. In these situations, we often assume that the data we have are a random sample from a population of interest, say, all adults in a city or all patients with a certain condition.
This assumption may or may not be valid. If the data come from volunteers, a single hospital, or an online survey, the sample might not represent the broader population. When this happens, our probabilistic models break down, and the associations we observe may not reflect reality. Later in the book (see Chapter 20), we will discuss these problems in detail and show how biases can arise when these assumptions fail.
This same assumption underlies much of machine learning. When we train an algorithm, we almost never use all possible data but instead a training subset. We then expect the model to generalize to new, unseen data. This expectation depends on the idea that the training data are a random sample from the same underlying population as future data. For this reason, modern machine learning models often express predictions as probabilities rather than certainties: an algorithm might predict a 90% probability that an email is spam rather than declaring it spam outright, or estimate a 5% probability that a borrower will default on a loan. These probabilities, and the confidence we place in them, all rest on probabilistic assumptions about how our data were generated and how representative our samples truly are.
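The sketch below illustrates this idea with simulated data and scikit-learn; the single feature, the labels, and the 50/50 split are assumptions chosen only to show how a randomly split training set can support probabilistic predictions on unseen data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical data: one feature (for example, a spam score) and a binary label.
rng = np.random.default_rng(seed=3)
x = rng.normal(0, 1, size=(2000, 1))
y = rng.binomial(1, 1 / (1 + np.exp(-2 * x[:, 0])))

# Randomly split the data: the training set is treated as a random sample
# from the same population that future, unseen data will come from.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.5, random_state=3)

model = LogisticRegression().fit(x_train, y_train)

# The model reports probabilities rather than certainties for new examples.
print(model.predict_proba(x_test[:5]))
```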
The same kind of probabilistic thinking introduced in this part of the book, about randomness, uncertainty, and sampling, forms the foundation of these more advanced methods. Understanding how and why we model uncertainty will help you interpret such predictions and evaluate the assumptions behind them.
Measurement error
Another common use of probability models arises when we treat errors as random. For example, each time a doctor measures a patient’s height, the result may differ slightly due to small mistakes in positioning or reading the scale. Similar errors occur when measuring weight, blood pressure, or tumor size. We often assume these discrepancies are random, sometimes called measurement error, and model them probabilistically. Even when measurement errors are small, acknowledging their randomness allows us to quantify uncertainty and build more reliable models.
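Here is a minimal sketch of this kind of model, assuming a hypothetical true height of 170 cm and an error standard deviation of 0.5 cm, both made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(seed=4)
true_height = 170.0  # hypothetical true height in centimeters

# Each measurement equals the true height plus a small random error.
measurements = true_height + rng.normal(0, 0.5, size=10)
print(measurements)

# Averaging repeated measurements reduces the effect of random measurement error.
print(measurements.mean())
```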
Natural variation
Finally, many natural processes appear random even when no measurement or sampling is involved. The height of siblings, the number of earthquakes in a year, whether a particular cell becomes cancerous, or the number of photons hitting a telescope’s detector, all vary in ways that seem unpredictable. Here, probability serves as a modeling tool: a concise way to describe complex systems whose details are too intricate or chaotic to predict exactly.
In such cases, probabilistic models do not claim that nature itself is random. Rather, they acknowledge our limited ability to predict and summarize uncertainty in a useful way.
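For instance, we might describe the yearly number of earthquakes with a Poisson model, not because earthquakes are truly random, but as a concise summary of their variability. A minimal sketch, assuming a made-up rate of 15 earthquakes per year:

```python
import numpy as np

rng = np.random.default_rng(seed=5)

# A common modeling choice for yearly counts is the Poisson distribution.
# The rate of 15 earthquakes per year is a made-up value for illustration.
earthquakes_per_year = rng.poisson(lam=15, size=20)

# The counts vary from year to year even though no sampling or measurement is involved.
print(earthquakes_per_year)
```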
4.2 The role of design
Because most real-world data do not come from ideal random processes, our ability to draw valid conclusions depends heavily on the design of studies and experiments. As the statistician Ronald Fisher famously remarked, “To consult the statistician after an experiment is finished is often merely to ask him to perform a post-mortem examination.” Good design, including randomization, proper sampling, and careful data collection, ensures that probability models are meaningful and that uncertainty can be quantified correctly.
This book does not cover experimental design in detail, but in the Further Reading section we recommend several texts that do. For now, the key idea is this:
when we use probability to reason about data, we are assuming a connection between a mathematical model and a real process. The closer that connection is to random sampling or random assignment, the more trustworthy our conclusions will be.
4.3 Exercises
1. Suppose a national poll reports that 52% of voters support a candidate, based on a random sample of 1,000 people.
   a. Explain what “random sample” means in this context.
   b. Describe at least two ways the real sampling process might deviate from true randomness.
2. A clinical trial randomly assigns 500 participants to treatment and 500 to control.
   a. Is this an example of random sampling or random assignment?
   b. How does random assignment help us draw causal conclusions?
   c. What assumptions must still hold for the trial’s conclusions to generalize to the broader population?
3. A researcher records each patient’s blood pressure twice and finds that the two measurements differ by about 5 mmHg on average. Explain how measurement error can be viewed as a source of randomness.
4. You are studying caffeine consumption among college students, but your survey responses come primarily from biology majors.
   a. Why might treating this sample as “random” lead to misleading conclusions?
   b. Suggest at least one way to improve the representativeness of the data collection process.
   c. Describe a real-world scenario where improving representativeness might be difficult or impossible.
5. For each of the following, identify the primary source of randomness being modeled (sampling variation, measurement error, natural variation, or random assignment):
   a. Estimating unemployment from a household survey.
   b. Measuring air temperature every hour for a week.
   c. Testing whether a fertilizer increases crop yield.
   d. Comparing test scores from different schools.
   e. Predicting how many hurricanes will form next year.