4 Connecting Data and Probability
In this part of the book, we use games of chance (tossing coins, drawing beads, spinning a roulette wheel) to illustrate the principles of probability. These examples are valuable because the probabilistic mechanism is perfectly known. We can compute exact probabilities, repeat experiments as many times as we like, and verify the results mathematically or by simulation.
In real-world data analysis, the connection between probability and observation is rarely so clear. The systems we study, such as people, hospitals, weather, economies, and biological processes, are complex. We use probability not because these systems are necessarily truly random, but because probabilistic models provide a practical way to describe and reason about unexplained variability and uncertainty. In this chapter, we introduce some of the ways real-world problems can be connected to probabilistic models, so you can begin to recognize these connections as you learn about probability theory in the following chapters.
4.1 Probability models in the real world
Today, probability underlies far more than gambling. We use it to describe the likelihood of rain, the risk of disease, or the uncertainty in an election forecast. Yet, outside of games of chance, the meaning of probability is often less obvious. In practice, when we use probability to analyze data, we are almost always assuming that a probabilistic model applies. Sometimes this assumption is well motivated; other times, it is simply a convenient approximation.
It is important to keep this in mind: when our probability model is wrong, our conclusions may be wrong as well.
Below we go over the main ways we connect probability to data.
Random sampling
In polling, we often assume that the people who answer our questions are a random sample from the population of interest. For example, imagine putting the phone numbers of all eligible voters into a large bag and drawing 1,000 at random, then calling them and asking who they support. If this process were truly random, the statistical methods we introduce later would apply directly.
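To make this concrete, the sketch below simulates the bag-of-phone-numbers thought experiment. The population size, the 53% support rate, and the random seed are made-up values for illustration, not figures from any real poll.

```python
import random

# Hypothetical population: 1 means the voter supports the candidate.
# The 53% support rate is invented purely for illustration.
random.seed(2024)
population = [1] * 530_000 + [0] * 470_000

# Draw 1,000 voters at random, as if pulling phone numbers from a bag.
sample = random.sample(population, 1_000)
estimate = sum(sample) / len(sample)

print(f"True support:    {sum(population) / len(population):.3f}")
print(f"Sample estimate: {estimate:.3f}")
```

Repeating the draw gives a slightly different estimate each time; this sample-to-sample variability is exactly what the probability theory introduced in the following chapters lets us quantify.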
This assumption is also common in animal or laboratory studies. When a researcher orders mice from a breeding facility, the supplier’s shipment is often treated as a random sample from all possible mice in that strain. This makes it possible to apply probability models in a way very similar to our games of chance.
Random assignment
A second setting where probability connects directly to data is randomized experiments. In clinical trials, for instance, patients are assigned to a treatment or control group by chance, often by flipping a coin. This ensures that, on average, the groups are comparable, and that any difference in outcomes can be attributed to the treatment itself. The same logic underlies what is today called A/B testing, widely used in technology and business settings to compare two versions of a product, website, or algorithm. Whether in medicine or online platforms, these experiments rely on the same fundamental idea: random assignment creates conditions where probability can be used to infer causation. In these examples, as in urn or coin examples, the link between probability and data is clear.
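As a rough sketch of why chance assignment makes groups comparable, the toy simulation below randomly splits 1,000 hypothetical patients into treatment and control groups of 500 and compares a characteristic measured before treatment; the age distribution is invented for illustration.

```python
import random
import statistics

random.seed(7)

# 1,000 hypothetical patients with a pre-existing characteristic (age);
# the normal(55, 10) distribution is invented purely for illustration.
ages = [random.gauss(55, 10) for _ in range(1_000)]

# Random assignment: shuffle the patients, then split them in half.
random.shuffle(ages)
treatment, control = ages[:500], ages[500:]

# Chance alone makes the two groups comparable before treatment.
print(f"Mean age, treatment group: {statistics.mean(treatment):.1f}")
print(f"Mean age, control group:   {statistics.mean(control):.1f}")
```

Because neither the patients nor the researchers choose the groups, any pre-treatment difference between them is due to chance alone, which is what justifies attributing differences in outcomes to the treatment.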
When randomness is assumed
In many real-world analyses, data are not collected through random sampling or controlled experiments. Instead, most datasets we work with are convenience samples, collected from individuals who are easy to reach rather than randomly selected from a well-defined population. Examples include participants in online surveys, patients from a particular hospital, students in a specific class, or volunteers in a research study.
When we fit probability models to such data, we implicitly assume that the sample behaves as if it were random. This assumption is rarely stated explicitly, yet it underlies many analyses that report results such as \(p\)-values, confidence intervals, or predicted probabilities. These quantities all rely on probabilistic reasoning, which presumes that the data can be treated as draws from a random process.
The same assumption appears in machine learning. When we train a model, we use only a subset of available data and expect the model to generalize to unseen cases. This expectation depends on the idea that the training data are, for all practical purposes, a random sample from the same population as the future data. In practice, however, these training sets are often convenience samples as well.
Despite this, such models and analyses often work remarkably well and have provided valuable insights. For example, studies using observational data have convincingly shown that smoking harms health and that regular exercise and good diet extend life. But when data come from convenience samples, extra care is needed. The assumptions behind our probability models become especially important. Treating a convenience sample as if it were a simple random sample can lead to biased and misleading conclusions.
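A toy simulation can make the danger concrete. All numbers below are invented: the population consists of two groups with different outcome rates, and the convenience sample reaches one group far more often than the other, while a simple random sample reaches both in proportion.

```python
import random

random.seed(11)

# Hypothetical population in two groups with different outcome rates;
# all numbers are invented for illustration.
group_a = [1 if random.random() < 0.30 else 0 for _ in range(50_000)]
group_b = [1 if random.random() < 0.60 else 0 for _ in range(50_000)]
population = group_a + group_b

# A simple random sample represents both groups in proportion.
srs = random.sample(population, 1_000)

# A convenience sample that mostly reaches group A (one hospital, one
# class, one website) over-represents that group.
convenience = random.sample(group_a, 900) + random.sample(group_b, 100)

print(f"Population rate:      {sum(population) / len(population):.2f}")
print(f"Simple random sample: {sum(srs) / len(srs):.2f}")
print(f"Convenience sample:   {sum(convenience) / len(convenience):.2f}")
```

The convenience sample misses the population value not because it is too small, but because of who it reaches; collecting more data from the same convenient source would not fix the problem.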
These biases can sometimes be corrected with statistical techniques, but doing so requires making additional assumptions. Understanding what those assumptions are, and thinking critically about whether they are reasonable, is essential for sound data analysis. Later in the book (see Chapter 19), we will revisit this topic and show concrete examples of how unexamined assumptions about randomness can distort results.
Natural variation
Finally, many natural processes appear random even when no measurement error or sampling is involved. Which genes are inherited, small biometric differences between genetically identical organisms, fluctuations in the stock market, the number of earthquakes in a year, whether a particular cell becomes cancerous, and the number of photons striking a telescope’s detector all vary in ways that seem unpredictable. In these situations, probability serves as a modeling tool: a concise way to describe complex systems whose underlying mechanisms are too intricate, chaotic, or numerous to predict exactly.
In such cases, probabilistic models do not imply that nature itself is random. Rather, they acknowledge our limited ability to predict and give us a useful way to summarize uncertainty.
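As a toy sketch of this idea (not a model of any real detector or fault system; the number of sources and the per-source probability are invented), imagine a large number of independent sources, each with a tiny chance of producing an event during an observation period:

```python
import random

random.seed(3)

# Toy model: many independent sources, each with a tiny chance of
# producing an event (a photon hitting a detector, a fault slipping)
# during one observation period. The numbers are invented.
def one_period(n_sources=100_000, p=0.0002):
    return sum(1 for _ in range(n_sources) if random.random() < p)

# The setup never changes, yet the count varies from period to period.
counts = [one_period() for _ in range(10)]
print(counts)
```

Nothing about the setup changes from one period to the next, yet the counts fluctuate; probability offers a compact description of that fluctuation without claiming to explain each individual event.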
4.2 The role of design
Because most real-world data do not arise from ideal random processes, our ability to draw valid conclusions depends critically on the design of studies and experiments. As the statistician Ronald Fisher famously remarked, “To consult the statistician after an experiment is finished is often merely to ask him to perform a post-mortem examination.” Design is what connects the mathematics of probability to the messy realities of data collection.
A well-designed study, one that uses randomization, proper sampling, and consistent measurement, creates conditions under which probabilistic reasoning is meaningful. Without this foundation, even the most sophisticated analyses can produce misleading results. Random assignment helps isolate cause and effect. Random sampling helps ensure representativeness. Standardized procedures reduce bias and make uncertainty quantifiable.
This book does not cover experimental design in detail, but in the Further Reading section we recommend several excellent introductions. For now, the key idea is this: when we use probability to reason about data, we are implicitly assuming a connection between a mathematical model and a real process. The stronger that connection, through careful design, random sampling, or random assignment, the more trustworthy our conclusions will be. Poor design, on the other hand, weakens this bridge and makes uncertainty far harder to interpret.
4.3 Exercises
1. Suppose a national poll reports that 52% of voters support a candidate, based on a random sample of 1,000 people.
    a. Explain what “random sample” means in this context.
    b. Describe at least two ways the real sampling process might deviate from true randomness.
2. A clinical trial randomly assigns 500 participants to treatment and 500 to control.
    a. Is this an example of random sampling or random assignment?
    b. How does random assignment help us draw causal conclusions?
    c. What assumptions must still hold for the trial’s conclusions to generalize to the broader population?
3. A researcher records each patient’s blood pressure twice and finds that the two measurements differ by about 5 mmHg on average. Explain how measurement error can be viewed as a source of randomness.
4. You are studying caffeine consumption among college students, but your survey responses come primarily from biology majors.
    a. Why might treating this sample as “random” lead to misleading conclusions?
    b. Suggest at least one way to improve the representativeness of the data collection process.
    c. Describe a real-world scenario where improving representativeness might be difficult or impossible.
5. For each of the following, identify the primary source of randomness being modeled: sampling variation, measurement error, natural variation, or random assignment.
    a. Estimating unemployment from a household survey.
    b. Measuring air temperature every hour for a week.
    c. Testing whether a fertilizer increases crop yield.
    d. Comparing test scores from different schools.
    e. Predicting how many hurricanes will form next year.