Introduction
The phrase "data science" began gaining significant popularity around 2012, thanks in part to the article "Data Scientist: The Sexiest Job of the 21st Century."¹ This coincided with the rise of a new kind of endeavor in the technology sector and in academic projects during the 2000s: extracting insights from messy, complex, and large datasets, which had become increasingly prevalent with the advent of digital data storage.
Examples include combining data from multiple political pollsters to improve election predictions, scraping athletic department websites to evaluate baseball prospects, analyzing movie ratings from millions of streaming service users to make personalized recommendations, developing software that reads handwritten digits to automate zip code processing, and using advanced measurement technologies to understand the molecular causes of disease. This book is centered on these and other practical examples.
Achieving success in these instances requires collaboration among experts with complementary skills. In this book, our primary focus is data analysis: specifically, the statistical thinking that allows us to draw meaningful conclusions from data. To understand how to analyze data effectively in these examples, we cover key mathematical concepts. Many of these concepts are not new; some were originally developed for different purposes, but they have proven adaptable and useful across a wide range of applications.
The mathematical difficulty in this book varies by topic, and we assume you are already familiar with the required material. However, the recommended readings at the end of each part can help fill in any gaps. The math is not an end in itself; it is a tool for articulating and deepening our understanding of statistical ideas and how they are useful for data analysis.
The same is true for the code. We alternate between base R, data.table, and the tidyverse, choosing whichever is most appropriate for the task at hand. Given how capable large language models (LLMs) have become at writing and explaining code, we do not focus on syntax or programming best practices. Instead, we include code because it connects statistical ideas to data, and therefore to real-world applications.
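For instance, a simple grouped summary can be written in any of the three styles. The sketch below is our own illustration, not code from the book, using a small made-up data frame:

```r
library(data.table)
library(dplyr)

# A made-up data frame for illustration
dat <- data.frame(group = rep(c("a", "b"), each = 3), x = 1:6)

# base R
tapply(dat$x, dat$group, mean)

# data.table
as.data.table(dat)[, .(avg = mean(x)), by = group]

# tidyverse (dplyr)
dat |> group_by(group) |> summarize(avg = mean(x))
```

All three produce the same group means; which one reads best depends on the task and the surrounding code.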
For example, running Monte Carlo simulations on your own computer and generating your own data can make abstract probability concepts tangible. In general, we encourage you to experiment: change parameters, modify the code, or design your own simulations to test and refine your understanding. This process of active exploration is central to developing intuition about randomness, uncertainty, and inference.
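As a concrete illustration (ours, not an example from the book), here is a small base R Monte Carlo simulation of the classic birthday problem: estimating the probability that at least two people in a group of 23 share a birthday.

```r
set.seed(1)   # make the simulation reproducible
B <- 10000    # number of simulated groups
shared <- replicate(B, {
  birthdays <- sample(1:365, 23, replace = TRUE)
  any(duplicated(birthdays))
})
mean(shared)  # Monte Carlo estimate of the probability
# The exact answer, 1 - prod((365:343)/365), is about 0.507;
# try changing the group size and rerunning.
```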
While LLMs can now produce working code for many specific problems, they are not yet capable of doing the main thing we teach here: thinking statistically about data-driven questions. This skill involves formulating questions, visually exploring data, identifying sources of variability, and reasoning about uncertainty. These are human-centered tasks that remain at the heart of data analysis.
The book is divided into six parts: Summary Statistics, Probability, Statistical Inference, Linear Models, High-Dimensional Data, and Machine Learning. The first two parts introduce key statistical ideas through illustrative examples, while the later parts focus on real-world case studies that demonstrate how these ideas come together in practice. Readers already comfortable with probability and statistical theory who are mainly interested in the applied perspective may wish to skip ahead to those later sections.
Each part is organized into concise chapters designed to fit within a single lecture and paired with exercises for practice. All datasets used in the book are available in the dslabs package, and the complete Quarto source files can be found on GitHub.
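To get started, you can install dslabs from CRAN and load one of its datasets, for example:

```r
# install.packages("dslabs")  # run once to install from CRAN
library(dslabs)
head(murders)  # US gun murders by state, one of the included datasets
```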
¹ https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century