Introduction

The phrase data science gained significant popularity around 2012, thanks in part to the article “Data Scientist: The Sexiest Job of the 21st Century” [1]. This coincided with the rise of a new kind of endeavor in the technology sector and in academic projects during the 2000s: extracting insights from messy, complex, and large datasets, which had become increasingly prevalent with the advent of digital data storage.

Examples include combining data from multiple political pollsters to improve election predictions, scraping athletic department websites to evaluate baseball prospects, analyzing movie ratings from millions of streaming service users to make personalized recommendations, developing software to read zip codes by digitizing handwritten digits, and using advanced measurement technologies to understand the molecular causes of diseases. This book is centered around these and other practical examples.

Achieving success in these instances requires collaboration among experts with complementary skills. In this book, our primary focus is on data analysis. To understand how to analyze data effectively in these examples, we will cover key mathematical concepts. Many of these concepts are not new, and some were originally developed for different purposes, but they have proven adaptable and useful across a wide range of applications.

Over several decades, data analysts have developed ideas, concepts, and methodologies that apply broadly across projects. They have also identified common ways analysts can be misled by apparent patterns in the data, as well as important mathematical truths that are not immediately obvious. This collective wisdom has evolved into the field of Statistics, which offers a mathematical framework to articulate and rigorously assess these ideas. For a data analyst, having a strong foundation in Statistics is essential to avoid repeating mistakes and reinventing methods that are already well understood.

There is no shortage of excellent Statistics textbooks describing this framework, and we reference several of them in Recommended Reading sections throughout the book. Our emphasis, however, is on bridging theory and practice: we apply statistical concepts to real-world problems through case studies that mirror what a practicing data analyst encounters, and we present the R code used to solve each problem. To illustrate different styles of working with data, we use base R, data.table, and the tidyverse, choosing whichever tool is most appropriate for the task at hand.

The book is divided into six sections: Summary Statistics, Probability, Statistical Inference, Linear Models, High Dimensional Data, and Machine Learning. While the first two sections use data examples to illustrate concepts, the real-world case studies begin in the third section. Each section comprises several chapters, each designed to fit into a single lecture and accompanied by exercises. All data referenced in the book is available in the dslabs package, and the Quarto source code for the book is available on GitHub [2].


  1. https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century

  2. https://github.com/rafalab/dsbook-part-2