Introduction

Over the years, data analysts have developed ideas, concepts, and methodologies applicable across a broad range of projects. They’ve also identified common ways to get fooled by apparent patterns in the data and important mathematical realities that are not immediately obvious. This collective wisdom has evolved into the field of Statistics, a discipline offering a mathematical framework to simplify the articulation and rigorous assessment of these concepts. For a data analyst, it’s crucial to have a comprehensive understanding of this field to prevent repeated errors and unnecessary reinvention of methodologies.

There is no shortage of exceptional Statistics textbooks detailing this mathematical framework. In this book, however, we emphasize bridging theory and practice by applying these concepts to real-world challenges through data examples and in-depth case studies. The case studies are representative of what a practicing data analyst experiences. They include election forecasting, baseball team construction, biology experiments, movie recommendation systems, and deciphering handwritten digits. In each case study, we present and break down the R code used to solve the problem. We also use R code to elucidate key statistical concepts that are usually discussed in a mathematical context.

The book is divided into six parts: Summary Statistics, Probability, Statistical Inference, Linear Models, High Dimensional Data, and Machine Learning. Although the first two parts use data examples to illustrate concepts, real-world case studies don’t appear until the third part. Each part comprises several chapters, each designed to fit roughly a single lecture and including a variety of exercises. All data referenced in the book is included in the dslabs package, and all the Quarto code used to generate the book is available on GitHub [1].

Who will find this book useful?

This book is intended as a textbook for a second course in Data Science. Previous knowledge of R, such as that covered in Introduction to Data Science, is necessary. If you read and understand all the chapters and complete all the exercises in this book, you will be well positioned to perform advanced data analysis tasks and prepared to learn the more advanced concepts and skills needed to become an expert.

What is not covered by this book?

This book focuses on the application of statistical and machine learning methods in data analysis. We do not cover the theoretical aspects of these methods in depth, and we highly recommend complementing this book with probability and statistics textbooks.


  1. https://github.com/rafalab/dsbook-part-2