Introduction

The phrase data science began gaining significant popularity around 2012, thanks in part to the article titled “Data Scientist: The Sexiest Job of the 21st Century”¹. This coincided with the rise of a new kind of endeavor in the technology sector and in some academic projects during the 2000s: the extraction of insights from messy, complex, and large datasets, which had become increasingly prevalent with the advent of digital data storage.

Some examples include using data from various political pollsters to improve election predictions, extracting information from athletic department websites to evaluate baseball prospects, analyzing movie ratings from all streaming service users to make personalized recommendations, developing software to read zip codes by digitizing written digits, and using advanced measurement technologies to understand the molecular causes of diseases. This book is centered on these and other practical examples.

Achieving success in these instances involves a collaborative effort by a team of experts with different but complementary skills. In this book, our primary focus is on data analysis. To understand how to analyze data effectively in the examples above, we will cover key mathematical concepts. Some of these concepts are not new and were originally developed for different purposes, but they have proven to be adaptable and useful in various contexts.

Over the past several decades, data analysts have developed ideas, concepts, and methodologies applicable across a broad range of projects. They’ve also identified common ways to get fooled by apparent patterns in the data and important mathematical realities that are not immediately obvious. This collective wisdom has evolved into the field of Statistics, a discipline offering a mathematical framework to simplify the articulation and rigorous assessment of these concepts. For a data analyst, it’s crucial to have a comprehensive understanding of this field to prevent repeated errors and unnecessary reinvention of methodologies.

There is no shortage of exceptional Statistics textbooks detailing this mathematical framework. In this book, we emphasize bridging theory and practice, applying these concepts to real-world challenges using data examples and in-depth case studies. We provide representative case studies that mirror what a practicing data analyst experiences. In each case study, we present and break down the R code applied to solve the problem. We also use R code to elucidate key statistical concepts often discussed in a mathematical context.

The book is divided into six parts: Summary Statistics, Probability, Statistical Inference, Linear Models, High Dimensional Data, and Machine Learning. Although the first two parts use data examples to illustrate concepts, the real-world case studies don’t appear until the third part. Each part comprises several chapters, each roughly designed for a single lecture and including a variety of exercises. All data referenced in the book is included in the dslabs package, and all the Quarto code used to generate the book is available on GitHub².

Who will find this book useful?

This book is meant to be a textbook for a second course in Data Science with a focus on data analysis. Previous knowledge of R, such as that covered in Introduction to Data Science, is necessary. If you read and understand all the chapters and complete all the exercises in this book, you will be well-positioned to perform advanced data analysis tasks and you will be prepared to learn the more advanced concepts and skills needed to become an expert.

What is not covered by this book?

This book focuses on the application of statistical and machine learning methods in data analysis. We do not go in depth into the theoretical aspects of the methods, and we highly recommend complementing this book with probability and statistics textbooks. We also do not cover aspects related to data management or engineering. Although R programming is an essential part of the book, we do not teach more advanced computer science topics such as data structures, optimization, and algorithm theory. Similarly, we do not cover topics such as web services, interactive graphics, parallel computing, and streaming data processing.

¹ https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century
² https://github.com/rafalab/dsbook-part-2