Introduction to Data Science

Statistics and Prediction Algorithms Through Case Studies


This is the website for the Statistics and Prediction Algorithms Through Case Studies part of Introduction to Data Science.

The website for the Data Wrangling and Visualization with R part is here.

This book started out as part of the class notes used in the HarvardX Data Science Series.

A hardcopy version of the first edition of the book, which combined both parts, is available from CRC Press.

A free PDF of the October 24, 2019 version of the book, which combined both parts, is available from Leanpub.

The Quarto code used to generate the book is available on GitHub. Note that the graphical theme used for plots throughout the book can be recreated using the ds_theme_set() function from the dslabs package.

This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license (CC BY-NC-SA 4.0).

We make announcements related to the book on Twitter. For updates, follow @rafalab.


A special thanks to my tidyverse guru David Robinson and Amy Gill for dozens of comments, edits, and suggestions. Also, many thanks to Stephanie Hicks, who twice served as a co-instructor in my data science classes, and Yihui Xie, who patiently put up with my many questions about bookdown. Thanks also to Héctor Corrada-Bravo for advice on how best to teach machine learning. Thanks to Alyssa Frazee for helping create the homework problem that became the Recommendation Systems case study. Also, many thanks to Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund for making the Quarto code for their R for Data Science book open. Finally, thanks to Alex Nones for proofreading the manuscript during its various stages.

This book was conceived during the teaching of several applied statistics courses, starting over fifteen years ago. The teaching assistants working with me throughout the years made important indirect contributions to this book. The latest iteration of this course is a HarvardX series coordinated by Heather Sternshein and Zofia Gajdos. We thank them for their contributions. We are also grateful to all the students whose questions and comments helped us improve the book. The courses were partially funded by NIH grant R25GM114818. We are very grateful to the National Institutes of Health for its support.

A special thanks goes to all those who edited the book via GitHub pull requests or made suggestions by creating an issue or sending an email: nickyfoto (Huang Qiang), desautm (Marc-André Désautels), michaschwab (Michail Schwab), alvarolarreategui (Alvaro Larreategui), jakevc (Jake VanCampen), omerta (Guillermo Lengemann), espinielli (Enrico Spinielli), asimumba (Aaron Simumba), braunschweig (Maldewar), gwierzchowski (Grzegorz Wierzchowski), technocrat (Richard Careaga), atzakas, defeit (David Emerson Feit), shiraamitchell (Shira Mitchell), Nathalie-S, andreashandel (Andreas Handel), berkowitze (Elias Berkowitz), Dean-Webb (Dean Webber), mohayusuf, jimrothstein, mPloenzke (Matthew Ploenzke), NicholasDowand (Nicholas Dow), kant (Darío Hereñú), debbieyuster (Debbie Yuster), tuanchauict (Tuan Chau), phzeller, BTJ01 (BradJ), glsnow (Greg Snow), mberlanda (Mauro Berlanda), wfan9, larswestvang (Lars Westvang), jj999 (Jan Andrejkovic), Kriegslustig (Luca Nils Schmid), odahhani, aidanhorn (Aidan Horn), atraxler (Adrienne Traxler), alvegorova, wycheong (Won Young Cheong), med-hat (Medhat Khalil), biscotty666 (Brian Carey), kengustafson, Yowza63, ryan-heslin (Ryan Heslin), raffaem, tim8west, David D. Kane, El Mustapha El Abbassi, Vadim Zipunnikov, Anna Quaglieri, Chris Dong, Rick Schoenberg, Isabella Grabski, and Doug Snyder.