Introduction to Data Science

Data Wrangling and Visualization with R


This is the website for the Data Wrangling and Visualization with R part of Introduction to Data Science.

The website for the Statistics and Prediction Algorithms Through Case Studies part is here1.

We make announcements related to the book on Twitter. For updates follow @rafalab2.

This book started out as the class notes used in the HarvardX Data Science Series3.

The Quarto files used to generate the book is available on GitHub4. Note that, the graphical theme used for plots throughout the book can be recreated using the ds_theme_set() function from dslabs package.

This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International CC BY-NC-SA 4.05.

A hardcopy version of the book is available from CRC Press6.

A free PDF of the October 24, 2019 version of the book is available from Leanpub7.


This book is dedicated to all the people involved in building and maintaining R and the R packages we use in this book. A special thanks to the developers and maintainers of base R, the tidyverse, data.table, and the caret package.

A special thanks to Jenna Landy for her careful editing and helpful advice on this book; to David Robinson for generously answering many questions about the tidyverse and aiding in my understanding of it; and to Amy Gill for dozens of comments, edits, and suggestions. Also, many thanks to Stephanie Hicks who twice served as a co-instructor in my data science classes and Yihui Xie who patiently put up with my many questions related to markdown. Thanks also to Karl Broman, from whom I borrowed ideas for the Data Visualization and Productivity Tools parts. Thanks to Peter Aldhous from whom I borrowed ideas for the principles of data visualization section and Jenny Bryan for writing Happy Git and GitHub for the useR, which influenced our Git chapters. Also, many thanks to Jeff Leek, Roger Peng, and Brian Caffo, whose onlile classes inspired the way this book is divided, to Garrett Grolemund and Hadley Wickham for making the markdown code for their R for Data Science book open, and the editors, John Kimmel and Lara Spieker, for their support. Finally, thanks to Alex Nones for proofreading the manuscript during its various stages.

This book was conceived during the teaching of several applied statistics courses, starting over fifteen years ago. The teaching assistants working with me throughout the years made important indirect contributions to this book. The material was further refined during a HarvardX series coordinated by Heather Sternshein and Zofia Gajdos. We thank them for their contributions. We are also grateful to all the students whose questions and comments helped us improve the book. The courses were partially funded by NIH grant R25GM114818. We are very grateful to the National Institutes of Health for its support.

A special thanks goes to all those who edited the book via GitHub pull requests or made suggestions by creating an issue or sending an email: nickyfoto (Huang Qiang), desautm (Marc-André Désautels), michaschwab (Michail Schwab), alvarolarreategui (Alvaro Larreategui), jakevc (Jake VanCampen), omerta (Guillermo Lengemann), espinielli (Enrico Spinielli), asimumba(Aaron Simumba), braunschweig (Maldewar), gwierzchowski (Grzegorz Wierzchowski), technocrat (Richard Careaga), atzakas, defeit (David Emerson Feit), shiraamitchell (Shira Mitchell), Nathalie-S, andreashandel (Andreas Handel), berkowitze (Elias Berkowitz), Dean-Webb (Dean Webber), mohayusuf, jimrothstein, mPloenzke (Matthew Ploenzke), NicholasDowand (Nicholas Dow), kant (Darío Hereñú), debbieyuster (Debbie Yuster), tuanchauict (Tuan Chau), phzeller, BTJ01 (BradJ), glsnow (Greg Snow), mberlanda (Mauro Berlanda), wfan9, larswestvang (Lars Westvang), jj999 (Jan Andrejkovic), Kriegslustig (Luca Nils Schmid), odahhani, aidanhorn (Aidan Horn), atraxler (Adrienne Traxler), alvegorova,wycheong (Won Young Cheong), med-hat (Medhat Khalil), kengustafson, Yowza63, ryan-heslin (Ryan Heslin), raffaem, tim8west, jooleer, pauluhn (Paul), tci1, beanb2 (Brennan Bean), edasdemirlab (Erdi Dasdemir), jimnicholls (Jim Nicholls), JimKay1941 (Jim Kay), David D. Kane, El Mustapha El Abbassi, Vadim Zipunnikov, Anna Quaglieri, Chris Dong, Bowen Gu, and Rick Schoenberg.