Introduction

The demand for skilled data science practitioners in industry, academia, and government is rapidly growing. This book introduces skills that can help you tackle real-world data analysis challenges. These include R programming, data wrangling with dplyr, data visualization with ggplot2, file organization with UNIX/Linux shell, version control with Git and GitHub, and reproducible document preparation with Quarto and knitr. The book is divided into four parts: R, Data Visualization, Data Wrangling, and Productivity Tools. Each part has several chapters meant to be presented as one lecture and includes dozens of exercises distributed across chapters.

Case studies

Throughout the book, we use motivating case studies. In each case study, we try to realistically mimic a data scientist’s experience. For each of the skills covered, we start by asking specific questions and answer these through data analysis. We learn the concepts as a means to answer the questions. Examples of the case studies included in the book are:

Case Study Concept
US murder rates by state R Basics
Student heights Statistical Summaries
Trends in world health and economics Data Visualization
The impact of vaccines on infectious disease rates Data Visualization
Reported student heights Data Wrangling

Who will find this book useful?

This book is meant to be a textbook for a first course in Data Science. No previous knowledge of R is necessary, although some experience with programming may be helpful. To be a successful data analyst implementing these skill requires understanding advanced statistical concepts, such as those covered in Advanced Data Science. If you read and understand all the chapters and complete all the exercises in this book, and understand statistical concepts, you will be well-positioned to perform basic data analysis tasks and you will be prepared to learn the more advanced concepts and skills needed to become an expert.

What does this book cover?

We start by going over the basics of R, the tidyverse, and the data.table package. You learn R throughout the book, but in the first part we go over the building blocks needed to keep learning.

The growing availability of informative datasets and software tools has led to increased reliance on data visualizations in many fields. In the second part we demonstrate how to use ggplot2 to generate graphs and describe important data visualization principles.

The third part uses several examples to familiarize the reader with data wrangling. Among the specific skills we learn are web scraping, using regular expressions, and joining and reshaping data tables. We do this using tidyverse tools.

In the final part, we provide a brief introduction to the productivity tools we use on a day-to-day basis in data science projects. These are RStudio, UNIX/Linux shell, Git and GitHub, Quarto, and knitr.

What is not covered by this book?

This book focuses on the computing skills necessary for the data analysis aspects of data science. As mentioned, we do not cover statistical concepts. We also do not cover aspects related to data management or engineering. Although R programming is an essential part of the book, we do not teach more advanced computer science topics such as data structures, optimization, and algorithm theory. Similarly, we do not cover topics such as web services, interactive graphics, parallel computing, and data streaming processing.