Data analysis is one of the main focuses of this book. While the computing tools we have introduced are relatively recent developments, data analysis has been around for over a century. Throughout the years, data analysts working on specific projects have come up with ideas and concepts that generalize across many applications. They have also identified common ways to get fooled by apparent patterns in the data and important mathematical realities that are not immediately obvious. The accumulation of these ideas and insights has given rise to the discipline of statistics, which provides a mathematical framework that greatly facilitates the description and formal evaluation of these ideas.
To avoid repeating common mistakes and wasting time reinventing the wheel, a data analyst needs an in-depth understanding of statistics. However, because the discipline is mature, dozens of excellent books have already been published on the topic, so we do not focus on describing the mathematical framework here. Instead, we introduce concepts briefly and then provide detailed case studies demonstrating how statistics is used in data analysis, along with R code implementing these ideas. We also use R code to help elucidate some of the main statistical concepts that are usually described using mathematics. We highly recommend complementing this part of the book with a basic statistics textbook. Two examples are Statistics, by Freedman, Pisani, and Purves, and Statistical Inference, by Casella and Berger. The specific concepts covered in this part are Summary Statistics, Probability, Statistical Inference, Statistical Models, Regression, and Linear Models, which are major topics covered in a statistics course. The case studies we present relate to the financial crisis, forecasting election results, understanding heredity, and building a baseball team.