Generally speaking, we do not recommend using point-and-click approaches for data analysis. Instead, we recommend scripting languages, such as R, since they are more flexible and greatly facilitate reproducibility. Similarly, we recommend against the use of point-and-click approaches to organizing files and document preparation. In this chapter, we demonstrate alternative approaches. Specifically, we will learn to use freely available tools that, although at first may seem cumbersome and non-intuitive, will eventually make you a much more efficient and productive data scientist.
Three general guiding principles that motivate what we learn here are 1) be systematic when organizing your filesystem, 2) automate when possible, and 3) minimize the use of the mouse. As you become more proficient at coding, you will find that 1) you want to minimize the time you spend remembering what you called a file or where you put it, 2) if you find yourself repeating the same task over and over, there is probably a way to automate, and 3) anytime your fingers leave the keyboard, it results in loss of productivity.
A data analysis project is not always a dataset and a script. A typical data analysis challenge may involve several parts, each involving several data files, including files containing the scripts we use to analyze data. Keeping all this organized can be challenging. We will learn to use the Unix shell as a tool for managing files and directories on your computer system. Using Unix will permit you to use the keyboard, rather than the mouse, when creating folders, moving from directory to directory, and renaming, deleting, or moving files. We also provide specific suggestions on how to keep the filesystem organized.
The data analysis process is also iterative and adaptive. As a result, we are constantly editing our scripts and reports. In this chapter, we introduce you to the version control system Git, which is a powerful tool for keeping track of these changes. We also introduce you to GitHub110, a service that permits you to host and share your code. We will demonstrate how you can use this service to facilitate collaborations. Keep in mind that another positive benefit of using GitHub is that you can easily showcase your work to potential employers.
Finally, we learn to write reports in R markdown, which permits you to incorporate text and code into a single document. We will demonstrate how, using the
knitr package, we can write reproducible and aesthetically pleasing reports by running the analysis and generating the report simultaneously.
We will put all this together using the powerful integrated desktop environment RStudio111. Throughout the chapter we will be building up an example on US gun murders. The final project, which includes several files and folders, can be seen here: https://github.com/rairizarry/murders. Note that one of the files in that project is the final report: https://github.com/rairizarry/murders/blob/master/report.md.