Chapter 40 Reproducible projects with RStudio and R markdown
The final product of a data analysis project is often a report. Many scientific publications can be thought of as a final report of a data analysis. The same is true for news articles based on data, an analysis report for your company, or lecture notes for a class on how to analyze data. The reports are often on paper or in a PDF that includes a textual description of the findings along with some figures and tables resulting from the analysis.
Imagine that after you finish the analysis and the report, you are told that you were given the wrong dataset, you are sent a new one and you are asked to run the same analysis with this new dataset. Or what if you realize that a mistake was made and need to re-examine the code, fix the error, and re-run the analysis? Or imagine that someone you are training wants to see the code and be able to reproduce the results to learn about your approach?
Situations like the ones just described are actually quite common for a data scientist. Here, we describe how you can keep your data science projects organized with RStudio so that re-running an analysis is straight-forward. We then demonstrate how to generate reproducible reports with R markdown and the knitR package in a way that will greatly help with recreating reports with minimal work. This is possible due to the fact that R markdown documents permit code and textual descriptions to be combined into the same document, and the figures and tables produced by the code are automatically added to the document.
40.1 RStudio projects
RStudio provides a way to keep all the components of a data analysis project organized into one folder and to keep track of information about this project, such as the Git status of files, in one file. In Section 39.6 we demonstrate how RStudio facilitates the use of Git and GitHub through RStudio projects. In this section we quickly demonstrate how to start a new a project and some recommendations on how to keep these organized. RStudio projects also permit you to have several RStudio sessions open and keep track of which is which.
To start a project, click on File and then New Project. Often we have already created a folder to save the work, as we did in Section 38.7 and we select Existing Directory. Here we show an example in which we have not yet created a folder and select the New Directory option.
Then, for a data analysis project, you usually select the New Project option:
Now you will have to decide on the location of the folder that will be associated with your project, as well as the name of the folder. When choosing a folder name, just like with file names, make sure it is a meaningful name that will help you remember what the project is about. As with files, we recommend using lower case letters, no spaces, and hyphens to separate words. We will call the folder for this project my-first-project. This will then generate a Rproj file called my-first-project.Rproj in the folder associated with the project. We will see how this is useful a few lines below.
You will be given options on where this folder should be on your filesystem. In this example, we will place it in our home folder, but this is generally not good practice. As we described in Section 38.7 in the Unix chapter, you want to organize your filesystem following a hierarchical approach and with a folder called projects where you keep a folder for each project.
When you start using RStudio with a project, you will see the project name in the upper right corner. This will remind you what project this particular RStudio session belongs to. When you open an RStudio session with no project, it will say Project: (None).
When working on a project, all files will be saved and searched for in the folder associated with the project. Below, we show an example of a script that we wrote and saved with the name code.R. Because we used a meaningful name for the project, we can be a bit less informative when we name the files. Although we do not do it here, you can have several scripts open at once. You simply need to click File, then New File and pick the type of file you want to edit.
One of the main advantages of using Projects is that after closing RStudio, if we wish to continue where we left off on the project, we simply double click or open the file saved when we first created the RStudio project. In this case, the file is called my-first-project.Rproj. If we open this file, RStudio will start up and open the scripts we were editing.
Another advantage is that if you click on two or more different Rproj files, you start new RStudio and R sessions for each.
40.2 R markdown
R markdown is a format for literate programming documents. It is based on markdown, a markup language that is widely used to generate html pages. You can learn more about markdown here: https://www.markdowntutorial.com/. Literate programming weaves instructions, documentation, and detailed comments in between machine executable code, producing a document that describes the program that is best for human understanding (Knuth 1984). Unlike a word processor, such as Microsoft Word, where what you see is what you get, with R markdown, you need to compile the document into the final report. The R markdown document looks different than the final product. This seems like a disadvantage at first, but it is not because, for example, instead of producing plots and inserting them one by one into the word processing document, the plots are automatically added.
In RStudio, you can start an R markdown document by clicking on File, New File, the R Markdown. You will then be asked to enter a title and author for your document. We are going to prepare a report on gun murders so we will give it an appropriate name. You can also decide what format you would like the final report to be in: HTML, PDF, or Microsoft Word. Later, we can easily change this, but here we select html as it is the preferred format for debugging purposes:
This will generate a template file:
As a convention, we use the Rmd
suffix for these files.
Once you gain experience with R Markdown, you will be able to do this without the template and can simply start from a blank template.
In the template, you will see several things to note.
40.2.1 The header
At the top you see:
---
title: "Report on Gun Murders"
author: "Rafael Irizarry"
date: "April 16, 2018"
output: html_document
---
The things between the ---
is the header. We actually don’t need a header, but it is often useful. You can define many other things in the header than what is included in the template. We don’t discuss those here, but much information is available online. The one parameter that we will highlight is output
. By changing this to, say, pdf_document
, we can control the type of output that is produced when we compile.
40.2.2 R code chunks
In various places in the document, we see something like this:
```{r}
summary(pressure)
```
These are the code chunks. When you compile the document, the R code inside the chunk, in this case summary(pressure)
, will be evaluated and the result included in that position in the final document.
To add your own R chunks, you can type the characters above quickly with the key binding command-option-I on the Mac and Ctrl-Alt-I on Windows.
This applies to plots as well; the plot will be placed in that position. We can write something like this:
```{r}
plot(pressure)
```
By default, the code will show up as well. To avoid having the code show up, you can use an argument. To avoid this, you can use the argument echo=FALSE
. For example:
```{r, echo=FALSE}
summary(pressure)
```
We recommend getting into the habit of adding a label to the R code chunks. This will be very useful when debugging, among other situations. You do this by adding a descriptive word like this:
```{r pressure-summary}
summary(pressure)
```
40.2.3 Global options
One of the R chunks contains a complex looking call:
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
We will not cover this here, but as you become more experienced with R Markdown, you will learn the advantages of setting global options for the compilation process.
40.2.4 knitR
We use the knitR package to compile R markdown documents. The specific function used to compile is the knit
function, which takes a filename as input. RStudio provides a button that makes it easier to compile the document. For the screenshot below, we have edited the document so that a report on gun murders is produced. You can see the file here: https://raw.githubusercontent.com/rairizarry/murders/master/report.Rmd. You can now click on the Knit
button:
The first time you click on the Knit button, a dialog box may appear asking you to install packages you need.
Once you have installed the packages, clicking the Knit will compile your R markdown file and the resulting document will pop up:
This produces an html document which you can see in your working directory. To view it, open a terminal and list the files. You can open the file in a browser and use this to present your analysis. You can also produce a PDF or Microsoft document by changing:
output: html_document
to output: pdf_document
or output: word_document
.
We can also produce documents that render on GitHub using output: github_document
.
This will produce a markdown file, with suffix md
, that renders in GitHub. Because we have uploaded these files to GitHub, you can click on the md
file and you will see the report as a webpage:
This is a convenient way to share your reports.
40.2.5 More on R markdown
There is a lot more you can do with R markdown. We highly recommend you continue learning as you gain more experience writing reports in R. There are many free resources on the internet including:
- RStudio’s tutorial: https://rmarkdown.rstudio.com
- The cheat sheet: https://www.rstudio.com/wp-content/uploads/2015/02/rmarkdown-cheatsheet.pdf
- The knitR book: https://yihui.name/knitr/
40.3 Organizing a data science project
In this section we put it all together to create the US murders project and share it on GitHub.
40.3.1 Create directories in Unix
In Section 38.7 we demonstrated how to use Unix to prepare for a data science project using an example. Here we continue this example and show how to use RStudio. In Section 38.7 we created the following directories using Unix:
40.3.2 Create an RStudio project
In the next section we will use create an RStudio project. In RStudio we go to File and then New Project… and when given the options we pick Existing Directory. We then write the full path of the murders
directory created above.
Once you do this, you will see the rdas
and data
directories you created in the RStudio Files tab.
Keep in mind that when we are in this project, our default working directory will be ~/projects/murders
. You can confirm this by typing getwd()
into your R session. This is important because it will help us organize the code when we need to write file paths.
Pro tip: always use relative paths in code for data science projects. These should be relative to the default working directory. The problem with using full paths is that your code is unlikely to work on filesystems other than yours since the directory structures will be different. This includes using the home directory ~
as part of your path.
40.3.3 Edit some R scripts
Let’s now write a script that downloads a file into the data directory. We will call this file download-data.R
.
The content of this file will be:
url <- "https://raw.githubusercontent.com/rafalab/dslabs/master/inst/
extdata/murders.csv"
dest_file <- "data/murders.csv"
download.file(url, destfile = dest_file)
Notice that we are using the relative path data/murders.csv
.
Run this code in R and you will see that a file is added to the data
directory.
Now we are ready to write a script to read this data and prepare a table that we can use for analysis. Call the file wrangle-data.R
. The content of this file will be:
library(tidyverse)
murders <- read_csv("data/murders.csv")
murders <-murders |> mutate(region = factor(region),
rate = total / population * 10^5)
save(murders, file = "rdas/murders.rda")
Again note that we use relative paths exclusively.
In this file, we introduce an R command we have not seen: save
. The save
command in R saves objects into what is called an rda file: rda is short for R data. We recommend using the .rda
suffix on files saving R objects. You will see that .RData
is also used.
If you run this code above, the processed data object will be saved in a file in the rda
directory. Although not the case here, this approach is often practical because generating the data object we use for final analyses and plots can be a complex and time-consuming process. So we run this process once and save the file. But we still want to be able to generate the entire analysis from the raw data.
Now we are ready to write the analysis file. Let’s call it analysis.R
. The content should be the following:
library(tidyverse)
load("rdas/murders.rda")
murders |> mutate(abb = reorder(abb, rate)) |>
ggplot(aes(abb, rate)) +
geom_bar(width = 0.5, stat = "identity", color = "black") +
coord_flip()
If you run this analysis, you will see that it generates a plot.
40.3.4 Create some more directories using Unix
Now suppose we want to save the generated plot for use in a report or presentation. We can do this with the ggplot command ggsave
. But where do we put the graph? We should be systematically organized so we will save plots to a directory called figs
. Start by creating a directory by typing the following in the terminal:
and then you can add the line:
to your R script. If you run the script now, a png file will be saved into the figs
directory. If we wanted to copy that file to some other directory where we are developing a presentation, we can avoid using the mouse by using the cp
command in our terminal.
40.3.5 Add a README file
You now have a self-contained analysis in one directory. One final recommendation is to create a README.txt
file describing what each of these files does for the benefit of others reading your code, including your future self. This would not be a script but just some notes. One of the options provided when opening a new file in RStudio is a text file. You can save something like this into the text file:
We analyze US gun murder data collected by the FBI.
download-data.R - Downloads csv file to data directory
wrangle-data.R - Creates a derived dataset and saves as R object in rdas
directory
analysis.R - A plot is generated and saved in the figs directory.
40.3.6 Initializing a Git directory
In Section 39.5 we demonstrated how to initialize a Git directory and connect it to the upstream repository on GitHub, which we already created in that section.
We can do this in the Unix terminal:
40.3.7 Add, commit, and push files using RStudio
We can continue adding and committing each file, but it might be easier to use RStudio. To do this, start the project by opening the Rproj file. The git icons should appear and you can add, commit and push using these.
We can now go to GitHub and confirm that our files are there.
You can see a version of this project, organized with Unix directories, on GitHub123.
You can download a copy to your computer by using the git clone
command on your terminal. This command will create a directory called murders
in your working directory, so be careful where you call it from.