library(dplyr)
library(dslabs)
library(data.table)
5 data.table
In this book, we use tidyverse packages, primarily because they offer readability that is beneficial for beginners. This readability allows us to emphasize data analysis and statistical concepts. However, while tidyverse is beginner-friendly, there are other methods in R that are more efficient and can handle larger datasets more effectively. One such package is data.table, which is widely used in the R community. We’ll briefly introduce data.table in this chapter. For those interested in diving deeper, there are numerous online resources, including the mentioned introduction1.
5.1 Refining data tables
data.table
is a separate package that needs to be installed. Once installed, we then need to load it along with the other packages we will use:
We will provide example code showing the data.table approaches to dplyr’s mutate
, filter
, select
, group_by
, and summarize
shown in Chapter 4. As in that chapter, we will use the murders
dataset:
The first step when using data.table is to convert the data frame into a data.table
object using the as.data.table
function:
murders_dt <- as.data.table(murders)
Without this initial step, most of the approaches shown below will not work.
5.1.1 Column-wise subsetting
Selecting with data.table is done in a similar way to subsetting matrices. While with dplyr we write
select(murders, state, region)
in data.table we use
murders_dt[, c("state", "region")]
We can also use the .()
data.table notation to alert R that variables inside the parenthesis are column names, not objects in the R environment. So the above can also be written like this:
murders_dt[, .(state, region)]
5.1.2 Adding or transformin variables
We learned to use the dplyr mutate
function with this example:
murders <- mutate(murders, rate = total / population * 100000)
data.table uses an approach that avoids a new assignment (update by reference). This can help with large datasets that take up most of your computer’s memory. The data.table :=
function permits us to do this:
murders_dt[, rate := total / population * 100000]
This adds a new column, rate
, to the table. Notice that, as in dplyr, we used total
and population
without quotes.
To define new multiple columns, we can use the :=
function with multiple arguments:
murders_dt[, ":="(rate = total / population * 100000, rank = rank(population))]
5.1.3 Reference versus copy
The data.table package is designed to avoid wasting memory. So if you make a copy of a table, like this:
x <- data.table(a = 1)
y <- x
y
is actually referencing x
, it is not an new object: y
just another name for x
. Until you change y
, a new object will not be made. However, the :=
function changes by reference so if you change x
, a new object is not made and y
continues to be just another name for x
:
x[,a := 2]
y
#> a
#> 1: 2
You can also change x
like this:
y[,a := 1]
x
#> a
#> 1: 1
To avoid this, you can use the copy
function which forces the creation of an actual copy:
x <- data.table(a = 1)
y <- copy(x)
x[,a := 2]
y
#> a
#> 1: 1
Note that the function as.data.table
creates a copy of the data frame being converted. However, if working with a large data frames it is helpful to avoid this by using setDT
:
x <- data.frame(a = 1)
setDT(x)
Note that because no copy is being made the following code does not create a new object:
x <- data.frame(a = 1)
y <- setDT(x)
The objects x
and y
are referencing the same data table:
x[,a := 2]
y
#> a
#> 1: 2
5.1.4 Row-wise subsetting
With dplyr, we filtered like this:
filter(murders, rate <= 0.7)
With data.table, we again use an approach similar to subsetting matrices, except like dplyr, data.table knows that rate
refers to a column name and not an object in the R environment:
murders_dt[rate <= 0.7]
Notice that we can combine the filter and select into one succinct command. Here are the state names and rates for those with rates below 0.7.
murders_dt[rate <= 0.7, .(state, rate)]
#> state rate
#> 1: Hawaii 0.515
#> 2: Iowa 0.689
#> 3: New Hampshire 0.380
#> 4: North Dakota 0.595
#> 5: Vermont 0.320
which is more compact than the dplyr approach:
You are ready to do exercises 1-7.
5.2 Summarizing data
As an example, we will use the heights
dataset:
heights_dt <- as.data.table(heights)
In data.table, we can call functions inside .()
and they will be applied to columns So the equivalent of:
in dplyr is the following in data.table:
Note that this permits a compact way of subsetting and then summarizing. Instead of:
we can write:
5.2.1 Multiple summaries
In Chapter 4, we defined the following function to permit multiple column summaries in dplyr:
median_min_max <- function(x){
qs <- quantile(x, c(0.5, 0, 1))
data.frame(median = qs[1], minimum = qs[2], maximum = qs[3])
}
In data.table we place a function call within .()
to obtain the three number summary:
heights_dt[, .(median_min_max(height))]
5.2.2 Group then summarize
The group_by
followed by summarize
in dplyr is performed in one line in data.table. We simply add the by
argument to split the data into groups based on the values in categorical variable:
5.3 Sorting
We can order rows using the same approach we use for filter. Here are the states ordered by murder rate:
murders_dt[order(population)]
To sort the table in descending order, we can order by the negative of population
or use the decreasing
argument:
murders_dt[order(population, decreasing = TRUE)]
Similarly, we can perform nested ordering by including more than one variable in order:
murders_dt[order(region, rate)]
You are ready to do exercises 8-12.
5.4 Exercises
1. Load the data.table package and the murders
dataset and convert it to data.table
object:
library(data.table)
library(dslabs)
murders_dt <- as.data.table(murders)
Remember you can add columns like this:
murders_dt[, population_in_millions := population / 10^6]
Add a murders
column named rate
with the per 100,000 murder rate as in the example code above.
2. Add a column rank
containing the rank, from highest to lowest murder rate.
3. If we want to only show the states and population sizes, we can use:
murders_dt[, .(state, population)]
Show the state names and abbreviations in murders
.
4. You can show just the New York row like this:
murders_dt[state == "New York"]
You can use other logical vectors to filter rows.
Show the top 5 states with the highest murder rates. From here on, do not change the murders
dataset, just show the result. Remember that you can filter based on the rank
column.
5. We can remove rows using the !=
operator. For example, to remove Florida, we would do this:
no_florida <- murders_dt[state != "Florida"]
Create a new data frame called no_south
that removes states from the South region. How many states are in this category? You can use the function nrow
for this.
6. We can also use %in%
to filter. You can therefore see the data from New York and Texas as follows:
Create a new data frame called murders_nw
with only the states from the Northeast and the West. How many states are in this category?
7. Suppose you want to live in the Northeast or West and want the murder rate to be less than 1. We want to see the data for the states satisfying these options. Note that you can use logical operators with filter
. Here is an example in which we filter to keep only small states in the Northeast region.
murders_dt[population < 5000000 & region == "Northeast"]
Make sure murders
has been defined with rate
and rank
and still has all states. Create a table called my_states
that contains rows for states satisfying both the conditions: they are in the Northeast or West and the murder rate is less than 1. Show only the state name, the rate, and the rank.
For exercises 8-12, we will be using the NHANES data.
library(NHANES)
8. We will provide some basic facts about blood pressure. First let’s select a group to set the standard. We will use 20-to-29-year-old females. AgeDecade
is a categorical variable with these ages. Note that the category is coded like " 20-29"
, with a space in front! Use the data.table package to compute the average and standard deviation of systolic blood pressure as saved in the BPSysAve
variable. Save it to a variable called ref
.
9. Report the min and max values for the same group.
10. Compute the average and standard deviation for females, but for each age group separately rather than a selected decade as in exercise 8. Note that the age groups are defined by AgeDecade
.
11. Repeat exercise 10 for males.
12. For males between the ages of 40-49, compare systolic blood pressure across race as reported in the Race1
variable. Order the resulting table from lowest to highest average systolic blood pressure.