15 Wrangling with `data.table`

The first three chapters described how to reshape data, join tables, and parse dates and times with the tidyverse.

library(tidyverse)

This can all be done with data.table as well.

library(data.table)

Here we show the data.table version of some of the tidyverse commands we previously showed. The data.table functions are faster and more efficient with memory. In general, everything you can do with tidyverse can be done with data.table and base R which, although perhaps harder to read, it is often more flexible, faster, and more efficient. Here we show just a few examples, but you can learn others using internet searches or code generation tools.

15.1 Reshaping data

Previously we used this example:

library(dslabs)
path <- system.file("extdata", package = "dslabs")
filename <- file.path(path, "fertility-two-countries-example.csv")

15.1.1 `pivot_longer` is `melt`

If in tidyverse we write

wide_data <- read_csv(filename)
new_tidy_data <- wide_data |>
  pivot_longer(-1, names_to = "year", values_to = "fertility")

in data.table we use the melt function.

dt_wide_data <- fread(filename) 
dt_new_tidy_data  <- melt(dt_wide_data, 
                      measure.vars = 2:ncol(dt_wide_data), 
                      variable.name = "year", 
                      value.name = "fertility")

15.2 `pivot_wider` is `dcast`

If in tidyverse we write

new_wide_data <- new_tidy_data |> 
  pivot_wider(names_from = year, values_from = fertility)

in data.table we use the dcast function.

dt_new_wide_data <- dcast(dt_new_tidy_data, formula = ... ~ year,
                          value.var = "fertility")

15.2.1 Separating variables

path <- system.file("extdata", package = "dslabs")
filename <- "life-expectancy-and-fertility-two-countries-example.csv"
filename <-  file.path(path, filename)

In tidyverse we wrangled using

raw_dat <- read_csv(filename)
dat <- raw_dat |> pivot_longer(-country) |>
  separate_wider_delim(name, delim = "_", names = c("year", "name"), 
                       too_many = "merge") |>
  pivot_wider() |>
  mutate(year = as.integer(year))

In data.table we can use the tstrsplit function:

dt_raw_dat <- fread(filename)
dat_long <- melt(dt_raw_dat, 
                 measure.vars = which(names(dt_raw_dat) != "country"), 
                 variable.name = "name", value.name = "value")
dat_long[, c("year", "name", "name2") := 
           tstrsplit(name, "_", fixed = TRUE, type.convert = TRUE)]
dat_long[is.na(name2), name2 := ""]
dat_long[, name := paste(name, name2, sep = "_")][, name2 := NULL]
dat_wide <- dcast(dat_long, country + year ~ name, value.var = "value")

15.3 Joins

In tidyverse we joined two tables with left_join:

tab <- left_join(murders, results_us_election_2016, by = "state")

In data.table the merge functions works similarly:

tab <- merge(murders, results_us_election_2016, 
             by = "state", all.x = TRUE)

Instead of defining different functions for the different type of joins, merge uses the the logical arguments all (full join), all.x (left join), and all.y (right join).

15.4 Dates and times

The data.table package also includes some of the same functionality as lubridate. For example, it includes the mday, month, and year functions:

data.table::mday(now())
#> [1] 3
data.table::month(now())
#> [1] 1
data.table::year(now())
#> [1] 2024

Other similar functions are second, minute, hour, wday, week, isoweek quarter, yearmon, yearqtr.

The package also includes the class IDate and ITime, which store dates and times more efficiently, convenient for large files with date stamps. You convert dates in the usual R format using as.IDate and as.ITime.

15.5 Exercises

Repeat exercises in Chapter 11, Section 12.1, and Chapter 13 using data.table instead of tidyverse.