11 Reshaping data
As we have seen through the book, having data in tidy format is what makes the tidyverse flow. After the first step in the data analysis process, importing data, a common next step is to reshape the data into a form that facilitates the rest of the analysis. The tidyr package, part of tidyverse, includes several functions that are useful for tidying data.
We will use the fertility wide format dataset described in Section 4.1 as an example in this section.
11.1 pivot_longer
One of the most used functions in the tidyr package is pivot_longer
, which is useful for converting wide data into tidy data.
As with most tidyverse functions, the pivot_longer
function’s first argument is the data frame that will be converted. Here we want to reshape the wide_data
dataset so that each row represents a fertility observation, which implies we need three columns to store the year, country, and the observed value. In its current form, data from different years are in different columns with the year values stored in the column names. Through the names_to
and values_to
argument we will tell pivot_longer
the column names we want to assign to the columns containing the current column names and observations, respectively. The default names are name
and value
, which are often usable choices. In this case a better choice for these two arguments would be year
and fertility
. Note that nowhere in the data file does it tell us this is fertility data. Instead, we deciphered this from the file name. Through cols
, the second argument, we specify the columns containing observed values; these are the columns that will be pivoted. The default is to pivot all columns so, in most cases, we have to specify the columns. In our example we want columns 1960
, 1961
up to 2015
.
The code to pivot the fertility data therefore looks like this:
new_tidy_data <- wide_data |>
pivot_longer(`1960`:`2015`, names_to = "year", values_to = "fertility")
We can see that the data have been converted to tidy format with columns year
and fertility
head(new_tidy_data)
#> # A tibble: 6 × 3
#> country year fertility
#> <chr> <chr> <dbl>
#> 1 Germany 1960 2.41
#> 2 Germany 1961 2.44
#> 3 Germany 1962 2.47
#> 4 Germany 1963 2.49
#> 5 Germany 1964 2.49
#> # ℹ 1 more row
and that each year resulted in two rows since we have two countries and this column was not pivoted. A somewhat quicker way to write this code is to specify which column will not include in the pivot, rather than all the columns that will be pivoted:
new_tidy_data <- wide_data |>
pivot_longer(-country, names_to = "year", values_to = "fertility")
The new_tidy_data
object looks like the original tidy_data
we defined this way
with just one minor difference. Can you spot it? Look at the data type of the year column. The pivot_longer
function assumes that column names are characters. So we need a bit more wrangling before we are ready to make a plot. We need to convert the year column to be numbers:
new_tidy_data <- wide_data |>
pivot_longer(-country, names_to = "year", values_to = "fertility") |>
mutate(year = as.integer(year))
Now that the data is tidy, we can use this relatively simple ggplot code:
new_tidy_data |>
ggplot(aes(year, fertility, color = country)) +
geom_point()
11.2 pivot_wider
As we will see in later examples, it is sometimes useful for data wrangling purposes to convert tidy data into wide data. We often use this as an intermediate step in tidying up data. The pivot_wider
function is basically the inverse of pivot_longer
. The first argument is for the data, but since we are using the pipe, we don’t show it. The names_from
argument tells pivot_wider
which variable will be used as the column names. The values_from
argument specifies which variable to use to fill out the cells.
new_wide_data <- new_tidy_data |>
pivot_wider(names_from = year, values_from = fertility)
select(new_wide_data, country, `1960`:`1967`)
#> # A tibble: 2 × 9
#> country `1960` `1961` `1962` `1963` `1964` `1965` `1966` `1967`
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Germany 2.41 2.44 2.47 2.49 2.49 2.48 2.44 2.37
#> 2 South Korea 6.16 5.99 5.79 5.57 5.36 5.16 4.99 4.85
Similar to pivot_wider
, names_from
and values_from
default to name
and value
.
11.3 Separating variables
The data wrangling shown above was simple compared to what is usually required. In our example spreadsheet files, we include an illustration that is slightly more complicated. It contains two variables: life expectancy and fertility. However, the way it is stored is not tidy and, as we will explain, not optimal.
path <- system.file("extdata", package = "dslabs")
filename <- "life-expectancy-and-fertility-two-countries-example.csv"
filename <- file.path(path, filename)
raw_dat <- read_csv(filename)
select(raw_dat, 1:5)
#> # A tibble: 2 × 5
#> country `1960_fertility` `1960_life_expectancy` `1961_fertility`
#> <chr> <dbl> <dbl> <dbl>
#> 1 Germany 2.41 69.3 2.44
#> 2 South Korea 6.16 53.0 5.99
#> # ℹ 1 more variable: `1961_life_expectancy` <dbl>
First, note that the data is in wide format. Second, notice that this table includes values for two variables, fertility and life expectancy, with the column name encoding which column represents which variable. Encoding information in the column names is not recommended but, unfortunately, it is quite common. We will put our wrangling skills to work to extract this information and store it in a tidy fashion.
We can start the data wrangling with the pivot_longer
function, but we should no longer use the column name year
for the new column since it also contains the variable type. We will call it name
, the default, for now:
dat <- raw_dat |> pivot_longer(-country)
head(dat)
#> # A tibble: 6 × 3
#> country name value
#> <chr> <chr> <dbl>
#> 1 Germany 1960_fertility 2.41
#> 2 Germany 1960_life_expectancy 69.3
#> 3 Germany 1961_fertility 2.44
#> 4 Germany 1961_life_expectancy 69.8
#> 5 Germany 1962_fertility 2.47
#> # ℹ 1 more row
The result is not exactly what we refer to as tidy since each observation is associated with two, not one, rows. We want to have the values from the two variables, fertility and life expectancy, in two separate columns. The first challenge to achieve this is to separate the name
column into the year and the variable type. Notice that the entries in this column separate the year from the variable name with an underscore:
dat$name[1:5]
#> [1] "1960_fertility" "1960_life_expectancy" "1961_fertility"
#> [4] "1961_life_expectancy" "1962_fertility"
Encoding multiple variables in a column name is such a common problem that the tidyr package includes function to separate these columns into two or more. The separate_wider_delim
function takes three arguments: the name of the column to be separated, the names to be used for the new columns, and the character that separates the variables. So, a first attempt at separating the variable name from the year might be:
dat |> separate_wider_delim(name, delim = "_",
names = c("year", "name"))
However, this line of code will give an error. This is because the life expectancy names have three strings separated by _
and the fertility names have two. This is a common problem so the separate_wider_delim
function has arguments too_few
and too_many
to handle these situations. We see in the help file that the option too_many = merge
will merge together any additional pieces. The following line does what we want:
dat |> separate_wider_delim(name, delim = "_",
names = c("year", "name"),
too_many = "merge")
#> # A tibble: 224 × 4
#> country year name value
#> <chr> <chr> <chr> <dbl>
#> 1 Germany 1960 fertility 2.41
#> 2 Germany 1960 life_expectancy 69.3
#> 3 Germany 1961 fertility 2.44
#> 4 Germany 1961 life_expectancy 69.8
#> 5 Germany 1962 fertility 2.47
#> # ℹ 219 more rows
But we are not done yet. We need to create a column for each variable and change year
to a number. As we learned, the pivot_wider
function can do this:
dat <- dat |>
separate_wider_delim(name, delim = "_",
names = c("year", "name"),
too_many = "merge") |>
pivot_wider() |>
mutate(year = as.integer(year))
dat
#> # A tibble: 112 × 4
#> country year fertility life_expectancy
#> <chr> <int> <dbl> <dbl>
#> 1 Germany 1960 2.41 69.3
#> 2 Germany 1961 2.44 69.8
#> 3 Germany 1962 2.47 70.0
#> 4 Germany 1963 2.49 70.1
#> 5 Germany 1964 2.49 70.7
#> # ℹ 107 more rows
The data is now in tidy format with one row for each observation with three variables: year, fertility, and life expectancy.
Three related function are separate_wider_position
, separate_wider_regex
, and unite
. separate_wider_position
takes a width instead of delimiter. separate_wider_regex
, described in Section 16.4.13, provides much more control over how we separate and what we keep. The unite
function can be thought of as the inverse of the separate
function: it combines two columns into one.
11.4 Reshaping with data.table
In general, everything you can do with tidyverse can be done with data.table and base R which, although perhaps harder to read, it is often more flexible, faster, and more efficient. Here we show how the data.table approach to pivot_longer
, pivot_wider
, and separate
. We will illustrate with the previously used this example:
path <- system.file("extdata", package = "dslabs")
filename <- file.path(path, "fertility-two-countries-example.csv")
11.4.1 pivot_longer
is melt
If in tidyverse we write
wide_data <- read_csv(filename)
new_tidy_data <- wide_data |>
pivot_longer(-1, names_to = "year", values_to = "fertility")
in data.table we use the melt
function.
library(data.table)
dt_wide_data <- fread(filename)
dt_new_tidy_data <- melt(dt_wide_data,
measure.vars = 2:ncol(dt_wide_data),
variable.name = "year",
value.name = "fertility")
11.4.2 pivot_wider
is dcast
If in tidyverse we write
new_wide_data <- new_tidy_data |>
pivot_wider(names_from = year, values_from = fertility)
in data.table we use the dcast
function.
dt_new_wide_data <- dcast(dt_new_tidy_data, formula = ... ~ year,
value.var = "fertility")
11.4.3 Separating variables
We now illustrate with this previously used example:
path <- system.file("extdata", package = "dslabs")
filename <- "life-expectancy-and-fertility-two-countries-example.csv"
filename <- file.path(path, filename)
In tidyverse we wrangled using
raw_dat <- read_csv(filename)
dat <- raw_dat |> pivot_longer(-country) |>
separate_wider_delim(name, delim = "_", names = c("year", "name"),
too_many = "merge") |>
pivot_wider() |>
mutate(year = as.integer(year))
In data.table we can use the tstrsplit
function:
dt_raw_dat <- fread(filename)
dat_long <- melt(dt_raw_dat,
measure.vars = which(names(dt_raw_dat) != "country"),
variable.name = "name", value.name = "value")
dat_long[, c("year", "name", "name2") :=
tstrsplit(name, "_", fixed = TRUE, type.convert = TRUE)]
dat_long[is.na(name2), name2 := ""]
dat_long[, name := paste(name, name2, sep = "_")][, name2 := NULL]
dat_wide <- dcast(dat_long, country + year ~ name, value.var = "value")
11.5 The janitor package
The janitor package includes function for some of the most common steps needed to wrangle data. These are particularly useful as these tasks that are often repetitive and time-consuming. Key features include functions for examining and cleaning column names, removing empty or duplicate rows, and converting data types. It also offers capabilities to generate frequency tables and perform cross tabulations with ease. The package is designed to work seamlessly with the tidyverse. Here we show four examples.
Spreadsheets often use names that are not compatible with programming. The most common problem is column names with spaces. The clean_names()
function attempts to fix this and other common problems. By default it forces variable names to be lower case and with underscore instead of space. In this example we change the variable names of the object dat
created in the previous section and then demonstrate how this function works:
Another very common challenging reality is that numeric matrices are saved in spreadsheets and include a column with characters defining the row names. To fix this we have to remove the first column, but only after assigning them as vector that we will use to define rownames after converting the data frame to a matrix. The function column_to_rows
does these operations for us and all we have to do is specify which column contains the rownames:
data.frame(ids = letters[1:3], x = 1:3, y = 4:6) |>
column_to_rownames("ids") |>
as.matrix()
#> x y
#> a 1 4
#> b 2 5
#> c 3 6
Another common challenge is that spreadsheets include the column names as a first row. To quickly fix this we can row_to_names
:
x <- read.csv(file.path(path, "murders.csv"), header = FALSE) |>
row_to_names(1)
names(x)
#> [1] "state" "abb" "region" "population" "total"
Our final example relates to finding duplicates. A very common error in the creation of spreadsheets is that rows are duplicated. The get_dups
function finds and reports duplicate records. By default it considers all variables, but you can also specify which ones to use.
11.6 Exercises
1. Run the following command to define the co2_wide
object:
co2_wide <- data.frame(matrix(co2, ncol = 12, byrow = TRUE)) |>
setNames(1:12) |>
mutate(year = as.character(1959:1997))
Use the pivot_longer
function to wrangle this into a tidy dataset. Call the column with the CO2 measurements co2
and call the month column month
. Call the resulting object co2_tidy
.
2. Plot CO2 versus month with a different curve for each year using this code:
If the expected plot is not made, it is probably because co2_tidy$month
is not numeric:
class(co2_tidy$month)
Rewrite your code to make sure the month column is numeric. Then make the plot.
3. What do we learn from this plot?
- CO2 measures increase monotonically from 1959 to 1997.
- CO2 measures are higher in the summer and the yearly average increased from 1959 to 1997.
- CO2 measures appear constant and random variability explains the differences.
- CO2 measures do not have a seasonal trend.
4. Now load the admissions
data set, which contains admission information for men and women across six majors and keep only the admitted percentage column:
If we think of an observation as a major, and that each observation has two variables (men admitted percentage and women admitted percentage) then this is not tidy. Use the pivot_wider
function to wrangle into tidy shape: one row for each major.
5. Now we will try a more advanced wrangling challenge. We want to wrangle the admissions data so that for each major we have 4 observations: admitted_men
, admitted_women
, applicants_men
and applicants_women
. The trick we perform here is actually quite common: first use pivot_longer
to generate an intermediate data frame and then pivot_wider
to obtain the tidy data we want. We will go step by step in this and the next two exercises.
Use the pivot_longer
function to create a tmp
data frame with a column containing the type of observation: admitted
or applicants
. Call the new columns name
and value
.
6. Now you have an object tmp
with columns major
, gender
, name
and value
. Note that if you combine the name
and gender
, we get the column names we want: admitted_men
, admitted_women
, applicants_men
and applicants_women
. Use the function unite
to create a new column called column_name
.
7. Now use the pivot_wider
function to generate the tidy data with four variables for each major.
8. Now use the pipe to write a line of code that turns admissions
to the table produced in the previous exercise.