16 Data Wrangling in tidyr

In this chapter we will learn about reshaping data to the format most suitable for our data analysis work. To reshape the data, we will use tidyr package23 which is part of core tidyverse and can be loaded either by calling library(tidyr) or library(tidyverse).

Prerequisites

16.1 Concepts of tidy data

Hadley Wickham, the chief scientist behind development of RStudio, tidyverse, and much more, introduced the concept of tidy data in a paper24 published in the Journal of Statistical Software.25 Tidy data is a framework to structure data sets so they can be easily analyzed and visualized. It can be thought of as a goal one should aim for when cleaning data. Once we understand what tidy data is, that knowledge will make our data analysis, visualization, and collection much easier.26

A tidy data-set has the following properties:

  1. Each variable forms a column.
  2. Each observation forms a row.
  3. Each type of observational unit forms a table.

Diagrammatically27 this can be represented as in figure 16.1.

Diagrammatic representation of tidy data

Figure 16.1: Diagrammatic representation of tidy data

Once a dataset is tidy, it can be used as input into a variety of other functions that may transform, model, or visualize the data.

Consider these five examples28. All these examples29 represent same data but shown in different formats-

#Example-1
tidyr::table1
## # A tibble: 6 × 4
##   country      year  cases population
##   <chr>       <dbl>  <dbl>      <dbl>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583
# Example-2
tidyr::table2
## # A tibble: 12 × 4
##    country      year type            count
##    <chr>       <dbl> <chr>           <dbl>
##  1 Afghanistan  1999 cases             745
##  2 Afghanistan  1999 population   19987071
##  3 Afghanistan  2000 cases            2666
##  4 Afghanistan  2000 population   20595360
##  5 Brazil       1999 cases           37737
##  6 Brazil       1999 population  172006362
##  7 Brazil       2000 cases           80488
##  8 Brazil       2000 population  174504898
##  9 China        1999 cases          212258
## 10 China        1999 population 1272915272
## 11 China        2000 cases          213766
## 12 China        2000 population 1280428583
# Example-3
tidyr::table3
## # A tibble: 6 × 3
##   country      year rate             
##   <chr>       <dbl> <chr>            
## 1 Afghanistan  1999 745/19987071     
## 2 Afghanistan  2000 2666/20595360    
## 3 Brazil       1999 37737/172006362  
## 4 Brazil       2000 80488/174504898  
## 5 China        1999 212258/1272915272
## 6 China        2000 213766/1280428583
# Example-4 (Same data in 2 data tables now)
tidyr::table4a
## # A tibble: 3 × 3
##   country     `1999` `2000`
##   <chr>        <dbl>  <dbl>
## 1 Afghanistan    745   2666
## 2 Brazil       37737  80488
## 3 China       212258 213766
tidyr::table4b
## # A tibble: 3 × 3
##   country         `1999`     `2000`
##   <chr>            <dbl>      <dbl>
## 1 Afghanistan   19987071   20595360
## 2 Brazil       172006362  174504898
## 3 China       1272915272 1280428583
# Example-5
tidyr::table5
## # A tibble: 6 × 4
##   country     century year  rate             
##   <chr>       <chr>   <chr> <chr>            
## 1 Afghanistan 19      99    745/19987071     
## 2 Afghanistan 20      00    2666/20595360    
## 3 Brazil      19      99    37737/172006362  
## 4 Brazil      20      00    80488/174504898  
## 5 China       19      99    212258/1272915272
## 6 China       20      00    213766/1280428583

Let us discuss these example one by one -

  • table1 fulfills all three rules stated above and is thus, in tidy format. Notice that every observation has its own row, and each variable is stored in a separate column.
  • table2 stores one observation in two columns (separately for cases and population) and is thus not tidy.
  • table3 stores two variables in one column (cases and population together) and is thus not tidy.
  • table4a and table4b clearly stores one observation in two different tables and is thus not tidy. We may further notice that both these tables use values as column headers, which also violate rule no.3 stated above.
  • table5 again stores one variable i.e. year in two separate columns, and thus does not follow rule no.2 stated above.
Reshaping data

In real world problems we will mostly come across data-sets that are not in tidy formats and for data analysis, visualisation we will need tidying the datasets. To do this we will first need to understand what actually is a value, a variable/field and an observation. As a second step of tidying we will require to reshape the data by either -

  • re-organising the observation originally spread into multiple rows (e.g. table2), in one row; OR
  • re-organising the variable spread into multiple columns (e.g. table3, etc.), in one single column.

To perform this tidying exercise, we will need two most important functions from tidyr i.e. pivot_longer and pivot_wider. So let us understand the functioning of these.

16.2 LONGER format through function pivot_longer()

Often we will come across data sets that will have values (instead of actual variable name) as column headers. Let us take the example of table4a or table4b shown above. Both these tables have values of variable year as column names. So we need to re-structure these tables into a longer format where these values form part of columns instead of column names. We will use pivot_longer() function for this. Th basic syntax is-

pivot_longer(
  data,
  cols,
  names_to = "name",
  values_to = "value",
  ...
)

Note that there many more useful arguments to this function, but first let us consider on these only.

  • cols indicate names of columns (as a character vector) to be converted into longer format
  • names_to = "name" argument will actually convert values used as column headers back to a column with given “name”
  • values_to = "value" argument will convert values of all those columns back into one column with given name “value” (e.g. population in table4b).

This can be understood through the gif (animated) image 16.2.

Data Reshaping to longer format

Figure 16.2: Data Reshaping to longer format

Basic functionality can be understood using the following example-

iris_summary <- iris %>% 
  group_by(Species) %>% 
  summarise(mean = mean(Sepal.Width),
            st_dev = sd(Sepal.Width))
iris_summary
## # A tibble: 3 × 3
##   Species     mean st_dev
##   <fct>      <dbl>  <dbl>
## 1 setosa      3.43  0.379
## 2 versicolor  2.77  0.314
## 3 virginica   2.97  0.322

Using pivot_wider we can convert column headers into values. Check the diagram in figure 16.3.

Diagrammatic representation of pivot\_longer

Figure 16.3: Diagrammatic representation of pivot_longer

Now we are ready to convert table4a into a tidier format.

Case-I when values are in column headers

pivot_longer(
  table4a, 
  cols = c('1999', '2000'), 
  names_to = 'year', 
  values_to = 'cases')
## # A tibble: 6 × 3
##   country     year   cases
##   <chr>       <chr>  <dbl>
## 1 Afghanistan 1999     745
## 2 Afghanistan 2000    2666
## 3 Brazil      1999   37737
## 4 Brazil      2000   80488
## 5 China       1999  212258
## 6 China       2000  213766

With pipes and a little tweaking, the above syntax could have been written as-

tidyr::table4a %>%                         # first argument passed through pipe 
  pivot_longer(cols = -country,     # all columns except country
               names_to = "year",
               values_to = "cases")

Case-II when both variables and variable names are combined together as column names

We have seen a simple case to tidy the table when the values (e.g. years) were depicted as column names instead of variables i.e. actual data. There may be cases when column names are a combination of both.

Example - Say we have a table table6 as

## # A tibble: 3 × 5
##   country     cases_1999 cases_2000   pop_1999   pop_2000
##   <chr>            <dbl>      <dbl>      <dbl>      <dbl>
## 1 Afghanistan        745       2666   19987071   20595360
## 2 Brazil           37737      80488  172006362  174504898
## 3 China           212258     213766 1272915272 1280428583

We may use names_sep argument in this case, which will separate the combined variables from the column names -

table6 %>% 
  pivot_longer(cols = !country,
               names_sep = "_",
               names_to = c("count_type", "year"),
               values_to = "count")
## # A tibble: 12 × 4
##    country     count_type year       count
##    <chr>       <chr>      <chr>      <dbl>
##  1 Afghanistan cases      1999         745
##  2 Afghanistan cases      2000        2666
##  3 Afghanistan pop        1999    19987071
##  4 Afghanistan pop        2000    20595360
##  5 Brazil      cases      1999       37737
##  6 Brazil      cases      2000       80488
##  7 Brazil      pop        1999   172006362
##  8 Brazil      pop        2000   174504898
##  9 China       cases      1999      212258
## 10 China       cases      2000      213766
## 11 China       pop        1999  1272915272
## 12 China       pop        2000  1280428583

Note that we have two column names in argument names_to.

Though the above table is still not in tidy format, yet the example was taken to show the functioning of other arguments of the pivot_longer. We provided two static column names to the related argument and the variables were created after splitting the headers with sep _. We actually require one dynamic value to be retained as column name (cases and pop here) but need to convert year to variables.

To do so we will use special value ".value" in the related argument. Understanding the special value becomes clear with the animated figure 16.4.

When headers contain both headers and data combined together

Figure 16.4: When headers contain both headers and data combined together

Example-

table6 %>% 
  pivot_longer(cols = !country,
               names_sep = "_",
               names_to = c(".value", "year"),
               values_to = "count")
## # A tibble: 6 × 4
##   country     year   cases        pop
##   <chr>       <chr>  <dbl>      <dbl>
## 1 Afghanistan 1999     745   19987071
## 2 Afghanistan 2000    2666   20595360
## 3 Brazil      1999   37737  172006362
## 4 Brazil      2000   80488  174504898
## 5 China       1999  212258 1272915272
## 6 China       2000  213766 1280428583

Note that by using .value the argument values_to becomes meaning less.

16.3 WIDER format through pivot_wider()

As the name suggests, pivot_wider() does exactly opposite to what a pivot_longer does. Additionally, this function is used to create summary reports, as pivot functionality in MS Excel, through values_fn argument. Diagrammatically this can be represented as in figure 16.6. The basic syntax (with commonly used arguments) is-

pivot_wider(
  data,
  id_cols = NULL,
  names_from = name,
  values_from = value,
  values_fill = NULL,
  values_fn = NULL,
  ...
)

where-

  • id_cols is a vector of columns that uniquely identifies each observation
  • names_from is a vector of columns to get the name of the output column
  • values_from similarly provides columns to get the cell values from
  • values_fill provides what each value should be filled in with when missing
  • values_fn is a named list - to apply different aggregations to different values_from columns

Also note that there are many other arguments for this function, which may be used to deal with complicated tables.

Basic functionality of this function can be understood with the following animated figure 16.5.
Basic functionality of Pivot wider

Figure 16.5: Basic functionality of Pivot wider

A diagrammtic representation of the mechanics behind the function is also shown in Figure 16.6.

Diagrammatic representation of pivot\_wider

Figure 16.6: Diagrammatic representation of pivot_wider

Example-1:

tidyr::table2 %>% 
  pivot_wider(names_from = "type",
              values_from = "count")
## # A tibble: 6 × 4
##   country      year  cases population
##   <chr>       <dbl>  <dbl>      <dbl>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

Example-2:

tidyr::table2 %>% 
  pivot_wider(names_from = year,
              values_from = count)
## # A tibble: 6 × 4
##   country     type           `1999`     `2000`
##   <chr>       <chr>           <dbl>      <dbl>
## 1 Afghanistan cases             745       2666
## 2 Afghanistan population   19987071   20595360
## 3 Brazil      cases           37737      80488
## 4 Brazil      population  172006362  174504898
## 5 China       cases          212258     213766
## 6 China       population 1272915272 1280428583

Example-3: Summarisation

tidyr::table2 %>% 
  pivot_wider(id_cols = country,
              names_from = type,
              values_from = count,
              values_fn = mean)
## # A tibble: 3 × 3
##   country       cases  population
##   <chr>         <dbl>       <dbl>
## 1 Afghanistan   1706.   20291216.
## 2 Brazil       59112.  173255630 
## 3 China       213012  1276671928.

Example-4: Summarisation with different id_cols

tidyr::table2 %>% 
  pivot_wider(id_cols = year,
              names_from = type,
              values_from = count,
              values_fn = sum)
## # A tibble: 2 × 3
##    year  cases population
##   <dbl>  <dbl>      <dbl>
## 1  1999 250740 1464908705
## 2  2000 296920 1475528841

Example-5: Use of multiple columns in names_from argument

tidyr::table2 %>% 
  pivot_wider(names_from = c(year, type),
               values_from = count)
## # A tibble: 3 × 5
##   country     `1999_cases` `1999_population` `2000_cases` `2000_population`
##   <chr>              <dbl>             <dbl>        <dbl>             <dbl>
## 1 Afghanistan          745          19987071         2666          20595360
## 2 Brazil             37737         172006362        80488         174504898
## 3 China             212258        1272915272       213766        1280428583

What if order is reversed in names_from arg. See Example-6:

tidyr::table2 %>% 
  pivot_wider(names_from = c(type, year),
               values_from = count)
## # A tibble: 3 × 5
##   country     cases_1999 population_1999 cases_2000 population_2000
##   <chr>            <dbl>           <dbl>      <dbl>           <dbl>
## 1 Afghanistan        745        19987071       2666        20595360
## 2 Brazil           37737       172006362      80488       174504898
## 3 China           212258      1272915272     213766      1280428583

Example-7: Multiple columns in values_from

tidyr::table1 %>% 
  pivot_wider(names_from = year,
              values_from = c(cases, population))
## # A tibble: 3 × 5
##   country     cases_1999 cases_2000 population_1999 population_2000
##   <chr>            <dbl>      <dbl>           <dbl>           <dbl>
## 1 Afghanistan        745       2666        19987071        20595360
## 2 Brazil           37737      80488       172006362       174504898
## 3 China           212258     213766      1272915272      1280428583

Example-8: Use of names_vary argument to control the order of output columns

tidyr::table1 %>% 
  pivot_wider(names_from = year,
              values_from = c(cases, population),
              names_vary = "slowest")
## # A tibble: 3 × 5
##   country     cases_1999 population_1999 cases_2000 population_2000
##   <chr>            <dbl>           <dbl>      <dbl>           <dbl>
## 1 Afghanistan        745        19987071       2666        20595360
## 2 Brazil           37737       172006362      80488       174504898
## 3 China           212258      1272915272     213766      1280428583

For more details please refer to package vignette or Chapter-12 of R for Data Science book.

16.4 Separate Column(s) into multiple columns/ Join columns into one column

Separate a character column into multiple with separate()

As the name suggests, separate() function is used to separate a given character column into multiple columns either using a regular expression or a vector of character positions. The syntax is -

separate(
  data,
  col,
  into,
  sep = "[^[:alnum:]]+",
  remove = TRUE,
  convert = FALSE,
  extra = "warn",
  fill = "warn",
  ...
)

Explanation of purpose of different arguments in above syntax -

  1. data is as usual name of the data frame
  2. col is the name of the column which is required to be separated.
  3. into should be a character vector, which usually should be equal length of maximum number of new columns which will be created out of such separation. (Refer examples nos. 1)
  4. sep provides a separator value. (Refer Example -3 below).
  5. remove if FALSE, the original column is not removed from the output. (Refer Example -3 below).
  6. convert if TRUE, the component columns are converted to double/integer/logical/NA, if possible. This is useful if the component columns are integer, numeric or logical. (Refer Example-1 below).
  7. extra argument is used to control when number of desired component columns are less than the maximum possible count. (Refer Example-4 below).
  8. fill argument is on the other hand, useful when the number of components are different for each row. (Refer Example-2 below)

Example-1:

tidyr::table3 %>% 
  separate(rate, into = c("cases", "population"),
           convert = TRUE) # optional - will convert the values
## # A tibble: 6 × 4
##   country      year  cases population
##   <chr>       <dbl>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

Example-2:

data.frame(
  x = c("a", "a+b", "c+d+e")
) %>% 
  separate(x,
           into=c('X1', 'X2', 'X3'),
           fill = "left")
##     X1   X2 X3
## 1 <NA> <NA>  a
## 2 <NA>    a  b
## 3    c    d  e

Example-3:

data.frame(
  x = c("A$B", "C+D", "E-F")
) %>% 
  separate(x, 
           sep = "\\-|\\$",
           into = c('X1', 'X2'),
           remove = FALSE)
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [2].
##     x  X1   X2
## 1 A$B   A    B
## 2 C+D C+D <NA>
## 3 E-F   E    F

Example-4:

data.frame(
  x = c("a", "a+b", "c+d+e")
) %>% 
  separate(x,
           into=c('X1', 'X2'),
           extra = "merge")
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [1].
##   X1   X2
## 1  a <NA>
## 2  a    b
## 3  c  d+e

Unite multiple character columns into one using unite()

It complements separate by uniting the columns into one. Its syntax is

unite(data, 
      col, 
      ..., 
      sep = "_", 
      remove = TRUE, 
      na.rm = FALSE)

Explanation of arguments in the above syntax-

  1. data is as usual name of the data frame.
  2. col should be the name of new column to be formed (should be a string),
  3. ... the names of columns to be united should be provided
  4. sep is separator to be used for uniting
  5. remove if FALSE, will not remove original component columns
  6. na.rm if TRUE, the missing values will be removed beforehand.

Example-1a:

tidyr::table5 %>% 
  unite("Year",
        c("century", "year"), 
        sep = "",
        remove = FALSE)
## # A tibble: 6 × 5
##   country     Year  century year  rate             
##   <chr>       <chr> <chr>   <chr> <chr>            
## 1 Afghanistan 1999  19      99    745/19987071     
## 2 Afghanistan 2000  20      00    2666/20595360    
## 3 Brazil      1999  19      99    37737/172006362  
## 4 Brazil      2000  20      00    80488/174504898  
## 5 China       1999  19      99    212258/1272915272
## 6 China       2000  20      00    213766/1280428583

Example-1b: We may complete the tidying process in the next step

tidyr::table5 %>% 
  unite("Year",
        c("century", "year"), 
        sep = "") %>% 
  separate(rate, 
           into = c("cases", "population"),
           convert = TRUE)
## # A tibble: 6 × 4
##   country     Year   cases population
##   <chr>       <chr>  <int>      <int>
## 1 Afghanistan 1999     745   19987071
## 2 Afghanistan 2000    2666   20595360
## 3 Brazil      1999   37737  172006362
## 4 Brazil      2000   80488  174504898
## 5 China       1999  212258 1272915272
## 6 China       2000  213766 1280428583

Example-

16.5 Separate row(s) into multiple rows

Split data into multiple rows with separate_rows()

This function is used to separate delimited values placed in one single cell/column into multiple rows (as against in rows using separate). See Example-

tidyr::table3 %>% 
  separate_rows(
    rate,
    sep = "/",
    convert = TRUE
  )
## # A tibble: 12 × 3
##    country      year       rate
##    <chr>       <dbl>      <int>
##  1 Afghanistan  1999        745
##  2 Afghanistan  1999   19987071
##  3 Afghanistan  2000       2666
##  4 Afghanistan  2000   20595360
##  5 Brazil       1999      37737
##  6 Brazil       1999  172006362
##  7 Brazil       2000      80488
##  8 Brazil       2000  174504898
##  9 China        1999     212258
## 10 China        1999 1272915272
## 11 China        2000     213766
## 12 China        2000 1280428583

To create type we have to however, add an extra step to the above example-

tidyr::table3 %>% 
  separate_rows(
    rate,
    sep = "/",
    convert = TRUE
  ) %>% 
  group_by(country, year) %>% 
  mutate(type = c("cases", "pop"))
## # A tibble: 12 × 4
## # Groups:   country, year [6]
##    country      year       rate type 
##    <chr>       <dbl>      <int> <chr>
##  1 Afghanistan  1999        745 cases
##  2 Afghanistan  1999   19987071 pop  
##  3 Afghanistan  2000       2666 cases
##  4 Afghanistan  2000   20595360 pop  
##  5 Brazil       1999      37737 cases
##  6 Brazil       1999  172006362 pop  
##  7 Brazil       2000      80488 cases
##  8 Brazil       2000  174504898 pop  
##  9 China        1999     212258 cases
## 10 China        1999 1272915272 pop  
## 11 China        2000     213766 cases
## 12 China        2000 1280428583 pop

16.6 Expand table to handle missing rows/values

Sometimes, when we deal with missing data, we require to handle implicit missing values as well. Either we have to turn these values into explicit missing values or we have to fill appropriate values. In such cases, two functions namely complete and fill both from same package are extremely useful. Let’s learn these as well.

Turn implicit missing values to explicit using complete()

As th name suggests, this function is used to turn implicit missing values (invisible rows) to explicit missing values (visible rows with NA).

As an example, let’s suppose fuel prices are revised randomly. Say, after revision on 1 January 2020 prices revise on 20 January 2020. So we will have only 2 rows in data for say Janaury 2020. Thus, there are 29 implicit missing values in the data.

The syntax is

complete(data, 
         ..., 
         fill = list(), 
         explicit = TRUE)

Where -

  • data is as usual argument to provide data frame
  • ... are meant to provide column names to be completed
  • fill provides a list to supply a single value which can be provided instead of NA for missing combinations.

In the example mentioned above we can proceed as

# First create a sample data
df <- data.frame(
  date = c(as.Date("2020-01-01"), as.Date("2020-01-20")),
  price = c(75.12, 78.32)
)
df 
##         date price
## 1 2020-01-01 75.12
## 2 2020-01-20 78.32
# Use tidyr::complete to see explicit missing values
df %>% 
  complete(date = seq.Date(as.Date("2020-01-01"), as.Date("2020-01-31"), by = "day"))
## # A tibble: 31 × 2
##    date       price
##    <date>     <dbl>
##  1 2020-01-01  75.1
##  2 2020-01-02  NA  
##  3 2020-01-03  NA  
##  4 2020-01-04  NA  
##  5 2020-01-05  NA  
##  6 2020-01-06  NA  
##  7 2020-01-07  NA  
##  8 2020-01-08  NA  
##  9 2020-01-09  NA  
## 10 2020-01-10  NA  
## # ℹ 21 more rows

Though the above example created dates as per the criteria given, complete function can find all unique combinations in the set of columns provided and return complete set of observations. See this example

#Let's create a sample data
set.seed(123)
df2 <- data.frame(
  year = c(2020, 2020, 2020, 2021, 2021),
  qtr = c(1,3,4,2,3),
  sales = runif(5, 100, 200)
)
df2
##   year qtr    sales
## 1 2020   1 128.7578
## 2 2020   3 178.8305
## 3 2020   4 140.8977
## 4 2021   2 188.3017
## 5 2021   3 194.0467
# use complete to find all combination
df2 %>% 
  complete(year, qtr, # cols provided
           fill = list(sales = 0))
## # A tibble: 8 × 3
##    year   qtr sales
##   <dbl> <dbl> <dbl>
## 1  2020     1  129.
## 2  2020     2    0 
## 3  2020     3  179.
## 4  2020     4  141.
## 5  2021     1    0 
## 6  2021     2  188.
## 7  2021     3  194.
## 8  2021     4    0

Note: If fill argument would not have been used, the value of sales in missing columns would have been NA instead of provided value.

Fill missing values based on criteria fill()

This function helps in filling the missing values using previous or next entry. Think of a paper sheet where many cells in a table have been filled as "-do-". Syntax is -

fill(data, # used to provide data
     ..., # provide columns to be filled
     .direction = c("down", "up", "downup", "updown"))

The arguments are pretty simple. Most of the time "down" method is used. As an example we could see the example of fuel prices mentioned above.

fill_df <- df %>% 
  complete(date = seq.Date(as.Date("2020-01-01"), as.Date("2020-01-31"), by = "day")) %>% 
  fill(price, .direction = "down")

head(fill_df)
## # A tibble: 6 × 2
##   date       price
##   <date>     <dbl>
## 1 2020-01-01  75.1
## 2 2020-01-02  75.1
## 3 2020-01-03  75.1
## 4 2020-01-04  75.1
## 5 2020-01-05  75.1
## 6 2020-01-06  75.1
tail(fill_df)
## # A tibble: 6 × 2
##   date       price
##   <date>     <dbl>
## 1 2020-01-26  78.3
## 2 2020-01-27  78.3
## 3 2020-01-28  78.3
## 4 2020-01-29  78.3
## 5 2020-01-30  78.3
## 6 2020-01-31  78.3