16 Data Wrangling in tidyr
In this chapter we will learn about reshaping data to the format most suitable for our data analysis work. To reshape the data, we will use tidyr
package23 which is part of core tidyverse
and can be loaded either by calling library(tidyr)
or library(tidyverse)
.
Prerequisites
16.1 Concepts of tidy data
Hadley Wickham, the chief scientist behind development of RStudio, tidyverse, and much more, introduced the concept of tidy data in a paper24 published in the Journal of Statistical Software.25 Tidy data is a framework to structure data sets so they can be easily analyzed and visualized. It can be thought of as a goal one should aim for when cleaning data. Once we understand what tidy data is, that knowledge will make our data analysis, visualization, and collection much easier.26
A tidy data-set has the following properties:
- Each variable forms a column.
- Each observation forms a row.
- Each type of observational unit forms a table.
Diagrammatically27 this can be represented as in figure 16.1.

Figure 16.1: Diagrammatic representation of tidy data
Once a dataset is tidy, it can be used as input into a variety of other functions that may transform, model, or visualize the data.
Consider these five examples28. All these examples29 represent same data but shown in different formats-
#Example-1
tidyr::table1
## # A tibble: 6 × 4
## country year cases population
## <chr> <dbl> <dbl> <dbl>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
# Example-2
tidyr::table2
## # A tibble: 12 × 4
## country year type count
## <chr> <dbl> <chr> <dbl>
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## 4 Afghanistan 2000 population 20595360
## 5 Brazil 1999 cases 37737
## 6 Brazil 1999 population 172006362
## 7 Brazil 2000 cases 80488
## 8 Brazil 2000 population 174504898
## 9 China 1999 cases 212258
## 10 China 1999 population 1272915272
## 11 China 2000 cases 213766
## 12 China 2000 population 1280428583
# Example-3
tidyr::table3
## # A tibble: 6 × 3
## country year rate
## <chr> <dbl> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## 5 China 1999 212258/1272915272
## 6 China 2000 213766/1280428583
# Example-4 (Same data in 2 data tables now)
tidyr::table4a
## # A tibble: 3 × 3
## country `1999` `2000`
## <chr> <dbl> <dbl>
## 1 Afghanistan 745 2666
## 2 Brazil 37737 80488
## 3 China 212258 213766
tidyr::table4b
## # A tibble: 3 × 3
## country `1999` `2000`
## <chr> <dbl> <dbl>
## 1 Afghanistan 19987071 20595360
## 2 Brazil 172006362 174504898
## 3 China 1272915272 1280428583
# Example-5
tidyr::table5
## # A tibble: 6 × 4
## country century year rate
## <chr> <chr> <chr> <chr>
## 1 Afghanistan 19 99 745/19987071
## 2 Afghanistan 20 00 2666/20595360
## 3 Brazil 19 99 37737/172006362
## 4 Brazil 20 00 80488/174504898
## 5 China 19 99 212258/1272915272
## 6 China 20 00 213766/1280428583
Let us discuss these example one by one -
-
table1
fulfills all three rules stated above and is thus, in tidy format. Notice that every observation has its own row, and each variable is stored in a separate column. -
table2
stores one observation in two columns (separately for cases and population) and is thus not tidy. -
table3
stores two variables in one column (cases and population together) and is thus not tidy. -
table4a
andtable4b
clearly stores one observation in two different tables and is thus not tidy. We may further notice that both these tables use values as column headers, which also violate rule no.3 stated above. -
table5
again stores one variable i.e. year in two separate columns, and thus does not follow rule no.2 stated above.
Reshaping data
In real world problems we will mostly come across data-sets that are not in tidy formats and for data analysis, visualisation we will need tidying
the datasets. To do this we will first need to understand what actually is a value, a variable/field and an observation. As a second step of tidying we will require to reshape the data by either -
- re-organising the observation originally spread into multiple rows (e.g.
table2
), in one row; OR - re-organising the variable spread into multiple columns (e.g.
table3
, etc.), in one single column.
To perform this tidying exercise, we will need two most important functions from tidyr
i.e. pivot_longer
and pivot_wider
. So let us understand the functioning of these.
16.2 LONGER format through function pivot_longer()
Often we will come across data sets that will have values (instead of actual variable name) as column headers. Let us take the example of table4a
or table4b
shown above. Both these tables have values of variable year
as column names. So we need to re-structure these tables into a longer format where these values form part of columns instead of column names. We will use pivot_longer()
function for this. Th basic syntax is-
pivot_longer(
data,
cols,
names_to = "name",
values_to = "value",
...
)
Note that there many more useful arguments to this function, but first let us consider on these only.
-
cols
indicate names of columns (as a character vector) to be converted into longer format -
names_to = "name"
argument will actually convert values used as column headers back to a column with given “name” -
values_to = "value"
argument will convert values of all those columns back into one column with given name “value” (e.g. population in table4b).
This can be understood through the gif (animated) image 16.2.

Figure 16.2: Data Reshaping to longer format
Basic functionality can be understood using the following example-
iris_summary <- iris %>%
group_by(Species) %>%
summarise(mean = mean(Sepal.Width),
st_dev = sd(Sepal.Width))
iris_summary
## # A tibble: 3 × 3
## Species mean st_dev
## <fct> <dbl> <dbl>
## 1 setosa 3.43 0.379
## 2 versicolor 2.77 0.314
## 3 virginica 2.97 0.322
Using pivot_wider
we can convert column headers into values. Check the diagram in figure 16.3.

Figure 16.3: Diagrammatic representation of pivot_longer
Now we are ready to convert table4a
into a tidier format.
Case-I when values are in column headers
pivot_longer(
table4a,
cols = c('1999', '2000'),
names_to = 'year',
values_to = 'cases')
## # A tibble: 6 × 3
## country year cases
## <chr> <chr> <dbl>
## 1 Afghanistan 1999 745
## 2 Afghanistan 2000 2666
## 3 Brazil 1999 37737
## 4 Brazil 2000 80488
## 5 China 1999 212258
## 6 China 2000 213766
With pipes and a little tweaking, the above syntax could have been written as-
tidyr::table4a %>% # first argument passed through pipe
pivot_longer(cols = -country, # all columns except country
names_to = "year",
values_to = "cases")
Case-II when both variables and variable names are combined together as column names
We have seen a simple case to tidy the table when the values (e.g. years) were depicted as column names instead of variables i.e. actual data. There may be cases when column names are a combination of both.
Example - Say we have a table table6
as
## # A tibble: 3 × 5
## country cases_1999 cases_2000 pop_1999 pop_2000
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanistan 745 2666 19987071 20595360
## 2 Brazil 37737 80488 172006362 174504898
## 3 China 212258 213766 1272915272 1280428583
We may use names_sep
argument in this case, which will separate the combined variables from the column names -
table6 %>%
pivot_longer(cols = !country,
names_sep = "_",
names_to = c("count_type", "year"),
values_to = "count")
## # A tibble: 12 × 4
## country count_type year count
## <chr> <chr> <chr> <dbl>
## 1 Afghanistan cases 1999 745
## 2 Afghanistan cases 2000 2666
## 3 Afghanistan pop 1999 19987071
## 4 Afghanistan pop 2000 20595360
## 5 Brazil cases 1999 37737
## 6 Brazil cases 2000 80488
## 7 Brazil pop 1999 172006362
## 8 Brazil pop 2000 174504898
## 9 China cases 1999 212258
## 10 China cases 2000 213766
## 11 China pop 1999 1272915272
## 12 China pop 2000 1280428583
Note that we have two column names in argument
names_to
.
Though the above table is still not in tidy format, yet the example was taken to show the functioning of other arguments of the pivot_longer
. We provided two static column names to the related argument and the variables were created after splitting the headers with sep _
. We actually require one dynamic value to be retained as column name (cases
and pop
here) but need to convert year
to variables.
To do so we will use special value ".value"
in the related argument. Understanding the special value becomes clear with the animated figure 16.4.

Figure 16.4: When headers contain both headers and data combined together
Example-
table6 %>%
pivot_longer(cols = !country,
names_sep = "_",
names_to = c(".value", "year"),
values_to = "count")
## # A tibble: 6 × 4
## country year cases pop
## <chr> <chr> <dbl> <dbl>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
Note that by using
.value
the argumentvalues_to
becomes meaning less.
16.3 WIDER format through pivot_wider()
As the name suggests, pivot_wider()
does exactly opposite to what a pivot_longer
does. Additionally, this function is used to create summary reports, as pivot functionality in MS Excel, through values_fn
argument. Diagrammatically this can be represented as in figure 16.6. The basic syntax (with commonly used arguments) is-
pivot_wider(
data,
id_cols = NULL,
names_from = name,
values_from = value,
values_fill = NULL,
values_fn = NULL,
...
)
where-
-
id_cols
is a vector of columns that uniquely identifies each observation -
names_from
is a vector of columns to get the name of the output column -
values_from
similarly provides columns to get the cell values from -
values_fill
provides what each value should be filled in with when missing -
values_fn
is a named list - to apply different aggregations to differentvalues_from
columns
Basic functionality of this function can be understood with the following animated figure 16.5.Also note that there are many other arguments for this function, which may be used to deal with complicated tables.

Figure 16.5: Basic functionality of Pivot wider
A diagrammtic representation of the mechanics behind the function is also shown in Figure 16.6.

Figure 16.6: Diagrammatic representation of pivot_wider
Example-1:
tidyr::table2 %>%
pivot_wider(names_from = "type",
values_from = "count")
## # A tibble: 6 × 4
## country year cases population
## <chr> <dbl> <dbl> <dbl>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
Example-2:
tidyr::table2 %>%
pivot_wider(names_from = year,
values_from = count)
## # A tibble: 6 × 4
## country type `1999` `2000`
## <chr> <chr> <dbl> <dbl>
## 1 Afghanistan cases 745 2666
## 2 Afghanistan population 19987071 20595360
## 3 Brazil cases 37737 80488
## 4 Brazil population 172006362 174504898
## 5 China cases 212258 213766
## 6 China population 1272915272 1280428583
Example-3: Summarisation
tidyr::table2 %>%
pivot_wider(id_cols = country,
names_from = type,
values_from = count,
values_fn = mean)
## # A tibble: 3 × 3
## country cases population
## <chr> <dbl> <dbl>
## 1 Afghanistan 1706. 20291216.
## 2 Brazil 59112. 173255630
## 3 China 213012 1276671928.
Example-4: Summarisation with different id_cols
tidyr::table2 %>%
pivot_wider(id_cols = year,
names_from = type,
values_from = count,
values_fn = sum)
## # A tibble: 2 × 3
## year cases population
## <dbl> <dbl> <dbl>
## 1 1999 250740 1464908705
## 2 2000 296920 1475528841
Example-5: Use of multiple columns in names_from
argument
tidyr::table2 %>%
pivot_wider(names_from = c(year, type),
values_from = count)
## # A tibble: 3 × 5
## country `1999_cases` `1999_population` `2000_cases` `2000_population`
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanistan 745 19987071 2666 20595360
## 2 Brazil 37737 172006362 80488 174504898
## 3 China 212258 1272915272 213766 1280428583
What if order is reversed in names_from
arg. See Example-6:
tidyr::table2 %>%
pivot_wider(names_from = c(type, year),
values_from = count)
## # A tibble: 3 × 5
## country cases_1999 population_1999 cases_2000 population_2000
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanistan 745 19987071 2666 20595360
## 2 Brazil 37737 172006362 80488 174504898
## 3 China 212258 1272915272 213766 1280428583
Example-7: Multiple columns in values_from
tidyr::table1 %>%
pivot_wider(names_from = year,
values_from = c(cases, population))
## # A tibble: 3 × 5
## country cases_1999 cases_2000 population_1999 population_2000
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanistan 745 2666 19987071 20595360
## 2 Brazil 37737 80488 172006362 174504898
## 3 China 212258 213766 1272915272 1280428583
Example-8: Use of names_vary argument to control the order of output columns
tidyr::table1 %>%
pivot_wider(names_from = year,
values_from = c(cases, population),
names_vary = "slowest")
## # A tibble: 3 × 5
## country cases_1999 population_1999 cases_2000 population_2000
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanistan 745 19987071 2666 20595360
## 2 Brazil 37737 172006362 80488 174504898
## 3 China 212258 1272915272 213766 1280428583
For more details please refer to package vignette or Chapter-12 of R for Data Science book.
16.4 Separate Column(s) into multiple columns/ Join columns into one column
Separate a character column into multiple with separate()
As the name suggests, separate()
function is used to separate a given character column into multiple columns either using a regular expression or a vector of character positions.
The syntax is -
separate(
data,
col,
into,
sep = "[^[:alnum:]]+",
remove = TRUE,
convert = FALSE,
extra = "warn",
fill = "warn",
...
)
Explanation of purpose of different arguments in above syntax -
-
data
is as usual name of the data frame -
col
is the name of the column which is required to be separated. -
into
should be a character vector, which usually should be equal length of maximum number of new columns which will be created out of such separation. (Refer examples nos. 1) -
sep
provides a separator value. (Refer Example -3 below). -
remove
ifFALSE
, the original column is not removed from the output. (Refer Example -3 below). -
convert
ifTRUE
, the component columns are converted to double/integer/logical/NA, if possible. This is useful if the component columns are integer, numeric or logical. (Refer Example-1 below). -
extra
argument is used to control when number of desired component columns are less than the maximum possible count. (Refer Example-4 below). -
fill
argument is on the other hand, useful when the number of components are different for each row. (Refer Example-2 below)
Example-1:
tidyr::table3 %>%
separate(rate, into = c("cases", "population"),
convert = TRUE) # optional - will convert the values
## # A tibble: 6 × 4
## country year cases population
## <chr> <dbl> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
Example-2:
data.frame(
x = c("a", "a+b", "c+d+e")
) %>%
separate(x,
into=c('X1', 'X2', 'X3'),
fill = "left")
## X1 X2 X3
## 1 <NA> <NA> a
## 2 <NA> a b
## 3 c d e
Example-3:
data.frame(
x = c("A$B", "C+D", "E-F")
) %>%
separate(x,
sep = "\\-|\\$",
into = c('X1', 'X2'),
remove = FALSE)
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [2].
## x X1 X2
## 1 A$B A B
## 2 C+D C+D <NA>
## 3 E-F E F
Example-4:
data.frame(
x = c("a", "a+b", "c+d+e")
) %>%
separate(x,
into=c('X1', 'X2'),
extra = "merge")
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [1].
## X1 X2
## 1 a <NA>
## 2 a b
## 3 c d+e
Unite multiple character columns into one using unite()
It complements separate
by uniting the columns into one. Its syntax is
unite(data,
col,
...,
sep = "_",
remove = TRUE,
na.rm = FALSE)
Explanation of arguments in the above syntax-
-
data
is as usual name of the data frame. -
col
should be the name of new column to be formed (should be a string), -
...
the names of columns to be united should be provided -
sep
is separator to be used for uniting -
remove
ifFALSE
, will not remove original component columns -
na.rm
ifTRUE
, the missing values will be removed beforehand.
Example-1a:
## # A tibble: 6 × 5
## country Year century year rate
## <chr> <chr> <chr> <chr> <chr>
## 1 Afghanistan 1999 19 99 745/19987071
## 2 Afghanistan 2000 20 00 2666/20595360
## 3 Brazil 1999 19 99 37737/172006362
## 4 Brazil 2000 20 00 80488/174504898
## 5 China 1999 19 99 212258/1272915272
## 6 China 2000 20 00 213766/1280428583
Example-1b: We may complete the tidying process in the next step
tidyr::table5 %>%
unite("Year",
c("century", "year"),
sep = "") %>%
separate(rate,
into = c("cases", "population"),
convert = TRUE)
## # A tibble: 6 × 4
## country Year cases population
## <chr> <chr> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
Example-
16.5 Separate row(s) into multiple rows
Split data into multiple rows with separate_rows()
This function is used to separate delimited values placed in one single cell/column into multiple rows (as against in rows using separate
).
See Example-
tidyr::table3 %>%
separate_rows(
rate,
sep = "/",
convert = TRUE
)
## # A tibble: 12 × 3
## country year rate
## <chr> <dbl> <int>
## 1 Afghanistan 1999 745
## 2 Afghanistan 1999 19987071
## 3 Afghanistan 2000 2666
## 4 Afghanistan 2000 20595360
## 5 Brazil 1999 37737
## 6 Brazil 1999 172006362
## 7 Brazil 2000 80488
## 8 Brazil 2000 174504898
## 9 China 1999 212258
## 10 China 1999 1272915272
## 11 China 2000 213766
## 12 China 2000 1280428583
To create type
we have to however, add an extra step to the above example-
tidyr::table3 %>%
separate_rows(
rate,
sep = "/",
convert = TRUE
) %>%
group_by(country, year) %>%
mutate(type = c("cases", "pop"))
## # A tibble: 12 × 4
## # Groups: country, year [6]
## country year rate type
## <chr> <dbl> <int> <chr>
## 1 Afghanistan 1999 745 cases
## 2 Afghanistan 1999 19987071 pop
## 3 Afghanistan 2000 2666 cases
## 4 Afghanistan 2000 20595360 pop
## 5 Brazil 1999 37737 cases
## 6 Brazil 1999 172006362 pop
## 7 Brazil 2000 80488 cases
## 8 Brazil 2000 174504898 pop
## 9 China 1999 212258 cases
## 10 China 1999 1272915272 pop
## 11 China 2000 213766 cases
## 12 China 2000 1280428583 pop
16.6 Expand table to handle missing rows/values
Sometimes, when we deal with missing data, we require to handle implicit missing values as well. Either we have to turn these values into explicit missing values or we have to fill appropriate values. In such cases, two functions namely complete
and fill
both from same package are extremely useful. Let’s learn these as well.
Turn implicit missing values to explicit using complete()
As th name suggests, this function is used to turn implicit missing values (invisible rows) to explicit missing values (visible rows with NA
).
As an example, let’s suppose fuel prices are revised randomly. Say, after revision on 1 January 2020
prices revise on 20 January 2020
. So we will have only 2 rows in data for say Janaury 2020
. Thus, there are 29 implicit missing values in the data.
The syntax is
complete(data,
...,
fill = list(),
explicit = TRUE)
Where -
-
data
is as usual argument to provide data frame -
...
are meant to provide column names to be completed -
fill
provides a list to supply a single value which can be provided instead ofNA
for missing combinations.
In the example mentioned above we can proceed as
# First create a sample data
df <- data.frame(
date = c(as.Date("2020-01-01"), as.Date("2020-01-20")),
price = c(75.12, 78.32)
)
df
## date price
## 1 2020-01-01 75.12
## 2 2020-01-20 78.32
# Use tidyr::complete to see explicit missing values
df %>%
complete(date = seq.Date(as.Date("2020-01-01"), as.Date("2020-01-31"), by = "day"))
## # A tibble: 31 × 2
## date price
## <date> <dbl>
## 1 2020-01-01 75.1
## 2 2020-01-02 NA
## 3 2020-01-03 NA
## 4 2020-01-04 NA
## 5 2020-01-05 NA
## 6 2020-01-06 NA
## 7 2020-01-07 NA
## 8 2020-01-08 NA
## 9 2020-01-09 NA
## 10 2020-01-10 NA
## # ℹ 21 more rows
Though the above example created dates as per the criteria given, complete
function can find all unique combinations in the set of columns provided and return complete set of observations. See this example
#Let's create a sample data
set.seed(123)
df2 <- data.frame(
year = c(2020, 2020, 2020, 2021, 2021),
qtr = c(1,3,4,2,3),
sales = runif(5, 100, 200)
)
df2
## year qtr sales
## 1 2020 1 128.7578
## 2 2020 3 178.8305
## 3 2020 4 140.8977
## 4 2021 2 188.3017
## 5 2021 3 194.0467
# use complete to find all combination
df2 %>%
complete(year, qtr, # cols provided
fill = list(sales = 0))
## # A tibble: 8 × 3
## year qtr sales
## <dbl> <dbl> <dbl>
## 1 2020 1 129.
## 2 2020 2 0
## 3 2020 3 179.
## 4 2020 4 141.
## 5 2021 1 0
## 6 2021 2 188.
## 7 2021 3 194.
## 8 2021 4 0
Note: If fill
argument would not have been used, the value of sales in missing columns would have been NA
instead of provided value.
Fill missing values based on criteria fill()
This function helps in filling the missing values using previous or next entry. Think of a paper sheet where many cells in a table have been filled as "-do-"
. Syntax is -
fill(data, # used to provide data
..., # provide columns to be filled
.direction = c("down", "up", "downup", "updown"))
The arguments are pretty simple. Most of the time "down"
method is used. As an example we could see the example of fuel prices mentioned above.
fill_df <- df %>%
complete(date = seq.Date(as.Date("2020-01-01"), as.Date("2020-01-31"), by = "day")) %>%
fill(price, .direction = "down")
head(fill_df)
## # A tibble: 6 × 2
## date price
## <date> <dbl>
## 1 2020-01-01 75.1
## 2 2020-01-02 75.1
## 3 2020-01-03 75.1
## 4 2020-01-04 75.1
## 5 2020-01-05 75.1
## 6 2020-01-06 75.1
tail(fill_df)
## # A tibble: 6 × 2
## date price
## <date> <dbl>
## 1 2020-01-26 78.3
## 2 2020-01-27 78.3
## 3 2020-01-28 78.3
## 4 2020-01-29 78.3
## 5 2020-01-30 78.3
## 6 2020-01-31 78.3