15 Combining Tables/tabular data

Figure 15.1: Most of times, joining two or more tables will be required to perform analytics
In real world scenarios, there may hardly be a case when we have to analyse one single table. There may be cases when we have to either join tables split into multiple smaller tables (e.g. we can have smaller tables split States-wise), or the tables may be divided into various smaller master and transaction tables (relational databases).
We may thus divide the data tables joining requirements into three broad categories-
- Simple joins or concatenation
- Relational Joins
- Filtering Joins
Let us discuss each of these with examples.
15.1 Simple joins/concatenation
Many times tables split into smaller tables have to be joined back before proceeding further for data analytics. We may have to join two or more tables either columnwise (e.g. some of the features for all rows have been split into a separate table) or row wise (e.g. all the fields/columns are split into smaller tables like a separate table for each State). Diagramatically these joins may be depicted as shown in figure 15.2.

Figure 15.2: Illustration of Simple joins/concatenation
15.1.1 Column binding
As we have already seen that data frames act like matrices in many ways except to that fact that these support heterogeneous data unlike matrices. We have also discussed the ways two matrices can be joined. Base R has two dedicated functions i.e. cbind()
and rbind()
for these operations.
In tidyverse (dplyr specifically) we have two similar functions bind_cols()
and bind_rows
respectively which provide us better functionality for these use cases. The syntax for finction bind_cols()
used for concatenating two or more tables column wise is -
bind_cols(
...,
.name_repair = c("unique", "universal", "check_unique")
)
Where -
-
...
represent data frames to be combined -
.name_repair
argument chooses method to rename duplicate column names, if any.
Example-1:
## New names:
## • `Species` -> `Species...3`
## • `Species` -> `Species...6`
## Sepal.Length Sepal.Width Species...3 Petal.Length Petal.Width Species...6
## 1 5.1 3.5 setosa 1.4 0.2 setosa
## 2 4.9 3.0 setosa 1.4 0.2 setosa
## 3 4.7 3.2 setosa 1.3 0.2 setosa
Note: The data frames to be merged should be row-consistent for column binding. Try this
bind_cols(iris[1:3, 1:2], iris[1:4, 3:4])
and see the results.
15.1.2 Row binding
The syntax used for appending rows of one or more tables together in one table is -
bind_rows(..., .id = NULL)
where -
-
...
represent data frames to be combined -
.id
argument creates a new column of identifiers.
To understand it better, let us see this example
setosa <- iris[1:3, 1:4]
versicolor <- iris[51:53, 1:4]
virginica <- iris[101:103, 1:4]
bind_rows(setosa, versicolor, virginica, .id = 'groups')
## groups Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 1 5.1 3.5 1.4 0.2
## 2 1 4.9 3.0 1.4 0.2
## 3 1 4.7 3.2 1.3 0.2
## 4 2 7.0 3.2 4.7 1.4
## 5 2 6.4 3.2 4.5 1.5
## 6 2 6.9 3.1 4.9 1.5
## 7 3 6.3 3.3 6.0 2.5
## 8 3 5.8 2.7 5.1 1.9
## 9 3 7.1 3.0 5.9 2.1
Note: In the above example if the requirement is to store data tables names into the new identifier column just list convert the databases into a list and
.id
will take element_names as the values in identifiers. Try thisbind_rows(list(setosa=setosa, versicolor=versicolor, virginica=virginica), .id = 'Species')
15.2 Relational joins
Relational Joins are usually needed to join multiple tables based on primary key and secondary keys. The joins may either be one-to-one key join or one-to-many key joins. Broadly these can either be inner joins or outer joins. Diagrammatically these may be represented as shown in figure 15.3 wherein it may be noticed that additional columns from right
data frame gets appended to left
data frame. For this reason, these four joins are also known as mutating joins.

Figure 15.3: Illustration of Mutating Joins in dplyr
The syntax of all these joins is nearly same-
*_join(x, y, by = NULL, copy = FALSE, suffix = c('.x', '.y'), ... , keep = FALSE)
where -
-
x
andy
are data frames to be joined -
by
is a character vector of column names to be joined by -
suffix
argument provides suffixes to be added to column names, if any of those are duplicate -
keep
argument decides whether the join keys from both x and y be preserved in the output?
Let’s discuss each of these joins individually.
15.2.1 Inner Joins
Inner Joins keeps only those rows where matching keys are present in both the data frames.
Example-1:
band_members
## # A tibble: 3 × 2
## name band
## <chr> <chr>
## 1 Mick Stones
## 2 John Beatles
## 3 Paul Beatles
band_instruments
## # A tibble: 3 × 2
## name plays
## <chr> <chr>
## 1 John guitar
## 2 Paul bass
## 3 Keith guitar
inner_join(band_members, band_instruments)
## Joining with `by = join_by(name)`
## # A tibble: 2 × 3
## name band plays
## <chr> <chr> <chr>
## 1 John Beatles guitar
## 2 Paul Beatles bass
15.2.2 Left Joins
Left Joins on the other hand preserves all rows of data frame passed as x
i.e. first argument irrespective of the fact that matching key record is available in second data table or not.
Example-1:
left_join(band_members, band_instruments)
## Joining with `by = join_by(name)`
## # A tibble: 3 × 3
## name band plays
## <chr> <chr> <chr>
## 1 Mick Stones <NA>
## 2 John Beatles guitar
## 3 Paul Beatles bass
15.2.3 Right Joins
Right join, is similar to left join and preserves all rows of data frame passed as y
i.e. second argument irrespective of the fact that matching key record is available in first data table or not.
Example-1:
right_join(band_members, band_instruments)
## Joining with `by = join_by(name)`
## # A tibble: 3 × 3
## name band plays
## <chr> <chr> <chr>
## 1 John Beatles guitar
## 2 Paul Beatles bass
## 3 Keith <NA> guitar
15.2.4 Full Joins
Full join returns all the rows of both the data tables despite non-availability of matching key in either of the tables.
Example-1:
full_join(band_members, band_instruments)
## Joining with `by = join_by(name)`
## # A tibble: 4 × 3
## name band plays
## <chr> <chr> <chr>
## 1 Mick Stones <NA>
## 2 John Beatles guitar
## 3 Paul Beatles bass
## 4 Keith <NA> guitar
You must have noticed that each of the examples shown above has thrown a warning that join has been performed on variable name
. We may override this warning by specifically providing the joining key column name(s) in by
argument i.e. by = "name"
.
There may be cases when the joining key column(s) in the two data frames are of different names. These cases can also be handled by using by
argument.
Example-
band_instruments2
## # A tibble: 3 × 2
## artist plays
## <chr> <chr>
## 1 John guitar
## 2 Paul bass
## 3 Keith guitar
## # A tibble: 3 × 3
## name band plays
## <chr> <chr> <chr>
## 1 Mick Stones <NA>
## 2 John Beatles guitar
## 3 Paul Beatles bass
Note: Each of the
*_join()
can be joined on multiple keys/columns (i.e. more than one) usingby
argument as explained above.
Many-to-many Joins
Example of four of the above-mentioned joins are many to many joins and thus need to be used carefully. See the following example
df1 <- data.frame(
x = c(1, 1, NA),
y = c(11, 12, 21)
)
df1
## x y
## 1 1 11
## 2 1 12
## 3 NA 21
df2 <- data.frame(
x = c(1, 1, NA, NA),
z = c(101, 102, 201, 202)
)
df2
## x z
## 1 1 101
## 2 1 102
## 3 NA 201
## 4 NA 202
## Warning in left_join(., df2, by = "x"): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 1 of `x` matches multiple rows in `y`.
## ℹ Row 1 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship = "many-to-many"` to silence this warning.
## x y z
## 1 1 11 101
## 2 1 11 102
## 3 1 12 101
## 4 1 12 102
## 5 NA 21 201
## 6 NA 21 202
In fact, we can use argument relationship
explicitly to silence this warning and avoid errors. This argument can take one of these values
-
NULL
(default) "one-to-one"
"one-to-many"
"many-to-one"
"many-to-many"
Example above re-coded
## x y z
## 1 1 11 101
## 2 1 11 102
## 3 1 12 101
## 4 1 12 102
## 5 NA 21 201
## 6 NA 21 202
15.3 Filtering Joins
The other joins available in dplyr
are basically filtering joins. These joins will not result in appending additional columns present in right
data frame to left
data frame. Only the common keys (columns) from right data frame will be used to filter left data frame.
15.3.1 Semi Joins
First of these is semi_join
which essentially filters those rows from a data frame, which are based on another data frame. See this example-
df1 <- data.frame(
x = c(1, 2, NA),
y = c(11, 21, 100)
)
df1
## x y
## 1 1 11
## 2 2 21
## 3 NA 100
df2 <- data.frame(
x = c(1, 3, 4, NA),
z = c(101, 301, 401, 501)
)
df2
## x z
## 1 1 101
## 2 3 301
## 3 4 401
## 4 NA 501
## x y
## 1 1 11
## 2 NA 100
15.4 Relational, but non-equi joins
Till now, we have learnt joins that use syntax of dplyr version before 1.1.0. However, in dplyr newest version, 1.1.0, released in January 2023, a new helper function join_by()
has been introduced. It actually, constructs a specification that describes how to join two tables using a small domain specific language. The result can be supplied as the by argument to any of the join functions that we have learnt above. For equal joins, we have learnt above, there are two slight but easier to implement changes if we use join_by()
-
- we may pass variable names in by argument instead of using characters as described above.
- instead of using
=
we have to use==
which is actually the equality operator as opposed to assignment operator we were using earlier.
So our syntax for left_join
example is
band_instruments2
## # A tibble: 3 × 2
## artist plays
## <chr> <chr>
## 1 John guitar
## 2 Paul bass
## 3 Keith guitar
## # A tibble: 3 × 3
## name band plays
## <chr> <chr> <chr>
## 1 Mick Stones <NA>
## 2 John Beatles guitar
## 3 Paul Beatles bass
Simple, isn’t it. But now using the potential of this new helper function, we can on the other hand join two (or more data frames) using inequality conditions.
15.4.1 Inequality Joins
Inequality joins match on an inequality, such as >
, >=
, <
, or <=
, and are common in time series analysis and genomics. To construct an inequality join using join_by()
, we have to supply two column names separated by one of the above mentioned inequalities. See the following example for better understanding of inequality joins.
Example- Suppose we have two tables, one containing product-wise sales
and another containing rates
as and when these are revised. So, in order to have total sales amount, we can use inequality join.
# sales
sales <- data.frame(
product.id = c("A", "A", "A", "A", "A", "B", "B"),
sales.date = as.Date(c("01-01-2022",
"02-02-2022","05-11-2022","05-04-2022",
"05-10-2022","01-02-2022","01-01-2023"), format = "%d-%m-%Y"),
quantity = c(5L, 10L, 15L, 5L, 15L, 1L, 2L)
)
# Print it
sales
## product.id sales.date quantity
## 1 A 2022-01-01 5
## 2 A 2022-02-02 10
## 3 A 2022-11-05 15
## 4 A 2022-04-05 5
## 5 A 2022-10-05 15
## 6 B 2022-02-01 1
## 7 B 2023-01-01 2
# rates
rates <- data.frame(
product.id = c("A", "A", "A", "B", "B"),
revision.date = as.Date(c("01-02-2022",
"03-04-2022","05-10-2022","01-01-2022",
"05-10-2022"), format = "%d-%m-%Y"),
rate = c(50L, 60L, 70L, 500L, 1500L)
)
#print it
rates
## product.id revision.date rate
## 1 A 2022-02-01 50
## 2 A 2022-04-03 60
## 3 A 2022-10-05 70
## 4 B 2022-01-01 500
## 5 B 2022-10-05 1500
# total sale(1st attempt)
sales %>%
left_join(rates,
by = join_by(product.id,
sales.date >= revision.date))
## product.id sales.date quantity revision.date rate
## 1 A 2022-01-01 5 <NA> NA
## 2 A 2022-02-02 10 2022-02-01 50
## 3 A 2022-11-05 15 2022-02-01 50
## 4 A 2022-11-05 15 2022-04-03 60
## 5 A 2022-11-05 15 2022-10-05 70
## 6 A 2022-04-05 5 2022-02-01 50
## 7 A 2022-04-05 5 2022-04-03 60
## 8 A 2022-10-05 15 2022-02-01 50
## 9 A 2022-10-05 15 2022-04-03 60
## 10 A 2022-10-05 15 2022-10-05 70
## 11 B 2022-02-01 1 2022-01-01 500
## 12 B 2023-01-01 2 2022-01-01 500
## 13 B 2023-01-01 2 2022-10-05 1500
We got the results, but the results/rates are obtained fixed on all earlier dates. To solve this issue, we have rolling joins. See next section.
15.4.2 Rolling Joins
Rolling joins are a variant of inequality joins that limit the results returned from an inequality join condition. They are useful for “rolling” the closest match forward/backwards when there isn’t an exact match. To construct a rolling join, we have to wrap inequality condition with closest()
.
Above example solved by wrapping date condition in closest()
## product.id sales.date quantity revision.date rate
## 1 A 2022-01-01 5 <NA> NA
## 2 A 2022-02-02 10 2022-02-01 50
## 3 A 2022-11-05 15 2022-10-05 70
## 4 A 2022-04-05 5 2022-04-03 60
## 5 A 2022-10-05 15 2022-10-05 70
## 6 B 2022-02-01 1 2022-01-01 500
## 7 B 2023-01-01 2 2022-10-05 1500
We may see that we do not have rates for product A
for sales made on 01-01-2022
.
15.4.3 Overlap Joins
Overlap joins are a special case of inequality joins involving one or two columns from the left-hand table overlapping a range defined by two columns from the right-hand table. There are three helpers that join_by()
recognizes to assist with constructing overlap joins, all of which can be constructed from simpler inequalities.
between(x, y_lower, y_upper, ..., bounds = "[]")
which is just short forx >= y_lower, x <= y_upper
within(x_lower, x_upper, y_lower, y_upper)
which is just short forx_lower >= y_lower, x_upper <= y_upper
overlaps(x_lower, x_upper, y_lower, y_upper, ..., bounds = "[]")
which is short forx_lower <= y_upper, x_upper >= y_lower
We can make use of these functions as per our need.
Example-2: Suppose we have orders
data where we have customer-wise orders, corresponsing order dates and ship dates. Let us suppose we want to know about customers who placed another order, without waiting for shipping of any of their previous/earlier order.
orders <- structure(list(customer_id = c("A", "A", "A", "B", "B", "C",
"C"), order_id = 1:7, order_date = c("2019-01-01", "2019-01-05",
"2019-01-16", "2019-01-05", "2019-01-06", "2019-01-07", "2019-01-09"
), ship_date = c("2019-01-30", "2019-02-10", "2019-01-18", "2019-01-08",
"2019-01-08", "2019-01-08", "2019-01-10")), row.names = c(NA,
-7L), class = "data.frame")
# print
orders
## customer_id order_id order_date ship_date
## 1 A 1 2019-01-01 2019-01-30
## 2 A 2 2019-01-05 2019-02-10
## 3 A 3 2019-01-16 2019-01-18
## 4 B 4 2019-01-05 2019-01-08
## 5 B 5 2019-01-06 2019-01-08
## 6 C 6 2019-01-07 2019-01-08
## 7 C 7 2019-01-09 2019-01-10
# customers who placed the orders without
# waiting to ship their earlier order
orders %>%
inner_join(orders,
by = join_by(customer_id,
overlaps(order_date, ship_date, order_date, ship_date),
order_id < order_id))
## customer_id order_id.x order_date.x ship_date.x order_id.y order_date.y
## 1 A 1 2019-01-01 2019-01-30 2 2019-01-05
## 2 A 1 2019-01-01 2019-01-30 3 2019-01-16
## 3 A 2 2019-01-05 2019-02-10 3 2019-01-16
## 4 B 4 2019-01-05 2019-01-08 5 2019-01-06
## ship_date.y
## 1 2019-02-10
## 2 2019-01-18
## 3 2019-01-18
## 4 2019-01-08
15.4.5 Cross Joins
This join matches everything with everything and therefore can be thought of as a cross product of two data frames. We can use cross_join
for this special join. If we have m
and n
rows in two data frames, then as a result of cross join we will have m*n
rows.
Example-
df <- data.frame(products = c('A', 'B', 'C'))
df %>%
cross_join(df)
## products.x products.y
## 1 A A
## 2 A B
## 3 A C
## 4 B A
## 5 B B
## 6 B C
## 7 C A
## 8 C B
## 9 C C