15 Combining Tables/tabular data

Most of times, joining two or more tables will be required to perform analytics

Figure 15.1: Most of times, joining two or more tables will be required to perform analytics

In real world scenarios, there may hardly be a case when we have to analyse one single table. There may be cases when we have to either join tables split into multiple smaller tables (e.g. we can have smaller tables split States-wise), or the tables may be divided into various smaller master and transaction tables (relational databases).

We may thus divide the data tables joining requirements into three broad categories-

  • Simple joins or concatenation
  • Relational Joins
  • Filtering Joins

Let us discuss each of these with examples.

15.1 Simple joins/concatenation

Many times tables split into smaller tables have to be joined back before proceeding further for data analytics. We may have to join two or more tables either columnwise (e.g. some of the features for all rows have been split into a separate table) or row wise (e.g. all the fields/columns are split into smaller tables like a separate table for each State). Diagramatically these joins may be depicted as shown in figure 15.2.

Illustration of Simple joins/concatenation

Figure 15.2: Illustration of Simple joins/concatenation

15.1.1 Column binding

As we have already seen that data frames act like matrices in many ways except to that fact that these support heterogeneous data unlike matrices. We have also discussed the ways two matrices can be joined. Base R has two dedicated functions i.e. cbind() and rbind() for these operations.

In tidyverse (dplyr specifically) we have two similar functions bind_cols() and bind_rows respectively which provide us better functionality for these use cases. The syntax for finction bind_cols() used for concatenating two or more tables column wise is -

bind_cols(
  ...,
  .name_repair = c("unique", "universal", "check_unique")
)

Where -

  • ... represent data frames to be combined
  • .name_repair argument chooses method to rename duplicate column names, if any.

Example-1:

df1 <- iris[1:3, c(1,2,5)]
df2 <- iris[1:3, 3:5]

bind_cols(df1, df2, .name_repair = 'universal')
## New names:
## • `Species` -> `Species...3`
## • `Species` -> `Species...6`
##   Sepal.Length Sepal.Width Species...3 Petal.Length Petal.Width Species...6
## 1          5.1         3.5      setosa          1.4         0.2      setosa
## 2          4.9         3.0      setosa          1.4         0.2      setosa
## 3          4.7         3.2      setosa          1.3         0.2      setosa

Note: The data frames to be merged should be row-consistent for column binding. Try this bind_cols(iris[1:3, 1:2], iris[1:4, 3:4]) and see the results.

15.1.2 Row binding

The syntax used for appending rows of one or more tables together in one table is -

bind_rows(..., .id = NULL)

where -

  • ... represent data frames to be combined
  • .id argument creates a new column of identifiers.

To understand it better, let us see this example

setosa <- iris[1:3, 1:4]
versicolor <- iris[51:53, 1:4]
virginica <- iris[101:103, 1:4]

bind_rows(setosa, versicolor, virginica, .id = 'groups')
##   groups Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1      1          5.1         3.5          1.4         0.2
## 2      1          4.9         3.0          1.4         0.2
## 3      1          4.7         3.2          1.3         0.2
## 4      2          7.0         3.2          4.7         1.4
## 5      2          6.4         3.2          4.5         1.5
## 6      2          6.9         3.1          4.9         1.5
## 7      3          6.3         3.3          6.0         2.5
## 8      3          5.8         2.7          5.1         1.9
## 9      3          7.1         3.0          5.9         2.1

Note: In the above example if the requirement is to store data tables names into the new identifier column just list convert the databases into a list and .id will take element_names as the values in identifiers. Try this bind_rows(list(setosa=setosa, versicolor=versicolor, virginica=virginica), .id = 'Species')

15.2 Relational joins

Relational Joins are usually needed to join multiple tables based on primary key and secondary keys. The joins may either be one-to-one key join or one-to-many key joins. Broadly these can either be inner joins or outer joins. Diagrammatically these may be represented as shown in figure 15.3 wherein it may be noticed that additional columns from right data frame gets appended to left data frame. For this reason, these four joins are also known as mutating joins.

Illustration of Mutating Joins in dplyr

Figure 15.3: Illustration of Mutating Joins in dplyr

The syntax of all these joins is nearly same-

*_join(x, y, by = NULL, copy = FALSE, suffix = c('.x', '.y'), ... , keep = FALSE)

where -

  • x and y are data frames to be joined
  • by is a character vector of column names to be joined by
  • suffix argument provides suffixes to be added to column names, if any of those are duplicate
  • keep argument decides whether the join keys from both x and y be preserved in the output?

Let’s discuss each of these joins individually.

15.2.1 Inner Joins

Inner Joins keeps only those rows where matching keys are present in both the data frames.

Example-1:

band_members
## # A tibble: 3 × 2
##   name  band   
##   <chr> <chr>  
## 1 Mick  Stones 
## 2 John  Beatles
## 3 Paul  Beatles
band_instruments
## # A tibble: 3 × 2
##   name  plays 
##   <chr> <chr> 
## 1 John  guitar
## 2 Paul  bass  
## 3 Keith guitar
inner_join(band_members, band_instruments)
## Joining with `by = join_by(name)`
## # A tibble: 2 × 3
##   name  band    plays 
##   <chr> <chr>   <chr> 
## 1 John  Beatles guitar
## 2 Paul  Beatles bass

15.2.2 Left Joins

Left Joins on the other hand preserves all rows of data frame passed as x i.e. first argument irrespective of the fact that matching key record is available in second data table or not.

Example-1:

left_join(band_members, band_instruments)
## Joining with `by = join_by(name)`
## # A tibble: 3 × 3
##   name  band    plays 
##   <chr> <chr>   <chr> 
## 1 Mick  Stones  <NA>  
## 2 John  Beatles guitar
## 3 Paul  Beatles bass

15.2.3 Right Joins

Right join, is similar to left join and preserves all rows of data frame passed as y i.e. second argument irrespective of the fact that matching key record is available in first data table or not.

Example-1:

right_join(band_members, band_instruments)
## Joining with `by = join_by(name)`
## # A tibble: 3 × 3
##   name  band    plays 
##   <chr> <chr>   <chr> 
## 1 John  Beatles guitar
## 2 Paul  Beatles bass  
## 3 Keith <NA>    guitar

15.2.4 Full Joins

Full join returns all the rows of both the data tables despite non-availability of matching key in either of the tables.

Example-1:

full_join(band_members, band_instruments)
## Joining with `by = join_by(name)`
## # A tibble: 4 × 3
##   name  band    plays 
##   <chr> <chr>   <chr> 
## 1 Mick  Stones  <NA>  
## 2 John  Beatles guitar
## 3 Paul  Beatles bass  
## 4 Keith <NA>    guitar

You must have noticed that each of the examples shown above has thrown a warning that join has been performed on variable name. We may override this warning by specifically providing the joining key column name(s) in by argument i.e. by = "name".

There may be cases when the joining key column(s) in the two data frames are of different names. These cases can also be handled by using by argument.

Example-

band_instruments2
## # A tibble: 3 × 2
##   artist plays 
##   <chr>  <chr> 
## 1 John   guitar
## 2 Paul   bass  
## 3 Keith  guitar
left_join(band_members, band_instruments2, by = c('name' = 'artist'))
## # A tibble: 3 × 3
##   name  band    plays 
##   <chr> <chr>   <chr> 
## 1 Mick  Stones  <NA>  
## 2 John  Beatles guitar
## 3 Paul  Beatles bass

Note: Each of the *_join() can be joined on multiple keys/columns (i.e. more than one) using by argument as explained above.

Many-to-many Joins

Example of four of the above-mentioned joins are many to many joins and thus need to be used carefully. See the following example

df1 <- data.frame(
  x = c(1, 1, NA),
  y = c(11, 12, 21)
)
df1
##    x  y
## 1  1 11
## 2  1 12
## 3 NA 21
df2 <- data.frame(
  x = c(1, 1, NA, NA),
  z = c(101, 102, 201, 202)
)
df2
##    x   z
## 1  1 101
## 2  1 102
## 3 NA 201
## 4 NA 202
df1 %>% left_join(df2, by = 'x')
## Warning in left_join(., df2, by = "x"): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 1 of `x` matches multiple rows in `y`.
## ℹ Row 1 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship = "many-to-many"` to silence this warning.
##    x  y   z
## 1  1 11 101
## 2  1 11 102
## 3  1 12 101
## 4  1 12 102
## 5 NA 21 201
## 6 NA 21 202

In fact, we can use argument relationship explicitly to silence this warning and avoid errors. This argument can take one of these values

  • NULL (default)
  • "one-to-one"
  • "one-to-many"
  • "many-to-one"
  • "many-to-many"

Example above re-coded

df1 %>% left_join(df2, by = 'x', relationship = "many-to-many")
##    x  y   z
## 1  1 11 101
## 2  1 11 102
## 3  1 12 101
## 4  1 12 102
## 5 NA 21 201
## 6 NA 21 202

15.3 Filtering Joins

The other joins available in dplyr are basically filtering joins. These joins will not result in appending additional columns present in right data frame to left data frame. Only the common keys (columns) from right data frame will be used to filter left data frame.

15.3.1 Semi Joins

First of these is semi_join which essentially filters those rows from a data frame, which are based on another data frame. See this example-

df1 <- data.frame(
  x = c(1, 2, NA),
  y = c(11, 21, 100)
)
df1
##    x   y
## 1  1  11
## 2  2  21
## 3 NA 100
df2 <- data.frame(
  x = c(1, 3, 4, NA),
  z = c(101, 301, 401, 501)
)
df2
##    x   z
## 1  1 101
## 2  3 301
## 3  4 401
## 4 NA 501
df1 %>% semi_join(df2, by = 'x')
##    x   y
## 1  1  11
## 2 NA 100

15.3.2 Anti Joins

The anti_join is basically opposite to that of semi_join. It keeps only those records from left data-frame which are not available in right data-frame. Example

df1 %>% anti_join(df2, by = 'x')
##   x  y
## 1 2 21

15.4 Relational, but non-equi joins

Till now, we have learnt joins that use syntax of dplyr version before 1.1.0. However, in dplyr newest version, 1.1.0, released in January 2023, a new helper function join_by() has been introduced. It actually, constructs a specification that describes how to join two tables using a small domain specific language. The result can be supplied as the by argument to any of the join functions that we have learnt above. For equal joins, we have learnt above, there are two slight but easier to implement changes if we use join_by() -

  • we may pass variable names in by argument instead of using characters as described above.
  • instead of using = we have to use == which is actually the equality operator as opposed to assignment operator we were using earlier.

So our syntax for left_join example is

band_instruments2
## # A tibble: 3 × 2
##   artist plays 
##   <chr>  <chr> 
## 1 John   guitar
## 2 Paul   bass  
## 3 Keith  guitar
left_join(band_members, band_instruments2, by = join_by(name == artist))
## # A tibble: 3 × 3
##   name  band    plays 
##   <chr> <chr>   <chr> 
## 1 Mick  Stones  <NA>  
## 2 John  Beatles guitar
## 3 Paul  Beatles bass

Simple, isn’t it. But now using the potential of this new helper function, we can on the other hand join two (or more data frames) using inequality conditions.

15.4.1 Inequality Joins

Inequality joins match on an inequality, such as >, >=, <, or <=, and are common in time series analysis and genomics. To construct an inequality join using join_by(), we have to supply two column names separated by one of the above mentioned inequalities. See the following example for better understanding of inequality joins.

Example- Suppose we have two tables, one containing product-wise sales and another containing rates as and when these are revised. So, in order to have total sales amount, we can use inequality join.

# sales
sales <- data.frame(
  product.id = c("A", "A", "A", "A", "A", "B", "B"),
  sales.date = as.Date(c("01-01-2022",
                 "02-02-2022","05-11-2022","05-04-2022",
                 "05-10-2022","01-02-2022","01-01-2023"), format = "%d-%m-%Y"),
  quantity = c(5L, 10L, 15L, 5L, 15L, 1L, 2L)
)
# Print it
sales
##   product.id sales.date quantity
## 1          A 2022-01-01        5
## 2          A 2022-02-02       10
## 3          A 2022-11-05       15
## 4          A 2022-04-05        5
## 5          A 2022-10-05       15
## 6          B 2022-02-01        1
## 7          B 2023-01-01        2
# rates
rates <- data.frame(
  product.id = c("A", "A", "A", "B", "B"),
  revision.date = as.Date(c("01-02-2022",
                    "03-04-2022","05-10-2022","01-01-2022",
                    "05-10-2022"), format = "%d-%m-%Y"),
  rate = c(50L, 60L, 70L, 500L, 1500L)
)
#print it
rates
##   product.id revision.date rate
## 1          A    2022-02-01   50
## 2          A    2022-04-03   60
## 3          A    2022-10-05   70
## 4          B    2022-01-01  500
## 5          B    2022-10-05 1500
# total sale(1st attempt)
sales %>% 
  left_join(rates,
            by = join_by(product.id,
                         sales.date >= revision.date))
##    product.id sales.date quantity revision.date rate
## 1           A 2022-01-01        5          <NA>   NA
## 2           A 2022-02-02       10    2022-02-01   50
## 3           A 2022-11-05       15    2022-02-01   50
## 4           A 2022-11-05       15    2022-04-03   60
## 5           A 2022-11-05       15    2022-10-05   70
## 6           A 2022-04-05        5    2022-02-01   50
## 7           A 2022-04-05        5    2022-04-03   60
## 8           A 2022-10-05       15    2022-02-01   50
## 9           A 2022-10-05       15    2022-04-03   60
## 10          A 2022-10-05       15    2022-10-05   70
## 11          B 2022-02-01        1    2022-01-01  500
## 12          B 2023-01-01        2    2022-01-01  500
## 13          B 2023-01-01        2    2022-10-05 1500

We got the results, but the results/rates are obtained fixed on all earlier dates. To solve this issue, we have rolling joins. See next section.

15.4.2 Rolling Joins

Rolling joins are a variant of inequality joins that limit the results returned from an inequality join condition. They are useful for “rolling” the closest match forward/backwards when there isn’t an exact match. To construct a rolling join, we have to wrap inequality condition with closest().

Above example solved by wrapping date condition in closest()

sales %>% 
  left_join(rates,
            by = join_by(product.id,
                         closest(sales.date >= revision.date)))
##   product.id sales.date quantity revision.date rate
## 1          A 2022-01-01        5          <NA>   NA
## 2          A 2022-02-02       10    2022-02-01   50
## 3          A 2022-11-05       15    2022-10-05   70
## 4          A 2022-04-05        5    2022-04-03   60
## 5          A 2022-10-05       15    2022-10-05   70
## 6          B 2022-02-01        1    2022-01-01  500
## 7          B 2023-01-01        2    2022-10-05 1500

We may see that we do not have rates for product A for sales made on 01-01-2022.

15.4.3 Overlap Joins

Overlap joins are a special case of inequality joins involving one or two columns from the left-hand table overlapping a range defined by two columns from the right-hand table. There are three helpers that join_by() recognizes to assist with constructing overlap joins, all of which can be constructed from simpler inequalities.

  • between(x, y_lower, y_upper, ..., bounds = "[]") which is just short for x >= y_lower, x <= y_upper

  • within(x_lower, x_upper, y_lower, y_upper) which is just short for x_lower >= y_lower, x_upper <= y_upper

  • overlaps(x_lower, x_upper, y_lower, y_upper, ..., bounds = "[]") which is short for x_lower <= y_upper, x_upper >= y_lower

We can make use of these functions as per our need.

Example-2: Suppose we have orders data where we have customer-wise orders, corresponsing order dates and ship dates. Let us suppose we want to know about customers who placed another order, without waiting for shipping of any of their previous/earlier order.

orders <- structure(list(customer_id = c("A", "A", "A", "B", "B", "C", 
"C"), order_id = 1:7, order_date = c("2019-01-01", "2019-01-05", 
"2019-01-16", "2019-01-05", "2019-01-06", "2019-01-07", "2019-01-09"
), ship_date = c("2019-01-30", "2019-02-10", "2019-01-18", "2019-01-08", 
"2019-01-08", "2019-01-08", "2019-01-10")), row.names = c(NA, 
-7L), class = "data.frame")

# print 
orders
##   customer_id order_id order_date  ship_date
## 1           A        1 2019-01-01 2019-01-30
## 2           A        2 2019-01-05 2019-02-10
## 3           A        3 2019-01-16 2019-01-18
## 4           B        4 2019-01-05 2019-01-08
## 5           B        5 2019-01-06 2019-01-08
## 6           C        6 2019-01-07 2019-01-08
## 7           C        7 2019-01-09 2019-01-10
# customers who placed the orders without
# waiting to ship their earlier order
orders %>% 
  inner_join(orders,
             by = join_by(customer_id,
                          overlaps(order_date, ship_date, order_date, ship_date),
                          order_id < order_id)) 
##   customer_id order_id.x order_date.x ship_date.x order_id.y order_date.y
## 1           A          1   2019-01-01  2019-01-30          2   2019-01-05
## 2           A          1   2019-01-01  2019-01-30          3   2019-01-16
## 3           A          2   2019-01-05  2019-02-10          3   2019-01-16
## 4           B          4   2019-01-05  2019-01-08          5   2019-01-06
##   ship_date.y
## 1  2019-02-10
## 2  2019-01-18
## 3  2019-01-18
## 4  2019-01-08

15.4.4 Example uses of rolling joins.

The content is under development.

15.4.5 Cross Joins

This join matches everything with everything and therefore can be thought of as a cross product of two data frames. We can use cross_join for this special join. If we have m and n rows in two data frames, then as a result of cross join we will have m*n rows.

Example-

df <- data.frame(products = c('A', 'B', 'C'))

df %>% 
  cross_join(df)
##   products.x products.y
## 1          A          A
## 2          A          B
## 3          A          C
## 4          B          A
## 5          B          B
## 6          B          C
## 7          C          A
## 8          C          B
## 9          C          C