11 Merging large number of similar datasets into one

Data preparation for performing analytics is an important task and may require more time than actual analytics because we rarely have data in ideal format. Importing csv or flat files is rather an easy job. However, considering large popularity of MS Excel, we at times have our data saved in excel files.

Sometimes, one single data frame/table is divided into multiple sheets in one excel file whereas sometimes these tables are divided in multiple files. Here we are discussing few of these cases, where we can reduce our data preparation time by effectively writing the code for import of such data into our environment.

11.1 Case-1: Merging multiple excel sheets into one data frame

As an example, let’s say we have multiple States’ data saved on a different sheet in one excel file say Daur.xlsx. See preview in fig 11.1.

Preview of example excel file

Figure 11.1: Preview of example excel file

We will use library readxl to read excel files. This library is bundled with tidyverse but is not part of core tidyverse, so it has to be loaded explicitly, though explicit download is not required if tidyverse is installed in the system.

The following steps are used-

Step-1: Read path

Step-2: Collect Names of all sheets

Step-3: Set names of elements of above vector onto itself

Step-4: Read and combine all tables into one. We will use purrr::map_dfr for this.

# Step-1
path <- "data/daur1.xlsx"
# Step-2
states_data <- excel_sheets(path) %>% 
# step-3
    set_names(., .) %>% 
# step-4
    map_dfr(read_excel, path=path, .id = 'State_name')
# print file
states_data
## # A tibble: 18 × 4
##    State_name     Year    Metric_1 Metric_2
##    <chr>          <chr>      <dbl>    <dbl>
##  1 Andhra Pradesh 2015-16      119      121
##  2 Andhra Pradesh 2016-17      114      134
##  3 Andhra Pradesh 2017-18      115      122
##  4 Andhra Pradesh 2018-19      129      137
##  5 Andhra Pradesh 2019-20      149      129
##  6 Andhra Pradesh 2020-21      104      124
##  7 Assam          2015-16      104      145
##  8 Assam          2016-17      114      144
##  9 Assam          2017-18      116      116
## 10 Assam          2018-19      129      130
## 11 Assam          2019-20      144      134
## 12 Assam          2020-21      124      116
## 13 Bihar          2015-16      109      148
## 14 Bihar          2016-17      134      106
## 15 Bihar          2017-18      131      133
## 16 Bihar          2018-19      131      100
## 17 Bihar          2019-20      131      127
## 18 Bihar          2020-21      128      103

We can see that data from all the sheets have been merged into table and one extra column has been created using sheet name. The name of that column has been provided through .id argument. If the new column is not required, simply don’t use this argument.

11.2 Case-2: Merging multiple files into one data frame

Often we have our source data split into multiple files, which we will have to compile in one single data frame before proceeding In this case, we may collect all such files in one directory and follow these steps-

Step-1: Store all file names using list.files()

Step-2: We may read all files in one list using either lapply or purrr::map.

Step-3: If data structures in all the files are same, we can directly use purrr::map_dfr which will read all files and give us a data frame. If however, the structure of data in all files are not same, we may convert all columns into character type before merging these files. We can thereafter proceed for merging all data using either purrr::map_dfr or lapply in combination with do.call.

11.3 Case-3: Split and save one data frame into multiple excel/csv files simultaneously.

As an example will use states_data created in case-1. We can use the following algorithm

Step-1: Create a vector of file names using paste0 Step-2: Split data frame into a list with separate dataframe for each state Step-3: Write to a separate file using purrr::walk2()

The complete algorithm is

# step-1 : create a vector of file names (output)
file_names <- paste0("data/", unique(states_data$State_name), ".csv")

states_data %>% 
  group_split(State_name) %>% 
  purrr::walk2(file_names, write.csv)

We can check that 3 new files with state_names as filenames have been created in the data folder/directory as desired.

11.4 Case-4: Splitting one data into muliple files having multiple sheets

Sometimes, we may require to split a file not only into multiple files, but simultaeously require to split each file into multiple excel sheets. E.g. A data having States and districts is to be split into State-wise files having a separate sheet for each district.

This can be achieved using writexl library. In this case, we may write a custom function which can do our job easily.

library(tidyverse)
library(writexl)

book_and_sheets <- function(df, x, y){
  df_by_x <- df %>% 
    split(.[[x]])
  
  save_to_excel <- function(a, b){
    a %>% 
      split(.[[y]]) %>% 
      writexl::write_xlsx(
        path = paste0("data/data_by_", b,"_",x, ".xlsx")
      )
    
  }
  
  imap(df_by_x, save_to_excel)
}

The function book_and_sheets designed in above code helps us to write a data say df into separate files based on column x and each of these files is further divided into sheets based on column y. Only thing to be remembered is that we have to pass, x and y arguments as character strings; and df as variable.

Example - splitting mtcars into files based on cyl and sheets based on gear

book_and_sheets(mtcars, 'cyl', 'gear')
## $`4`
## [1] "G:\\OneDrive - A c GeM CAG of INDIA (1)\\new book\\Data\\data_by_4_cyl.xlsx"
## 
## $`6`
## [1] "G:\\OneDrive - A c GeM CAG of INDIA (1)\\new book\\Data\\data_by_6_cyl.xlsx"
## 
## $`8`
## [1] "G:\\OneDrive - A c GeM CAG of INDIA (1)\\new book\\Data\\data_by_8_cyl.xlsx"

We can check that three excel files have been created in directory data/.