10 Data Cleaning in R

Data cleaning is one of the important steps in data analysis. Multiple packages are available in R to clean data sets. One such package is janitor, which we will be using in this chapter along with a few other packages.

Let’s load it:

library(janitor)
## 
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

10.1 Cleaning column names

We know that names of objects in R follow certain conventions; for example, names may not contain certain special characters, and a name containing a space has to be wrapped in a pair of backticks `. But when we read data from Excel files, we often get some ‘dirty’ names, which we should clean before proceeding. In such cases clean_names() comes in handy. E.g.

# Create a data.frame with dirty names
test_df <- as.data.frame(matrix(ncol = 6))

names(test_df) <- c("firstName", "ábc@!*", "% successful (2009)",
                    "REPEAT VALUE", "REPEAT VALUE", "")
# View this data
test_df
##   firstName ábc@!* % successful (2009) REPEAT VALUE REPEAT VALUE   
## 1        NA     NA                  NA           NA           NA NA

Using clean_names(), which is also pipe-friendly, we can clean these names in one step (results will be in snake_case).

test_df %>% 
  clean_names()
##   first_name abc percent_successful_2009 repeat_value repeat_value_2  x
## 1         NA  NA                      NA           NA             NA NA

Specifically, clean_names():

  • Parses letter cases and separators to a consistent format.
  • Defaults to snake_case, but other cases like camelCase are available (see the sketch after this list).
  • Handles special characters and spaces, including transliterating characters like œ to oe.
  • Appends numbers to duplicated names.
  • Converts “%” to “percent” and “#” to “number” to retain meaning.
  • Preserves spacing (or the lack thereof) around numbers.
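
To get a case other than the default snake_case, we can use the case argument of clean_names() (passed on to the snakecase package). A minimal sketch, reusing test_df from above:

# Request camelCase names instead of the default snake_case
test_df %>% 
  clean_names(case = "lower_camel")

The cleaned names should then come out in lowerCamel form (e.g. percentSuccessful2009 instead of percent_successful_2009).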

10.2 Handling duplicate records

The janitor package gives us a ready-to-use function, get_dupes(). It allows us to find “similar” observations in a data set based on certain columns. The syntax is pretty simple, and the function is pipe-friendly too. Suppose we have to find duplicates in the mtcars dataset on each combination of wt and cyl.

mtcars %>% 
  get_dupes(wt, cyl)
##     wt cyl dupe_count  mpg  disp  hp drat  qsec vs am gear carb
## 1 3.44   6          2 19.2 167.6 123 3.92 18.30  1  0    4    4
## 2 3.44   6          2 17.8 167.6 123 3.92 18.90  1  0    4    4
## 3 3.57   8          2 14.3 360.0 245 3.21 15.84  0  0    3    4
## 4 3.57   8          2 15.0 301.0 335 3.54 14.60  0  1    5    8

We can see that it returns all duplicate records with an additional column dupe_count so that these duplicates can be analysed separately.
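
If no column names are supplied, get_dupes() uses all columns and flags rows that are exact duplicates of one another. A minimal sketch with a made-up data frame (dupes_df is purely illustrative):

# A toy data.frame in which rows 2 and 3 are identical
dupes_df <- data.frame(id    = c(1, 2, 2),
                       score = c(10, 20, 20))

# With no columns named, get_dupes() checks entire rows
dupes_df %>% 
  get_dupes()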

10.3 Remove constant (redundant) columns

Dropping columns from a data.frame that contain only a single constant value throughout is again easy with janitor::remove_constant().
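
A minimal sketch with a made-up data frame (const_df is purely illustrative), where column b never varies:

# Column "b" holds the same value in every row, so it adds no information
const_df <- data.frame(a = 1:3,
                       b = "x",
                       c = c(10, 20, 30))

# Drops the constant column "b"
const_df %>% 
  remove_constant()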

10.4 Remove empty rows and/or columns

While importing messy data from Excel files, we may get some empty rows and/or columns. Sorting out this issue is easy using janitor::remove_empty().
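
A minimal sketch with a made-up data frame (empty_df is purely illustrative), containing one all-NA row and one all-NA column:

# Row 2 and column "b" contain nothing but NA, as often happens with Excel imports
empty_df <- data.frame(a = c(1, NA, 3),
                       b = NA,
                       c = c("x", NA, "z"))

# Drop rows and columns that are entirely NA
empty_df %>% 
  remove_empty(which = c("rows", "cols"))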

10.5 Fix Excel dates stored as serial numbers

While loading Excel files into R, we may sometimes see a serial number like 41590 where a date is expected. Sorting out this issue is again easy in janitor, as we have the function excel_numeric_to_date() for this. Example:

janitor::excel_numeric_to_date(41590)
## [1] "2013-11-12"

10.6 Convert a mix of date and datetime formats to date

Similar to the above, if we have a column with a mix of different date formats, we can also sort it out using janitor::convert_to_date() or janitor::convert_to_datetime(). See the examples below.

unsorted_dates <- c('2018-05-31', '41590', 41590)
janitor::convert_to_date(unsorted_dates)
## [1] "2018-05-31" "2013-11-12" "2013-11-12"

Note that in the above example we have created a heterogeneous vector, but R’s implicit coercion rules have converted all elements to character.
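
convert_to_datetime() works the same way when the time of day matters; for Excel serial numbers, the fractional part of the number represents the fraction of the day elapsed. A minimal sketch:

# 41590.5 is noon on the day represented by the serial number 41590
janitor::convert_to_datetime(c(41590, 41590.5))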

In real-world examples, where data is entered through multiple machines/data entry points simultaneously, we may get a column with a mix of date formats. In that case, we may use the parse_date_time() function from the lubridate package. To allow different formats, we have to use the orders argument of this function. Example:

mixed_dates <- c("13-11-1991", "13-Sep-22", 
                 "20 August 2000", "15 August 87", 
                 "03/31/23", "12-31-2022")

lubridate::parse_date_time(mixed_dates,
                           orders = c("d m y", "d B Y", "m/d/y", "d B y"),
                           locale = "eng")
## [1] "1991-11-13 UTC" "2022-09-13 UTC" "2000-08-20 UTC" "1987-08-15 UTC"
## [5] "2023-03-31 UTC" "2022-12-31 UTC"