10 Data Cleaning in R
Data cleansing is one of the important steps in data analysis. Multiple packages are available in R to clean data sets. One such package is janitor, which we will be using in this chapter along with a few other packages.
Let’s load it:

library(janitor)
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
10.1 Cleaning column names
We know that names of objects in R follow certain conventions; for example, certain special characters are not allowed, and a name containing a space has to be wrapped in a pair of backticks (`). But when we read data from Excel files, we often get some ‘dirty’ names, which we should clean before proceeding. In such cases clean_names() comes in handy. E.g.
# Create a data.frame with dirty names
test_df <- as.data.frame(matrix(ncol = 6))
names(test_df) <- c("firstName", "ábc@!*", "% successful (2009)",
"REPEAT VALUE", "REPEAT VALUE", "")
# View this data
test_df
## firstName ábc@!* % successful (2009) REPEAT VALUE REPEAT VALUE
## 1 NA NA NA NA NA NA
Using clean_names(), which is also pipe-friendly, we can clean the names in one step (the results will be in snake_case).
test_df %>%
clean_names()
## first_name abc percent_successful_2009 repeat_value repeat_value_2 x
## 1 NA NA NA NA NA NA
It:

- Parses letter cases and separators to a consistent format. The default is snake_case, but other cases like camelCase are available (see the short example below).
- Handles special characters and spaces, including transliterating characters like œ to oe.
- Appends numbers to duplicated names.
- Converts “%” to “percent” and “#” to “number” to retain meaning.
- Preserves spacing (or lack thereof) around numbers.
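For instance, a small sketch reusing the test_df created above: the case argument lets us pick a different naming convention (the exact resulting names depend on the underlying snakecase rules).

# Sketch: ask clean_names() for lower camel case instead of the default snake_case
test_df %>%
  clean_names(case = "lower_camel")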
10.2 Handling duplicate records
The janitor package has a ready-to-use function, get_dupes(). It allows us to find “similar” observations in a data set based on certain variables. The syntax is pretty simple, and the function is pipe-friendly too. Suppose we have to find duplicates in the mtcars dataset on each combination of wt and cyl:

mtcars %>%
  get_dupes(wt, cyl)
## wt cyl dupe_count mpg disp hp drat qsec vs am gear carb
## 1 3.44 6 2 19.2 167.6 123 3.92 18.30 1 0 4 4
## 2 3.44 6 2 17.8 167.6 123 3.92 18.90 1 0 4 4
## 3 3.57 8 2 14.3 360.0 245 3.21 15.84 0 0 3 4
## 4 3.57 8 2 15.0 301.0 335 3.54 14.60 0 1 5 8
We can see that it returns all duplicate records with an additional column dupe_count
so that these duplicates can be analysed separately.
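As an aside, if no column names are supplied, get_dupes() compares complete rows. A quick sketch (mtcars has no fully duplicated rows, so this should come back empty):

# Sketch: with no variables named, duplicates are checked across all columns
mtcars %>%
  get_dupes()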
10.3 Remove Constant (Redundant) columns
Dropping columns from a data.frame that contain only a single constant value throughout is again easy using janitor::remove_constant().
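A minimal sketch, using a made-up data.frame (df_const is hypothetical):

# 'source' holds the same value in every row, so it adds no information
df_const <- data.frame(id = 1:3,
                       value = c(10, 20, 30),
                       source = "file_a")
janitor::remove_constant(df_const)  # keeps only the id and value columns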
10.4 Remove empty rows and/or columns
While importing messy data from Excel files, we may get some empty rows and/or columns. Sorting out this issue is easy using janitor::remove_empty().
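A minimal sketch, again with a made-up data.frame (df_messy is hypothetical):

# Row 2 and column z are entirely NA, mimicking blank spreadsheet rows/columns
df_messy <- data.frame(x = c(1, NA, 3),
                       y = c("a", NA, "c"),
                       z = c(NA, NA, NA))
janitor::remove_empty(df_messy, which = c("rows", "cols"))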
10.5 Fix Excel dates stored as serial numbers
While loading Excel files in R, we may sometimes notice a serial number like 41590 instead of a date. Sorting out this issue is again easy in janitor, as it has a function excel_numeric_to_date() for this. Example:
janitor::excel_numeric_to_date(41590)
## [1] "2013-11-12"
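If the workbook came from a legacy Mac version of Excel that uses the 1904 date system, excel_numeric_to_date() also has a date_system argument; per the janitor documentation the alternative value is "mac pre-2011":

# Same serial number, interpreted under the old Mac 1904 date system
janitor::excel_numeric_to_date(41590, date_system = "mac pre-2011")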
10.6 Convert a mix of date and datetime formats to date
Similar to the above, if we have a column mixing different date formats, we can sort it out using janitor::convert_to_date() or janitor::convert_to_datetime(). See the example:
unsorted_dates <- c('2018-05-31', '41590', 41590)
janitor::convert_to_date(unsorted_dates)
## [1] "2018-05-31" "2013-11-12" "2013-11-12"
Note that in the above example we created a seemingly heterogeneous vector, but R’s implicit coercion rules converted all elements to character.
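We can verify this quickly:

class(unsorted_dates)
## [1] "character"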
In real-world examples, where data is entered through multiple machines/data points simultaneously, we may get a column containing a mix of date formats. In that case, we may use the parse_date_time() function from the lubridate package. To allow different formats we use the orders argument of this function. Example:
mixed_dates <- c("13-11-1991", "13-Sep-22",
"20 August 2000", "15 August 87",
"03/31/23", "12-31-2022")
lubridate::parse_date_time(mixed_dates,
orders = c("d m y", "d B Y", "m/d/y", "d B y"),
locale = "eng")
## [1] "1991-11-13 UTC" "2022-09-13 UTC" "2000-08-20 UTC" "1987-08-15 UTC"
## [5] "2023-03-31 UTC" "2022-12-31 UTC"
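Note that parse_date_time() returns POSIXct date-times. If plain Date values are needed, the result can be wrapped in as.Date(), for instance:

# Drop the time component to keep only calendar dates
as.Date(lubridate::parse_date_time(mixed_dates,
                                    orders = c("d m y", "d B Y", "m/d/y", "d B y"),
                                    locale = "eng"))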