14 Data Transformation in dplyr
Prerequisites
Obviously dplyr
22 will be needed. This package also comes with matrittr pipe i.e. %>%
and therefore in dplyr syntax we will be using these pipes. library(magrittr)
is not needed.
The package dplyr
(14.1) calls its functions as ‘verbs’ because these are actually doing some action. So dplyr verbs
can be divided in three classifications depending upon where they operate -
- ‘Row’ verbs that operate on Rows
- ‘Column’ verbs
- ‘group’ verbs that operate on table split into different groups.

Figure 14.1: Package Dplyr
Let’s learn each of these -
14.1 Column verb select()
In real world data sets we will often come across with data frames having numerous columns. However for many of the data analysis tasks, most of these columns are not needed. As already stated select
(figure 14.2) operates on columns. Like SELECT
in SQL
, it just select the column(s) from the given data frame. The basic syntax is - select(data_frame, column_names, ...)
. So with pipes the same syntax goes like this
data_frame %>%
select(column_name)

Figure 14.2: Illustration of dplyr::select()
For example, let’s try our hands on mtcars
dataset.
Example-1
## mpg
## Mazda RX4 21.0
## Mazda RX4 Wag 21.0
## Datsun 710 22.8
## Hornet 4 Drive 21.4
## Hornet Sportabout 18.7
## Valiant 18.1
## Duster 360 14.3
## Merc 240D 24.4
## Merc 230 22.8
## Merc 280 19.2
## Merc 280C 17.8
## Merc 450SE 16.4
## Merc 450SL 17.3
## Merc 450SLC 15.2
## Cadillac Fleetwood 10.4
## Lincoln Continental 10.4
## Chrysler Imperial 14.7
## Fiat 128 32.4
## Honda Civic 30.4
## Toyota Corolla 33.9
## Toyota Corona 21.5
## Dodge Challenger 15.5
## AMC Javelin 15.2
## Camaro Z28 13.3
## Pontiac Firebird 19.2
## Fiat X1-9 27.3
## Porsche 914-2 26.0
## Lotus Europa 30.4
## Ford Pantera L 15.8
## Ferrari Dino 19.7
## Maserati Bora 15.0
## Volvo 142E 21.4
Note that output is still a data frame unlike the mtcars[['mpg']]
which returns a vector. We can subset multiple columns here. Example-2
## mpg qsec
## Mazda RX4 21.0 16.46
## Mazda RX4 Wag 21.0 17.02
## Datsun 710 22.8 18.61
## Hornet 4 Drive 21.4 19.44
## Hornet Sportabout 18.7 17.02
## Valiant 18.1 20.22
We can also provide column numbers instead of names. Example-3
## hp wt
## Porsche 914-2 91 2.140
## Lotus Europa 113 1.513
## Ford Pantera L 264 3.170
## Ferrari Dino 175 2.770
## Maserati Bora 335 3.570
## Volvo 142E 109 2.780
We can also use select
to reorder the columns in output, by using a dplyr
helping verb everything()
which is basically everything else. See this example-
mtcars %>%
select(qsec, mpg, everything()) %>%
names()
## [1] "qsec" "mpg" "cyl" "disp" "hp" "drat" "wt" "vs" "am" "gear"
## [11] "carb"
We may also use mix and match of column names and column numbers. See
mtcars %>%
select(5, 7, mpg, everything()) %>%
names()
## [1] "drat" "qsec" "mpg" "cyl" "disp" "hp" "wt" "vs" "am" "gear"
## [11] "carb"
Operator :
can also be used with column names. Ex-
## mpg cyl disp hp drat
## Mazda RX4 21.0 6 160 110 3.90
## Mazda RX4 Wag 21.0 6 160 110 3.90
## Datsun 710 22.8 4 108 93 3.85
## Hornet 4 Drive 21.4 6 258 110 3.08
Other helping verbs
There are other helping verbs, apart from everything()
that can be used within select()
just to eliminate need to type the column names and select columns based on some conditions. These verbs are self explanatory-
-
starts_with('ABC')
will select all columns the names of which starts with stringABC
-
ends_with('ABC')
will select all columns the names of which ends with stringABC
-
contains('ABC')
will select all columns the names of which contains stringABC
-
num_range('A', 1:3)
will select all columns namedA1
,A2
andA3
Some Examples.
- This syntax will select all columns which ends with
"color"
in column names.
## # A tibble: 87 × 3
## hair_color skin_color eye_color
## <chr> <chr> <chr>
## 1 blond fair blue
## 2 <NA> gold yellow
## 3 <NA> white, blue red
## 4 none white yellow
## 5 brown light brown
## 6 brown, grey light blue
## 7 brown light blue
## 8 <NA> white, red red
## 9 black light brown
## 10 auburn, white fair blue-gray
## # ℹ 77 more rows
Example-2: This syntax will select all columns contains characters "or"
in column names.
## # A tibble: 87 × 4
## hair_color skin_color eye_color homeworld
## <chr> <chr> <chr> <chr>
## 1 blond fair blue Tatooine
## 2 <NA> gold yellow Tatooine
## 3 <NA> white, blue red Naboo
## 4 none white yellow Tatooine
## 5 brown light brown Alderaan
## 6 brown, grey light blue Tatooine
## 7 brown light blue Tatooine
## 8 <NA> white, red red Tatooine
## 9 black light brown Tatooine
## 10 auburn, white fair blue-gray Stewjon
## # ℹ 77 more rows
Using where()
: This selection helper selects the variables for which a function when applied to column names, returns TRUE
.
Example-3: This syntax will select all numeric columns from the dataset.
## # A tibble: 87 × 3
## height mass birth_year
## <int> <dbl> <dbl>
## 1 172 77 19
## 2 167 75 112
## 3 96 32 33
## 4 202 136 41.9
## 5 150 49 19
## 6 178 120 52
## 7 165 75 47
## 8 97 32 NA
## 9 183 84 24
## 10 182 77 57
## # ℹ 77 more rows
14.2 Column verbmutate()
This perhaps is one of the most important functions in dplyr
kitty. It enables us to create new column(s) that are functions of one or more existing columns. Refer figure 14.3

Figure 14.3: Illustration of dplyr::mutate()
More than one column can be added simultaneously. Newly created column may also be used for creation of another new column. See example.
## # A tibble: 87 × 5
## name height mass name_upper BMI
## <chr> <int> <dbl> <chr> <dbl>
## 1 Luke Skywalker 172 77 LUKE SKYWALKER 26.0
## 2 C-3PO 167 75 C-3PO 26.9
## 3 R2-D2 96 32 R2-D2 34.7
## 4 Darth Vader 202 136 DARTH VADER 33.3
## 5 Leia Organa 150 49 LEIA ORGANA 21.8
## 6 Owen Lars 178 120 OWEN LARS 37.9
## 7 Beru Whitesun Lars 165 75 BERU WHITESUN LARS 27.5
## 8 R5-D4 97 32 R5-D4 34.0
## 9 Biggs Darklighter 183 84 BIGGS DARKLIGHTER 25.1
## 10 Obi-Wan Kenobi 182 77 OBI-WAN KENOBI 23.2
## # ℹ 77 more rows
By default the new column will be added to the last of data frame. As shown in above example, more operations can be combined in one using %>%
. There is a cousin transmute()
of mutate which drops all the old columns and keeps only newly created columns. Example
## # A tibble: 87 × 1
## name_upper
## <chr>
## 1 LUKE SKYWALKER
## 2 C-3PO
## 3 R2-D2
## 4 DARTH VADER
## 5 LEIA ORGANA
## 6 OWEN LARS
## 7 BERU WHITESUN LARS
## 8 R5-D4
## 9 BIGGS DARKLIGHTER
## 10 OBI-WAN KENOBI
## # ℹ 77 more rows
Other useful dplyr functions Another good use of mutate
is to generate summarised result and display it corresponding to each row in data. For example if the requirement is to calculate proportion of say wt
column in mtcars
data.
## wt total_wt wt_proportion
## Mazda RX4 2.620 17.93 14.61238
## Mazda RX4 Wag 2.875 17.93 16.03458
## Datsun 710 2.320 17.93 12.93921
## Hornet 4 Drive 3.215 17.93 17.93084
## Hornet Sportabout 3.440 17.93 19.18572
## Valiant 3.460 17.93 19.29727
-
n()
is used to count number of rows -
n_distinct()
is used to count number of distinct values for the given variable
## mpg cyl disp hp drat total_cars
## Mazda RX4 21.0 6 160 110 3.90 32
## Mazda RX4 Wag 21.0 6 160 110 3.90 32
## Datsun 710 22.8 4 108 93 3.85 32
## Hornet 4 Drive 21.4 6 258 110 3.08 32
## Hornet Sportabout 18.7 8 360 175 3.15 32
## Valiant 18.1 6 225 105 2.76 32
14.3 Column verb rename()
It is used to rename the column names. Refer figure 14.4 for illustration.

Figure 14.4: Illustration of dplyr::rename()
See this example
## miles_per_gallon cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Note that select
can also rename the columns but will drop all unselected columns. Check this
## miles_per_gallon
## Mazda RX4 21.0
## Mazda RX4 Wag 21.0
## Datsun 710 22.8
14.4 Column verb relocate()
It relocates column or block of columns simultaneosly either before the column mentioned in argument .before
or after mentioned in .after
. See the example-
## # A tibble: 5 × 14
## name hair_color skin_color eye_color height mass birth_year sex gender
## <chr> <chr> <chr> <chr> <int> <dbl> <dbl> <chr> <chr>
## 1 Luke Sky… blond fair blue 172 77 19 male mascu…
## 2 C-3PO <NA> gold yellow 167 75 112 none mascu…
## 3 R2-D2 <NA> white, bl… red 96 32 33 none mascu…
## 4 Darth Va… none white yellow 202 136 41.9 male mascu…
## 5 Leia Org… brown light brown 150 49 19 fema… femin…
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
14.5 Row verb filter
This verb/function is used to subset the data, or in other words filter rows of data frame based on certain condition. Refer figure 14.5 for illustration.

Figure 14.5: Illustration of dplyr::filter()
See this example-
## # A tibble: 16 × 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 C-3PO 167 75 <NA> gold yellow 112 none mascu…
## 2 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
## 3 Darth V… 202 136 none white yellow 41.9 male mascu…
## 4 R5-D4 97 32 <NA> white, red red NA none mascu…
## 5 Palpati… 170 75 grey pale yellow 82 male mascu…
## 6 IG-88 200 140 none metal red 15 none mascu…
## 7 Bossk 190 113 none green red 53 male mascu…
## 8 Nute Gu… 191 90 none mottled g… red NA male mascu…
## 9 Watto 137 NA black blue, grey yellow NA male mascu…
## 10 Darth M… 175 80 none red yellow 54 male mascu…
## 11 Dud Bolt 94 45 none blue, grey yellow NA male mascu…
## 12 Ki-Adi-… 198 82 white pale yellow 92 male mascu…
## 13 Yarael … 264 NA none white yellow NA male mascu…
## 14 Poggle … 183 80 none green yellow NA male mascu…
## 15 Zam Wes… 168 55 blonde fair, gre… yellow NA fema… femin…
## 16 Dexter … 198 102 none brown yellow NA male mascu…
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
Multiple conditions can be passed simultaneously. Example
## # A tibble: 2 × 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Darth Va… 202 136 none white yellow 41.9 male mascu…
## 2 Yarael P… 264 NA none white yellow NA male mascu…
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
Note that these conditions act simultaneously as in operator AND
is used. So if OR
is to be used, use |
explicitly
## [1] 71
14.6 Row verbs slice()
/ slice_*()
slice()
and its cousins also filters rows but based on rows placement. So, data_fr %>% slice(1:5)
will filter out first five rows of the data_fr
. See example
## # A tibble: 7 × 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Darth Va… 202 136 none white yellow 41.9 male mascu…
## 2 Leia Org… 150 49 brown light brown 19 fema… femin…
## 3 Owen Lars 178 120 brown, gr… light blue 52 male mascu…
## 4 Beru Whi… 165 75 brown light blue 47 fema… femin…
## 5 R5-D4 97 32 <NA> white, red red NA none mascu…
## 6 Biggs Da… 183 84 black light brown 24 male mascu…
## 7 Obi-Wan … 182 77 auburn, w… fair blue-gray 57 male mascu…
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
Other slice()
cousins -
-
slice_head(5)
will slice out first 5 rows -
slice_tail(10)
will slice out last 10 rows -
slice_min()
orslice_max()
will slice rows with highest or lowest values of given variable. The full syntax isslice_max(.data, order_by, ..., n, prop, with_ties = TRUE)
or equivalent -
slice_sample()
will randomly select the rows. Its syntax isslice_sample(.data, ..., n, prop, weight_by = NULL, replace = FALSE)
Example-1:
## # A tibble: 3 × 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Yoda 66 17 white green brown 896 male mascu…
## 2 Ratts Ty… 79 15 none grey, blue unknown NA male mascu…
## 3 Wicket S… 88 20 brown brown brown 8 male mascu…
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
Example-2:
set.seed(2022)
starwars %>%
slice_sample(prop = 0.1) #sample 10% rows
## # A tibble: 8 × 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Ki-Adi-M… 198 82 white pale yellow 92 male mascu…
## 2 Grievous 216 159 none brown, wh… green, y… NA male mascu…
## 3 Saesee T… 188 NA none pale orange NA male mascu…
## 4 Wat Tamb… 193 48 none green, gr… unknown NA male mascu…
## 5 Jango Fe… 183 79 black tan brown 66 male mascu…
## 6 Owen Lars 178 120 brown, gr… light blue 52 male mascu…
## 7 Luminara… 170 56.2 black yellow blue 58 fema… femin…
## 8 BB8 NA NA none none black NA none mascu…
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
14.7 Row verb arrange()
This verb also act upon rows and it actually rearranges them on the basis of some condition. Refer figure 14.6 for illustration.

Figure 14.6: Illustration of dplyr::arrange()
Example-
## # A tibble: 5 × 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Yoda 66 17 white green brown 896 male mascu…
## 2 Ratts Ty… 79 15 none grey, blue unknown NA male mascu…
## 3 Wicket S… 88 20 brown brown brown 8 male mascu…
## 4 Dud Bolt 94 45 none blue, grey yellow NA male mascu…
## 5 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
14.8 Group verb group_by()
A data analyst will be hard to find who is not using group_by
. It basically groups the rows on the basis of values of a given variable or block of variables. The returned result is still a data frame (and one too) but now the rows are grouped. Refer figure 14.7 for illustration. So any of the above functions we learnt above will give a different result after group by.

Figure 14.7: Illustration of Grouped Operations in dplyr
Note the output of this simple example
## # A tibble: 87 × 14
## # Groups: sex [5]
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Luke Sk… 172 77 blond fair blue 19 male mascu…
## 2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
## 3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
## 4 Darth V… 202 136 none white yellow 41.9 male mascu…
## 5 Leia Or… 150 49 brown light brown 19 fema… femin…
## 6 Owen La… 178 120 brown, gr… light blue 52 male mascu…
## 7 Beru Wh… 165 75 brown light blue 47 fema… femin…
## 8 R5-D4 97 32 <NA> white, red red NA none mascu…
## 9 Biggs D… 183 84 black light brown 24 male mascu…
## 10 Obi-Wan… 182 77 auburn, w… fair blue-gray 57 male mascu…
## # ℹ 77 more rows
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
Note that output now has 5 groups, though nothing different is seen in the displayed data.
This operation/verb is thus more useful if used in combination with other verbs.
Example-1: How many total characters are with same skin_color?
## # A tibble: 87 × 3
## # Groups: skin_color [31]
## name skin_color total_with_s_c
## <chr> <chr> <int>
## 1 Luke Skywalker fair 17
## 2 C-3PO gold 1
## 3 R2-D2 white, blue 2
## 4 Darth Vader white 2
## 5 Leia Organa light 11
## 6 Owen Lars light 11
## 7 Beru Whitesun Lars light 11
## 8 R5-D4 white, red 1
## 9 Biggs Darklighter light 11
## 10 Obi-Wan Kenobi fair 17
## # ℹ 77 more rows
Example- 2: Sample 2 rows of each cyl
size from mtcars
?
set.seed(123)
mtcars %>%
group_by(cyl) %>%
slice_sample(n=2)
## # A tibble: 6 × 11
## # Groups: cyl [3]
## mpg cyl disp hp drat wt qsec vs am gear carb
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
## 2 21.4 4 121 109 4.11 2.78 18.6 1 1 4 2
## 3 21 6 160 110 3.9 2.88 17.0 0 1 4 4
## 4 19.7 6 145 175 3.62 2.77 15.5 0 1 5 6
## 5 10.4 8 472 205 2.93 5.25 18.0 0 0 3 4
## 6 13.3 8 350 245 3.73 3.84 15.4 0 0 3 4
Also note that grouped
varaible(s) will always be available in the output.
mtcars %>%
group_by(cyl) %>%
select(drat) %>% # despite not selecting cyl
head() # it is available in output
## Adding missing grouping variables: `cyl`
## # A tibble: 6 × 2
## # Groups: cyl [3]
## cyl drat
## <dbl> <dbl>
## 1 6 3.9
## 2 6 3.9
## 3 4 3.85
## 4 6 3.08
## 5 8 3.15
## 6 6 2.76
14.9 Group verb summarise()
This verb creates a summary row for each group if grouped data frame is in input, otherwise one single for complete operation.
Example-1:
## total_wt
## 1 102.952
Example-2:
## # A tibble: 3 × 2
## cyl total_wt
## <dbl> <dbl>
## 1 4 25.1
## 2 6 21.8
## 3 8 56.0
14.10 Group verb ungroup()
But now, it may have been noticed that grouping the data with group_by
and thereafter performing some operation like mutate
or slice
returns the grouped data. But our requirement in next steps would be of ungrouped data. so there is specially designed function in dplyr which ungroups the grouped data.
## # A tibble: 6 × 11
## mpg cyl disp hp drat wt qsec vs am gear carb
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
## 2 21.4 4 121 109 4.11 2.78 18.6 1 1 4 2
## 3 21 6 160 110 3.9 2.88 17.0 0 1 4 4
## 4 19.7 6 145 175 3.62 2.77 15.5 0 1 5 6
## 5 10.4 8 472 205 2.93 5.25 18.0 0 0 3 4
## 6 13.3 8 350 245 3.73 3.84 15.4 0 0 3 4
14.11 New argument .by
in mutate
, summarise
, filter
, etc.
Since grouping the data in one step, then performing required step and then again ungrouping it is somewhat cumbersome, coming dplyr 1.1.0 version, the functions mutate()
, summarise
, slice
etc. have gained an additional argument .by
wherein the variable names for performing grouped operations may be provided directly, thus eliminating the need of both grouping and ungrouping it. So the following two syntax would produce exactly same results.
## # A tibble: 3 × 11
## mpg cyl disp hp drat wt qsec vs am gear carb
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
## 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
## 3 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
14.13 Other Useful functions in dplyr
if_else()
This function operates nearly as similar to base R’s ifelse()
with two exceptions-
- There is an extra argument to provide values when missing values are encountered. (See example-1)
-
NA
will have to be provided specifically. (See Example-2)
See these examples. Example-1:
## [1] "negative" "negative" "positive" "positive" "positive" "missing"
Example-2:
## [1] "negative" "negative" "positive" "positive" "positive" NA
Due to the additional restrictions, this function is sometimes faster than its base R alternative and may also be useful in prevention of bugs in code as the output will be known beforehand.
case_when()
Though both ifelse
and if_else
variants provide for nesting multiple conditions, yet case_when
provides a simpler alternative in these conditions as here multiple conditions can be provided simultaneously. Syntax follows this style-
case_when(
condition1 ~ value_if_true,
condition2 ~ value_if_true,
...,
TRUE ~ value_if_all_above_are_false
)
See this example.
set.seed(123)
# Generate Incomes from random (uniform) distribution
income <- runif(7, 1, 9)*100000
income
## [1] 330062.0 730644.1 427181.5 806413.9 852373.8 136445.2 522484.4
# tax brackets say 0% upto 2 lakh, then 10% upto 5 Lakh
# then 20% upto 7.5 lakh otherwise 30%
tax_slab <- case_when(
income <= 200000 ~ 0,
income <= 500000 ~ 10,
income <= 750000 ~ 20,
TRUE ~ 30
)
# check tax_slab
data.frame(
income=income,
tax_slab = tax_slab
)
## income tax_slab
## 1 330062.0 10
## 2 730644.1 20
## 3 427181.5 10
## 4 806413.9 30
## 5 852373.8 30
## 6 136445.2 0
## 7 522484.4 20
case_match()
This function is somewhat similar to case_when
. While case_when()
uses logical expressions on the left-hand side of the formula, case_match()
uses values to match with. Syntax is like
case_match(
.x, # A vector whose values are to be matched
..., # Basically for a condition like .x %in% Y, only Y has to be specified
.default = NULL, # Default values to be specified for if no match is found
)
So, above example could be rewritten as
tax_slab <- case_match(
floor(income), # Vector to be checked
0:200000 ~ 0, # slab 1
200001:500000 ~ 10, # slab 2
500001:750000 ~ 20, # slab 3
.default = 30 # Slab 4
)
# check tax_slab
data.frame(
income=income,
tax_slab = tax_slab
)
## income tax_slab
## 1 330062.0 10
## 2 730644.1 20
## 3 427181.5 10
## 4 806413.9 30
## 5 852373.8 30
## 6 136445.2 0
## 7 522484.4 20
14.15 Window functions/operations
We learnt that by using group_by
function we can create windows in data and we can make our calculations in each separate window specifically.
Dplyr provides us with some useful window functions which will operate on these windows.
-
row_number()
can be used to generate row number -
dense_rank
/min_rank
/percent_rank()
/ntile()
/cume_dist()
are other windowed functions in dplyr. Check?dplyr::ranking
for complete reference. -
lead()
andlag()
will give leading/lagging value in that window. -
consecutive_id()
generates a unique identifier that increments every time a variable (or combination of variables) changes.
These functions can be very helpful while analyzing time series data.
Example-1: Generating row numbers and ranks
# example data
df <- data.frame(
val = c(10, 2, 3, 2, NA)
)
df %>%
mutate(
row = row_number(),
min_rank = min_rank(val),
dense_rank = dense_rank(val),
perc_rank = percent_rank(val),
cume_dist = cume_dist(val)
)
## val row min_rank dense_rank perc_rank cume_dist
## 1 10 1 4 3 1.0000000 1.00
## 2 2 2 1 1 0.0000000 0.50
## 3 3 3 3 2 0.6666667 0.75
## 4 2 4 1 1 0.0000000 0.50
## 5 NA 5 NA NA NA NA
Example-2: Using previous and next values
Orange %>%
group_by(Tree) %>%
mutate(prev_circ = lag(circumference),
next_circ = lead(circumference))
## # A tibble: 35 × 5
## # Groups: Tree [5]
## Tree age circumference prev_circ next_circ
## <ord> <dbl> <dbl> <dbl> <dbl>
## 1 1 118 30 NA 58
## 2 1 484 58 30 87
## 3 1 664 87 58 115
## 4 1 1004 115 87 120
## 5 1 1231 120 115 142
## 6 1 1372 142 120 145
## 7 1 1582 145 142 NA
## 8 2 118 33 NA 69
## 9 2 484 69 33 111
## 10 2 664 111 69 156
## # ℹ 25 more rows
Example-3: generating run length encoding/consecutive ID
df <- data.frame(
grp = c("A", "B", "B", "A", "A", "B"),
value = c(1, 2, 3 , 4, 5, 6)
)
df %>%
mutate(RLE = consecutive_id(grp))
## grp value RLE
## 1 A 1 1
## 2 B 2 2
## 3 B 3 2
## 4 A 4 3
## 5 A 5 3
## 6 B 6 4