14 Data Transformation in dplyr

Prerequisites Obviously dplyr22 will be needed. This package also comes with matrittr pipe i.e. %>% and therefore in dplyr syntax we will be using these pipes. library(magrittr) is not needed.

The package dplyr (14.1) calls its functions as ‘verbs’ because these are actually doing some action. So dplyr verbs can be divided in three classifications depending upon where they operate -

  • ‘Row’ verbs that operate on Rows
  • ‘Column’ verbs
  • ‘group’ verbs that operate on table split into different groups.
Package Dplyr

Figure 14.1: Package Dplyr

Let’s learn each of these -

14.1 Column verb select()

In real world data sets we will often come across with data frames having numerous columns. However for many of the data analysis tasks, most of these columns are not needed. As already stated select (figure 14.2) operates on columns. Like SELECT in SQL, it just select the column(s) from the given data frame. The basic syntax is - select(data_frame, column_names, ...). So with pipes the same syntax goes like this

data_frame %>% 
  select(column_name)
Illustration of dplyr::select()

Figure 14.2: Illustration of dplyr::select()

For example, let’s try our hands on mtcars dataset.
Example-1

mtcars %>% 
  select(mpg)
##                      mpg
## Mazda RX4           21.0
## Mazda RX4 Wag       21.0
## Datsun 710          22.8
## Hornet 4 Drive      21.4
## Hornet Sportabout   18.7
## Valiant             18.1
## Duster 360          14.3
## Merc 240D           24.4
## Merc 230            22.8
## Merc 280            19.2
## Merc 280C           17.8
## Merc 450SE          16.4
## Merc 450SL          17.3
## Merc 450SLC         15.2
## Cadillac Fleetwood  10.4
## Lincoln Continental 10.4
## Chrysler Imperial   14.7
## Fiat 128            32.4
## Honda Civic         30.4
## Toyota Corolla      33.9
## Toyota Corona       21.5
## Dodge Challenger    15.5
## AMC Javelin         15.2
## Camaro Z28          13.3
## Pontiac Firebird    19.2
## Fiat X1-9           27.3
## Porsche 914-2       26.0
## Lotus Europa        30.4
## Ford Pantera L      15.8
## Ferrari Dino        19.7
## Maserati Bora       15.0
## Volvo 142E          21.4

Note that output is still a data frame unlike the mtcars[['mpg']] which returns a vector. We can subset multiple columns here. Example-2

mtcars %>% 
  select(mpg, qsec) %>% 
  head()
##                    mpg  qsec
## Mazda RX4         21.0 16.46
## Mazda RX4 Wag     21.0 17.02
## Datsun 710        22.8 18.61
## Hornet 4 Drive    21.4 19.44
## Hornet Sportabout 18.7 17.02
## Valiant           18.1 20.22

We can also provide column numbers instead of names. Example-3

mtcars %>% 
  select(4, 6) %>% 
  tail()
##                 hp    wt
## Porsche 914-2   91 2.140
## Lotus Europa   113 1.513
## Ford Pantera L 264 3.170
## Ferrari Dino   175 2.770
## Maserati Bora  335 3.570
## Volvo 142E     109 2.780

We can also use select to reorder the columns in output, by using a dplyr helping verb everything() which is basically everything else. See this example-

mtcars %>% 
  select(qsec, mpg, everything()) %>% 
  names()
##  [1] "qsec" "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "vs"   "am"   "gear"
## [11] "carb"

We may also use mix and match of column names and column numbers. See

mtcars %>% 
  select(5, 7, mpg, everything()) %>% 
  names()
##  [1] "drat" "qsec" "mpg"  "cyl"  "disp" "hp"   "wt"   "vs"   "am"   "gear"
## [11] "carb"

Operator : can also be used with column names. Ex-

mtcars %>% 
  select(mpg:drat) %>% 
  head(n=4)
##                 mpg cyl disp  hp drat
## Mazda RX4      21.0   6  160 110 3.90
## Mazda RX4 Wag  21.0   6  160 110 3.90
## Datsun 710     22.8   4  108  93 3.85
## Hornet 4 Drive 21.4   6  258 110 3.08

Other helping verbs There are other helping verbs, apart from everything() that can be used within select() just to eliminate need to type the column names and select columns based on some conditions. These verbs are self explanatory-

  • starts_with('ABC') will select all columns the names of which starts with string ABC
  • ends_with('ABC') will select all columns the names of which ends with string ABC
  • contains('ABC') will select all columns the names of which contains string ABC
  • num_range('A', 1:3) will select all columns named A1, A2 and A3

Some Examples.

  1. This syntax will select all columns which ends with "color" in column names.
starwars %>% 
  select(ends_with('color'))
## # A tibble: 87 × 3
##    hair_color    skin_color  eye_color
##    <chr>         <chr>       <chr>    
##  1 blond         fair        blue     
##  2 <NA>          gold        yellow   
##  3 <NA>          white, blue red      
##  4 none          white       yellow   
##  5 brown         light       brown    
##  6 brown, grey   light       blue     
##  7 brown         light       blue     
##  8 <NA>          white, red  red      
##  9 black         light       brown    
## 10 auburn, white fair        blue-gray
## # ℹ 77 more rows

Example-2: This syntax will select all columns contains characters "or" in column names.

starwars %>% 
  select(contains('or'))
## # A tibble: 87 × 4
##    hair_color    skin_color  eye_color homeworld
##    <chr>         <chr>       <chr>     <chr>    
##  1 blond         fair        blue      Tatooine 
##  2 <NA>          gold        yellow    Tatooine 
##  3 <NA>          white, blue red       Naboo    
##  4 none          white       yellow    Tatooine 
##  5 brown         light       brown     Alderaan 
##  6 brown, grey   light       blue      Tatooine 
##  7 brown         light       blue      Tatooine 
##  8 <NA>          white, red  red       Tatooine 
##  9 black         light       brown     Tatooine 
## 10 auburn, white fair        blue-gray Stewjon  
## # ℹ 77 more rows

Using where(): This selection helper selects the variables for which a function when applied to column names, returns TRUE.

Example-3: This syntax will select all numeric columns from the dataset.

starwars %>% 
  select(where(is.numeric))
## # A tibble: 87 × 3
##    height  mass birth_year
##     <int> <dbl>      <dbl>
##  1    172    77       19  
##  2    167    75      112  
##  3     96    32       33  
##  4    202   136       41.9
##  5    150    49       19  
##  6    178   120       52  
##  7    165    75       47  
##  8     97    32       NA  
##  9    183    84       24  
## 10    182    77       57  
## # ℹ 77 more rows

14.2 Column verbmutate()

This perhaps is one of the most important functions in dplyr kitty. It enables us to create new column(s) that are functions of one or more existing columns. Refer figure 14.3

Illustration of dplyr::mutate()

Figure 14.3: Illustration of dplyr::mutate()

More than one column can be added simultaneously. Newly created column may also be used for creation of another new column. See example.

starwars %>% 
  select(name:mass) %>% 
  mutate(name_upper = toupper(name),
         BMI = mass/(height/100)^2)
## # A tibble: 87 × 5
##    name               height  mass name_upper           BMI
##    <chr>               <int> <dbl> <chr>              <dbl>
##  1 Luke Skywalker        172    77 LUKE SKYWALKER      26.0
##  2 C-3PO                 167    75 C-3PO               26.9
##  3 R2-D2                  96    32 R2-D2               34.7
##  4 Darth Vader           202   136 DARTH VADER         33.3
##  5 Leia Organa           150    49 LEIA ORGANA         21.8
##  6 Owen Lars             178   120 OWEN LARS           37.9
##  7 Beru Whitesun Lars    165    75 BERU WHITESUN LARS  27.5
##  8 R5-D4                  97    32 R5-D4               34.0
##  9 Biggs Darklighter     183    84 BIGGS DARKLIGHTER   25.1
## 10 Obi-Wan Kenobi        182    77 OBI-WAN KENOBI      23.2
## # ℹ 77 more rows

By default the new column will be added to the last of data frame. As shown in above example, more operations can be combined in one using %>%. There is a cousin transmute() of mutate which drops all the old columns and keeps only newly created columns. Example

starwars %>% 
  transmute(name_upper = toupper(name))
## # A tibble: 87 × 1
##    name_upper        
##    <chr>             
##  1 LUKE SKYWALKER    
##  2 C-3PO             
##  3 R2-D2             
##  4 DARTH VADER       
##  5 LEIA ORGANA       
##  6 OWEN LARS         
##  7 BERU WHITESUN LARS
##  8 R5-D4             
##  9 BIGGS DARKLIGHTER 
## 10 OBI-WAN KENOBI    
## # ℹ 77 more rows

Other useful dplyr functions Another good use of mutate is to generate summarised result and display it corresponding to each row in data. For example if the requirement is to calculate proportion of say wt column in mtcars data.

mtcars %>% 
  head() %>% 
  select(wt) %>% 
  mutate(total_wt = sum(wt),
         wt_proportion = wt*100/total_wt) 
##                      wt total_wt wt_proportion
## Mazda RX4         2.620    17.93      14.61238
## Mazda RX4 Wag     2.875    17.93      16.03458
## Datsun 710        2.320    17.93      12.93921
## Hornet 4 Drive    3.215    17.93      17.93084
## Hornet Sportabout 3.440    17.93      19.18572
## Valiant           3.460    17.93      19.29727
  1. n() is used to count number of rows
  2. n_distinct() is used to count number of distinct values for the given variable
mtcars %>% 
  select(1:5) %>% 
  mutate(total_cars = n()) %>% 
  head()
##                    mpg cyl disp  hp drat total_cars
## Mazda RX4         21.0   6  160 110 3.90         32
## Mazda RX4 Wag     21.0   6  160 110 3.90         32
## Datsun 710        22.8   4  108  93 3.85         32
## Hornet 4 Drive    21.4   6  258 110 3.08         32
## Hornet Sportabout 18.7   8  360 175 3.15         32
## Valiant           18.1   6  225 105 2.76         32

14.3 Column verb rename()

It is used to rename the column names. Refer figure 14.4 for illustration.

Illustration of dplyr::rename()

Figure 14.4: Illustration of dplyr::rename()

See this example

mtcars %>% 
  rename(miles_per_gallon = mpg) %>% 
  head(3)
##               miles_per_gallon cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4                 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag             21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710                22.8   4  108  93 3.85 2.320 18.61  1  1    4    1

Note that select can also rename the columns but will drop all unselected columns. Check this

mtcars %>% 
  select(miles_per_gallon = mpg) %>% 
  head(3)
##               miles_per_gallon
## Mazda RX4                 21.0
## Mazda RX4 Wag             21.0
## Datsun 710                22.8

14.4 Column verb relocate()

It relocates column or block of columns simultaneosly either before the column mentioned in argument .before or after mentioned in .after. See the example-

starwars %>% 
  relocate(ends_with('color'), .after = name) %>% 
  head(5)
## # A tibble: 5 × 14
##   name      hair_color skin_color eye_color height  mass birth_year sex   gender
##   <chr>     <chr>      <chr>      <chr>      <int> <dbl>      <dbl> <chr> <chr> 
## 1 Luke Sky… blond      fair       blue         172    77       19   male  mascu…
## 2 C-3PO     <NA>       gold       yellow       167    75      112   none  mascu…
## 3 R2-D2     <NA>       white, bl… red           96    32       33   none  mascu…
## 4 Darth Va… none       white      yellow       202   136       41.9 male  mascu…
## 5 Leia Org… brown      light      brown        150    49       19   fema… femin…
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

14.5 Row verb filter

This verb/function is used to subset the data, or in other words filter rows of data frame based on certain condition. Refer figure 14.5 for illustration.

Illustration of dplyr::filter()

Figure 14.5: Illustration of dplyr::filter()

See this example-

starwars %>% 
  filter(eye_color %in% c('red', 'yellow'))
## # A tibble: 16 × 14
##    name     height  mass hair_color skin_color eye_color birth_year sex   gender
##    <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
##  1 C-3PO       167    75 <NA>       gold       yellow         112   none  mascu…
##  2 R2-D2        96    32 <NA>       white, bl… red             33   none  mascu…
##  3 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
##  4 R5-D4        97    32 <NA>       white, red red             NA   none  mascu…
##  5 Palpati…    170    75 grey       pale       yellow          82   male  mascu…
##  6 IG-88       200   140 none       metal      red             15   none  mascu…
##  7 Bossk       190   113 none       green      red             53   male  mascu…
##  8 Nute Gu…    191    90 none       mottled g… red             NA   male  mascu…
##  9 Watto       137    NA black      blue, grey yellow          NA   male  mascu…
## 10 Darth M…    175    80 none       red        yellow          54   male  mascu…
## 11 Dud Bolt     94    45 none       blue, grey yellow          NA   male  mascu…
## 12 Ki-Adi-…    198    82 white      pale       yellow          92   male  mascu…
## 13 Yarael …    264    NA none       white      yellow          NA   male  mascu…
## 14 Poggle …    183    80 none       green      yellow          NA   male  mascu…
## 15 Zam Wes…    168    55 blonde     fair, gre… yellow          NA   fema… femin…
## 16 Dexter …    198   102 none       brown      yellow          NA   male  mascu…
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

Multiple conditions can be passed simultaneously. Example

starwars %>% 
  filter(skin_color == 'white',
         height >= 150)
## # A tibble: 2 × 14
##   name      height  mass hair_color skin_color eye_color birth_year sex   gender
##   <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
## 1 Darth Va…    202   136 none       white      yellow          41.9 male  mascu…
## 2 Yarael P…    264    NA none       white      yellow          NA   male  mascu…
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

Note that these conditions act simultaneously as in operator AND is used. So if OR is to be used, use | explicitly

starwars %>% 
  filter(skin_color == 'white' | height >= 150) %>% 
  nrow()
## [1] 71

14.6 Row verbs slice() / slice_*()

slice() and its cousins also filters rows but based on rows placement. So, data_fr %>% slice(1:5) will filter out first five rows of the data_fr. See example

starwars %>% 
  slice(4:10) # filter 4 to 10th row
## # A tibble: 7 × 14
##   name      height  mass hair_color skin_color eye_color birth_year sex   gender
##   <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
## 1 Darth Va…    202   136 none       white      yellow          41.9 male  mascu…
## 2 Leia Org…    150    49 brown      light      brown           19   fema… femin…
## 3 Owen Lars    178   120 brown, gr… light      blue            52   male  mascu…
## 4 Beru Whi…    165    75 brown      light      blue            47   fema… femin…
## 5 R5-D4         97    32 <NA>       white, red red             NA   none  mascu…
## 6 Biggs Da…    183    84 black      light      brown           24   male  mascu…
## 7 Obi-Wan …    182    77 auburn, w… fair       blue-gray       57   male  mascu…
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

Other slice() cousins -

  • slice_head(5) will slice out first 5 rows
  • slice_tail(10) will slice out last 10 rows
  • slice_min() or slice_max() will slice rows with highest or lowest values of given variable. The full syntax is slice_max(.data, order_by, ..., n, prop, with_ties = TRUE) or equivalent
  • slice_sample() will randomly select the rows. Its syntax is slice_sample(.data, ..., n, prop, weight_by = NULL, replace = FALSE)

Example-1:

starwars %>% 
  slice_min(height, n=3)
## # A tibble: 3 × 14
##   name      height  mass hair_color skin_color eye_color birth_year sex   gender
##   <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
## 1 Yoda          66    17 white      green      brown            896 male  mascu…
## 2 Ratts Ty…     79    15 none       grey, blue unknown           NA male  mascu…
## 3 Wicket S…     88    20 brown      brown      brown              8 male  mascu…
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

Example-2:

set.seed(2022)
starwars %>% 
  slice_sample(prop = 0.1) #sample 10% rows
## # A tibble: 8 × 14
##   name      height  mass hair_color skin_color eye_color birth_year sex   gender
##   <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
## 1 Ki-Adi-M…    198  82   white      pale       yellow            92 male  mascu…
## 2 Grievous     216 159   none       brown, wh… green, y…         NA male  mascu…
## 3 Saesee T…    188  NA   none       pale       orange            NA male  mascu…
## 4 Wat Tamb…    193  48   none       green, gr… unknown           NA male  mascu…
## 5 Jango Fe…    183  79   black      tan        brown             66 male  mascu…
## 6 Owen Lars    178 120   brown, gr… light      blue              52 male  mascu…
## 7 Luminara…    170  56.2 black      yellow     blue              58 fema… femin…
## 8 BB8           NA  NA   none       none       black             NA none  mascu…
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

14.7 Row verb arrange()

This verb also act upon rows and it actually rearranges them on the basis of some condition. Refer figure 14.6 for illustration.

Illustration of dplyr::arrange()

Figure 14.6: Illustration of dplyr::arrange()

Example-

starwars %>% 
  arrange(height) %>% 
  slice(1:5)
## # A tibble: 5 × 14
##   name      height  mass hair_color skin_color eye_color birth_year sex   gender
##   <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
## 1 Yoda          66    17 white      green      brown            896 male  mascu…
## 2 Ratts Ty…     79    15 none       grey, blue unknown           NA male  mascu…
## 3 Wicket S…     88    20 brown      brown      brown              8 male  mascu…
## 4 Dud Bolt      94    45 none       blue, grey yellow            NA male  mascu…
## 5 R2-D2         96    32 <NA>       white, bl… red               33 none  mascu…
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

14.8 Group verb group_by()

A data analyst will be hard to find who is not using group_by. It basically groups the rows on the basis of values of a given variable or block of variables. The returned result is still a data frame (and one too) but now the rows are grouped. Refer figure 14.7 for illustration. So any of the above functions we learnt above will give a different result after group by.

Illustration of Grouped Operations in dplyr

Figure 14.7: Illustration of Grouped Operations in dplyr

Note the output of this simple example

starwars %>% 
  group_by(sex)
## # A tibble: 87 × 14
## # Groups:   sex [5]
##    name     height  mass hair_color skin_color eye_color birth_year sex   gender
##    <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
##  1 Luke Sk…    172    77 blond      fair       blue            19   male  mascu…
##  2 C-3PO       167    75 <NA>       gold       yellow         112   none  mascu…
##  3 R2-D2        96    32 <NA>       white, bl… red             33   none  mascu…
##  4 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
##  5 Leia Or…    150    49 brown      light      brown           19   fema… femin…
##  6 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
##  7 Beru Wh…    165    75 brown      light      blue            47   fema… femin…
##  8 R5-D4        97    32 <NA>       white, red red             NA   none  mascu…
##  9 Biggs D…    183    84 black      light      brown           24   male  mascu…
## 10 Obi-Wan…    182    77 auburn, w… fair       blue-gray       57   male  mascu…
## # ℹ 77 more rows
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

Note that output now has 5 groups, though nothing different is seen in the displayed data.

This operation/verb is thus more useful if used in combination with other verbs.

Example-1: How many total characters are with same skin_color?

starwars %>% 
  select(name, skin_color) %>% 
  group_by(skin_color) %>% 
  mutate(total_with_s_c = n())
## # A tibble: 87 × 3
## # Groups:   skin_color [31]
##    name               skin_color  total_with_s_c
##    <chr>              <chr>                <int>
##  1 Luke Skywalker     fair                    17
##  2 C-3PO              gold                     1
##  3 R2-D2              white, blue              2
##  4 Darth Vader        white                    2
##  5 Leia Organa        light                   11
##  6 Owen Lars          light                   11
##  7 Beru Whitesun Lars light                   11
##  8 R5-D4              white, red               1
##  9 Biggs Darklighter  light                   11
## 10 Obi-Wan Kenobi     fair                    17
## # ℹ 77 more rows

Example- 2: Sample 2 rows of each cyl size from mtcars?

set.seed(123)
mtcars %>% 
  group_by(cyl) %>% 
  slice_sample(n=2)
## # A tibble: 6 × 11
## # Groups:   cyl [3]
##     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
## 2  21.4     4  121    109  4.11  2.78  18.6     1     1     4     2
## 3  21       6  160    110  3.9   2.88  17.0     0     1     4     4
## 4  19.7     6  145    175  3.62  2.77  15.5     0     1     5     6
## 5  10.4     8  472    205  2.93  5.25  18.0     0     0     3     4
## 6  13.3     8  350    245  3.73  3.84  15.4     0     0     3     4

Also note that grouped varaible(s) will always be available in the output.

mtcars %>% 
  group_by(cyl) %>% 
  select(drat) %>% # despite not selecting cyl
  head() # it is available in output
## Adding missing grouping variables: `cyl`
## # A tibble: 6 × 2
## # Groups:   cyl [3]
##     cyl  drat
##   <dbl> <dbl>
## 1     6  3.9 
## 2     6  3.9 
## 3     4  3.85
## 4     6  3.08
## 5     8  3.15
## 6     6  2.76

14.9 Group verb summarise()

This verb creates a summary row for each group if grouped data frame is in input, otherwise one single for complete operation.
Example-1:

mtcars %>% 
  summarise(total_wt = sum(wt))
##   total_wt
## 1  102.952

Example-2:

mtcars %>% 
  group_by(cyl) %>% 
  summarise(total_wt = sum(wt))
## # A tibble: 3 × 2
##     cyl total_wt
##   <dbl>    <dbl>
## 1     4     25.1
## 2     6     21.8
## 3     8     56.0

14.10 Group verb ungroup()

But now, it may have been noticed that grouping the data with group_by and thereafter performing some operation like mutate or slice returns the grouped data. But our requirement in next steps would be of ungrouped data. so there is specially designed function in dplyr which ungroups the grouped data.

set.seed(123)
mtcars %>% 
  group_by(cyl) %>% 
  slice_sample(n=2) %>% 
  ungroup()
## # A tibble: 6 × 11
##     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
## 2  21.4     4  121    109  4.11  2.78  18.6     1     1     4     2
## 3  21       6  160    110  3.9   2.88  17.0     0     1     4     4
## 4  19.7     6  145    175  3.62  2.77  15.5     0     1     5     6
## 5  10.4     8  472    205  2.93  5.25  18.0     0     0     3     4
## 6  13.3     8  350    245  3.73  3.84  15.4     0     0     3     4

14.11 New argument .by in mutate, summarise, filter, etc.

Since grouping the data in one step, then performing required step and then again ungrouping it is somewhat cumbersome, coming dplyr 1.1.0 version, the functions mutate(), summarise, slice etc. have gained an additional argument .by wherein the variable names for performing grouped operations may be provided directly, thus eliminating the need of both grouping and ungrouping it. So the following two syntax would produce exactly same results.

# Syntax - 1 (old style)

mtcars %>% 
  group_by(cyl) %>% 
  slice(2) %>% 
  ungroup()
## # A tibble: 3 × 11
##     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
## 2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
## 3  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
# Syntax 2 (Dplyr 1.1.0 +)

mtcars %>% 
  slice(2, .by = cyl)
##                mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4 Wag 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Merc 240D     24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Duster 360    14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4

14.12 Group verb reframe()

The content is under development.

14.13 Other Useful functions in dplyr

if_else()

This function operates nearly as similar to base R’s ifelse() with two exceptions-

  • There is an extra argument to provide values when missing values are encountered. (See example-1)
  • NA will have to be provided specifically. (See Example-2)

See these examples. Example-1:

x <- c(-2:2, NA)
if_else(x>=0, "positive", "negative", "missing")
## [1] "negative" "negative" "positive" "positive" "positive" "missing"

Example-2:

x <- c(-2:2, NA)
if_else(x>=0, "positive", "negative", NA_character_)
## [1] "negative" "negative" "positive" "positive" "positive" NA

Due to the additional restrictions, this function is sometimes faster than its base R alternative and may also be useful in prevention of bugs in code as the output will be known beforehand.

case_when()

Though both ifelse and if_else variants provide for nesting multiple conditions, yet case_when provides a simpler alternative in these conditions as here multiple conditions can be provided simultaneously. Syntax follows this style-

case_when(
  condition1 ~ value_if_true,
  condition2 ~ value_if_true,
  ...,
  TRUE ~ value_if_all_above_are_false
)

See this example.

set.seed(123)
# Generate Incomes from random (uniform) distribution
income <- runif(7, 1, 9)*100000
income
## [1] 330062.0 730644.1 427181.5 806413.9 852373.8 136445.2 522484.4
# tax brackets say 0% upto 2 lakh, then 10% upto 5 Lakh
# then 20% upto 7.5 lakh otherwise 30%
tax_slab <- case_when(
  income <= 200000 ~ 0,
  income <= 500000 ~ 10,
  income <= 750000 ~ 20,
  TRUE ~ 30
)

# check tax_slab
data.frame(
  income=income,
  tax_slab = tax_slab
)
##     income tax_slab
## 1 330062.0       10
## 2 730644.1       20
## 3 427181.5       10
## 4 806413.9       30
## 5 852373.8       30
## 6 136445.2        0
## 7 522484.4       20

case_match()

This function is somewhat similar to case_when. While case_when() uses logical expressions on the left-hand side of the formula, case_match() uses values to match with. Syntax is like

case_match(
  .x,  # A vector whose values are to be matched
  ..., # Basically for a condition like .x %in% Y, only Y has to be specified
  .default = NULL, # Default values to be specified for if no match is found
)

So, above example could be rewritten as

tax_slab <- case_match(
  floor(income), # Vector to be checked
  0:200000 ~ 0, # slab 1
  200001:500000 ~ 10, # slab 2
  500001:750000 ~ 20, # slab 3
  .default = 30 # Slab 4
)

# check tax_slab
data.frame(
  income=income,
  tax_slab = tax_slab
)
##     income tax_slab
## 1 330062.0       10
## 2 730644.1       20
## 3 427181.5       10
## 4 806413.9       30
## 5 852373.8       30
## 6 136445.2        0
## 7 522484.4       20

14.14 Functional programming in dplyr through across()

The content is under development.

14.15 Window functions/operations

We learnt that by using group_by function we can create windows in data and we can make our calculations in each separate window specifically.

Dplyr provides us with some useful window functions which will operate on these windows.

  1. row_number() can be used to generate row number
  2. dense_rank / min_rank / percent_rank() / ntile() / cume_dist() are other windowed functions in dplyr. Check ?dplyr::ranking for complete reference.
  3. lead() and lag() will give leading/lagging value in that window.
  4. consecutive_id() generates a unique identifier that increments every time a variable (or combination of variables) changes.

These functions can be very helpful while analyzing time series data.

Example-1: Generating row numbers and ranks

# example data
df <- data.frame(
  val = c(10, 2, 3, 2, NA)
)

df %>% 
  mutate(
    row = row_number(),
    min_rank = min_rank(val),
    dense_rank = dense_rank(val),
    perc_rank = percent_rank(val),
    cume_dist = cume_dist(val)
  )
##   val row min_rank dense_rank perc_rank cume_dist
## 1  10   1        4          3 1.0000000      1.00
## 2   2   2        1          1 0.0000000      0.50
## 3   3   3        3          2 0.6666667      0.75
## 4   2   4        1          1 0.0000000      0.50
## 5  NA   5       NA         NA        NA        NA

Example-2: Using previous and next values

Orange %>% 
  group_by(Tree) %>% 
  mutate(prev_circ = lag(circumference),
         next_circ = lead(circumference))
## # A tibble: 35 × 5
## # Groups:   Tree [5]
##    Tree    age circumference prev_circ next_circ
##    <ord> <dbl>         <dbl>     <dbl>     <dbl>
##  1 1       118            30        NA        58
##  2 1       484            58        30        87
##  3 1       664            87        58       115
##  4 1      1004           115        87       120
##  5 1      1231           120       115       142
##  6 1      1372           142       120       145
##  7 1      1582           145       142        NA
##  8 2       118            33        NA        69
##  9 2       484            69        33       111
## 10 2       664           111        69       156
## # ℹ 25 more rows

Example-3: generating run length encoding/consecutive ID

df <- data.frame(
  grp = c("A", "B", "B", "A", "A", "B"),
  value = c(1, 2, 3 , 4, 5, 6)
)

df %>% 
  mutate(RLE = consecutive_id(grp))
##   grp value RLE
## 1   A     1   1
## 2   B     2   2
## 3   B     3   2
## 4   A     4   3
## 5   A     5   3
## 6   B     6   4