29 Regex - A quick introduction

A regular expression, or regex for short, is a sequence of characters that defines a search pattern. The pattern describes a set of strings, which makes regex a powerful tool for searching and manipulating text. Regex is available in many programming languages, including R.

Regex patterns are made up of a combination of regular characters and special characters. Regular characters include letters, digits, and punctuation marks. Special characters have a specific meaning in regex and are used to represent patterns of characters.

Regex patterns can be used for a variety of purposes, including:

Searching for specific strings in text
Extracting specific parts of a string
Replacing parts of a string with other text
Validating input from users

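In base R, functions such as grep() and gsub() search for and manipulate text using regex. The examples in this chapter mostly use functions from the stringr package, such as str_view(), instead.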
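
As a minimal sketch of the base R route, the block below runs grepl(), grep(), sub() and gsub() on a small vector. The vector fruits and its contents are made up for illustration and do not appear elsewhere in this chapter.

```r
# Hypothetical sample data, used only for illustration
fruits <- c("apple pie", "banana split", "cherry tart")

# grepl() returns TRUE/FALSE for each element
has_an <- grepl("an", fruits)

# grep(value = TRUE) returns the matching elements themselves
with_e <- grep("e", fruits, value = TRUE)

# sub() replaces only the first match in each element, gsub() replaces all
first_a <- sub("a", "A", fruits)
all_a   <- gsub("a", "A", fruits)

print(has_an)
print(with_e)
print(first_a)
print(all_a)
```

None of these functions needs an extra package, so they are handy when stringr is not available.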

29.1 Basic Regex - Literal Characters

Every literal character is itself a regex that matches that character. Thus, the pattern a matches the third character in the text Charles. Literal characters are case sensitive.

Example-1

library(stringr)   # provides str_view() and regex()

ex_text <- "This is an example text"
# Match literal `x`
str_view(ex_text, "x")

## [1] │ This is an e<x>ample te<x>t

# Match Upper case literal "X"
str_view(ex_text, "X", match = NA)

## [1] │ This is an example text

29.1.1 Case sensitivity

Because literals are case sensitive and we do not always know the exact case in advance, stringr provides the function regex(), which has an argument ignore_case (note the snake case). Behind the scenes, every pattern passed to a stringr function is wrapped in regex(), with ignore_case defaulting to FALSE. Thus, the code in the example above is equivalent to the following-

# Match literal `x`
str_view(ex_text, regex("x"))

## [1] │ This is an e<x>ample te<x>t

# Match Upper case literal "X"
str_view(ex_text, regex("X"), match = NA)

## [1] │ This is an example text

Thus, to match literals (or other regex patterns) regardless of case, we can set the argument ignore_case like this-

# Match `x` or `X`, ignoring case
str_view(ex_text, regex("X", ignore_case = TRUE))

## [1] │ This is an e<x>ample te<x>t
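
For comparison, here is a minimal base R sketch of the same idea. It assumes grepl() with its ignore.case argument, and the inline flag (?i) together with perl = TRUE. The variable names are chosen only for this illustration.

```r
ex_text <- "This is an example text"

# Case-sensitive by default: there is no upper-case X in the text
default_match <- grepl("X", ex_text)

# ignore.case = TRUE plays the role of stringr's ignore_case = TRUE
ignore_arg <- grepl("X", ex_text, ignore.case = TRUE)

# The inline flag (?i) switches on case-insensitive matching inside the pattern
inline_flag <- grepl("(?i)X", ex_text, perl = TRUE)

print(c(default_match, ignore_arg, inline_flag))
```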

29.2 Metacharacters

29.2.1 Character sets

It is not always feasible to spell out every literal character. We can instead match any one character from a given set of options by putting the options in square brackets. So, [abc] matches any one of a, b, or c.

Example-

ex_vec <- c("Apple", "Orange", "Myrrh")
# matches a vowel
str_view(ex_vec, "[aeiou]")

## [1] │ Appl<e>
## [2] │ Or<a>ng<e>

# matches a vowel irrespective of case
str_view(ex_vec, regex("[aeiou]", ignore_case = TRUE))

## [1] │ <A>ppl<e>
## [2] │ <O>r<a>ng<e>

To match a range of characters or digits, we separate the two ends of the range with a hyphen inside the square brackets. So, [a-n] matches any one character from the set [abcdefghijklmn].

Example-

ex_text <- "The quick brown fox jumps over the lazy dog"
# Match a, b or c in lower case
str_view(ex_text, regex("[a-c]"))

## [1] │ The qui<c>k <b>rown fox jumps over the l<a>zy dog

Example-2

ex_colors <- c("grey", "black", "gray")
str_view(ex_colors, "gr[ae]y")

## [1] │ <grey>
## [3] │ <gray>
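
str_view() only highlights matches. When the matched text itself is needed, one possible base R sketch combines gregexpr(), which locates every match, with regmatches(), which pulls the matched text out. The variable names abc_hits and is_grey are chosen here for illustration.

```r
ex_text <- "The quick brown fox jumps over the lazy dog"

# gregexpr() locates every match; regmatches() extracts the matched text
abc_hits <- regmatches(ex_text, gregexpr("[a-c]", ex_text))[[1]]

ex_colors <- c("grey", "black", "gray")
# grepl() reports which elements contain a match
is_grey <- grepl("gr[ae]y", ex_colors)

print(abc_hits)
print(is_grey)
```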

We can also use pre-built character classes listed below.

[:punct:] punctuation.
[:alpha:] letters.
[:lower:] lowercase letters.
[:upper:] uppercase letters.
[:digit:] digits.
[:xdigit:] hex digits.
[:alnum:] letters and numbers.
[:cntrl:] control characters.
[:graph:] letters, numbers, and punctuation.
[:print:] letters, numbers, punctuation, and white-space.
[:space:] space characters (basically equivalent to \\s).
[:blank:] space and tab.

Example-

ex_vec2 <- c("One apple", "2 Oranges")
str_view(ex_vec2, "[:digit:]", match = NA)

## [1] │ One apple
## [2] │ <2> Oranges
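
stringr accepts [:digit:] on its own, as the example above shows. Base R documents these classes as valid only inside a bracket expression, so the same check is written with double brackets there. The sketch below assumes base R's default regex engine. The variable names are illustrative.

```r
ex_vec2 <- c("One apple", "2 Oranges")

# Base R expects the class inside a bracket expression: [[:digit:]]
has_digit <- grepl("[[:digit:]]", ex_vec2)

# Several classes can share one pair of outer brackets
masked <- gsub("[[:upper:][:digit:]]", "_", ex_vec2)

print(has_digit)
print(masked)
```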

29.2.2 Negation of character sets/classes

Typing a caret ^ immediately after the opening square bracket negates the character class. Example-

ex_colors <- c("grey", "black", "gray")
str_view(ex_colors, "gr[^c-f]y")

## [3] │ <gray>

So, in this case only gr followed by any character other than c to f, and then by y, is matched. This matches gray but not grey, because e lies in the range c-f. In general, putting a caret ^ right after the opening bracket matches every character except those listed in the set.
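
A short base R sketch of negated sets follows. The input strings "a1b2-c3" and "2^3" are made up for illustration. The last line shows that a caret placed anywhere other than the first position in the brackets is treated as a literal caret.

```r
ex_colors <- c("grey", "black", "gray")

# Only "gray" has a letter outside c-f between "gr" and "y"
not_c_to_f <- grepl("gr[^c-f]y", ex_colors)

# Hypothetical input: remove everything that is not a lower-case letter
letters_only <- gsub("[^a-z]", "", "a1b2-c3")

# A caret that is not the first character in the set is a literal caret
caret_literal <- grepl("[a^]", "2^3")

print(not_c_to_f)
print(letters_only)
print(caret_literal)
```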

29.2.3 Non-printable characters/ Meta characters (short-hand character classes)

We can use special character sequences to put non-printable characters in our regular expressions. E.g. \t matches a tab character. But since \ is also the escape character in R strings, we need to escape it too. So, to match a tab character we write \\t in our pattern. Similarly, the regex that matches a new line (line feed) is \\n. Other meta characters are listed below-

\\s matches a white-space character. Its complement \\S matches any character except white-space.
\\w matches any word character, i.e. a letter, a digit or an underscore. Its complement \\W matches any character that is not a word character.
\\d matches any digit. Its complement \\D matches any character except digits.
\\b matches a word boundary, i.e. the zero-width position between a word character and a non-word character. Its complement \\B matches any position that is not a word boundary.
. matches any character except a line break. To match a literal dot . we have to escape it; thus \\. matches a dot character.

See these examples-

ex_vec3 <- c("One apple", "2 oranges & 3 bananas.")
# match word character
str_view(ex_vec3, "\\w", match = NA)

## [1] │ <O><n><e> <a><p><p><l><e>
## [2] │ <2> <o><r><a><n><g><e><s> & <3> <b><a><n><a><n><a><s>.

# match any character followed by a dot character
str_view(ex_vec3, ".\\.", match = NA)

## [1] │ One apple
## [2] │ 2 oranges & 3 banana<s.>

# Note both character and dot will be matched
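
The double backslash is easy to misread, so the following base R sketch shows what the regex engine actually receives. It is only an illustration. The names pattern_length, digits, any_char and literal_dot are chosen here and are not part of the examples above.

```r
# The R string "\\d" holds two characters: a backslash and the letter d
pattern <- "\\d"
pattern_length <- nchar(pattern)
writeLines(pattern)   # prints the pattern as the regex engine sees it

ex_vec3 <- c("One apple", "2 oranges & 3 bananas.")
# One list element per input string, holding the digits found in it
digits <- regmatches(ex_vec3, gregexpr(pattern, ex_vec3))

# An unescaped dot matches any character, an escaped dot only a literal dot
any_char    <- grepl(".", "abc")
literal_dot <- grepl("\\.", c("abc", "a.c"))

print(pattern_length)
print(digits)
print(any_char)
print(literal_dot)
```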

29.3 Quantifiers

What if we want to match more than one character through regex? Say we want to check whether a given string or string vector contains two consecutive vowels. One method is to repeat the character class, i.e. [aeiou][aeiou]. But this goes against DRY (Don't Repeat Yourself), a common principle of programming. To solve this, we have quantifiers.

+ 1 or more occurrences
* 0 or more
? 0 or 1
{} specified numbers
- {n} exactly n
- {n,} n or more
- {n,m} between n and m

Thus, we may match two consecutive vowels using [aeiou]{2}. See this example

ex_vec <- c("Apple", "Banana", "pineapple")
str_view(ex_vec, "[aeiou]{2}", match = NA)

## [1] │ Apple
## [2] │ Banana
## [3] │ pin<ea>pple
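
To compare the quantifiers side by side, the base R sketch below applies each of them to the letter a in a set of made-up strings. The anchors ^ and $, covered later in this chapter, are used so that the whole string has to fit the pattern. The vector x and the result names are hypothetical.

```r
# Hypothetical test strings with zero to three a's between c and t
x <- c("ct", "cat", "caat", "caaat")

# ^ and $ tie the pattern to the start and end of the string
one_or_more  <- grepl("^ca+t$", x)
zero_or_more <- grepl("^ca*t$", x)
zero_or_one  <- grepl("^ca?t$", x)
exactly_two  <- grepl("^ca{2}t$", x)
two_or_more  <- grepl("^ca{2,}t$", x)
one_to_two   <- grepl("^ca{1,2}t$", x)

# One row per quantifier, one column per test string
result <- rbind(one_or_more, zero_or_more, zero_or_one,
                exactly_two, two_or_more, one_to_two)
colnames(result) <- x
print(result)
```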

29.4 Alternation

Alternation in regular expressions allows you to match one pattern or another. The pipe symbol | separates the alternative patterns. When more than one alternative occurs in the input string, the match that starts earliest in the string is returned.

29.4.0.0.1 Basic Alternation

Let’s start with a basic example to illustrate how alternation works:

string <- "I have an apple and a banana"
pattern <- "apple|banana"

str_extract(string, pattern)

## [1] "apple"

29.4.0.0.2 Order of Precedence

When using alternation, it is important to keep the order of evaluation in mind. The regex engine scans the input string from left to right and, at each position, tries the alternatives in the order in which they are written. The first successful match is returned and later possibilities are not considered. Here’s an example to illustrate this:

string <- "I have a pineapple and an apple"
str_extract(string, pattern = "apple|pineapple")

## [1] "pineapple"

In this example, the string contains both “pineapple” and “apple”, and we want to extract the first occurrence of either word. The pattern apple|pineapple means “match ‘apple’ OR ‘pineapple’”. Although apple is written first in the pattern, “pineapple” starts earlier in the input string, so str_extract() returns “pineapple”.
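
The two rules (earliest position first, then the order of the alternatives) can be checked with a small base R sketch. It assumes perl = TRUE, because the PCRE engine, like the ICU engine behind stringr as far as this sketch assumes, tries alternatives from left to right. Base R's default engine may resolve the second case differently. The variable names are illustrative.

```r
string <- "I have a pineapple and an apple"

# The earliest starting position wins, whatever the order in the pattern
first_hit <- regmatches(string, regexpr("apple|pineapple", string, perl = TRUE))

# When two alternatives can start at the same position,
# the one written first is tried first
pine_first <- regmatches("pineapple",
                         regexpr("pine|pineapple", "pineapple", perl = TRUE))
pineapple_first <- regmatches("pineapple",
                              regexpr("pineapple|pine", "pineapple", perl = TRUE))

print(first_hit)
print(pine_first)
print(pineapple_first)
```

A practical consequence is to list the longer alternative first whenever one alternative is a prefix of another.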

29.4.0.0.3 Grouping Alternatives

We can also use parentheses to group alternative patterns together. This can be useful for specifying more complex patterns. Example:

string <- "Apple and pineapples are good for health"
pattern <- "(apple|banana|cherry) (and|or) (pineapple|kiwi|mango)"

str_view(string, regex(pattern, ignore_case = TRUE))

## [1] │ <Apple and pineapple>s are good for health

In the above example, we have used stringr::regex() to set the flag that ignores case while matching.
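
Parentheses also limit how far an alternation reaches. Without them, | splits the whole pattern into two alternatives. The base R sketch below illustrates this with a made-up vector words.

```r
# Hypothetical words for illustration
words <- c("gray", "grey", "honey")

# Without parentheses the alternatives are "gra" and "ey"
ungrouped <- grepl("gra|ey", words)

# With parentheses only the middle letter alternates
grouped <- grepl("gr(a|e)y", words)

print(ungrouped)
print(grouped)
```

The ungrouped pattern also accepts "honey", because that word contains "ey".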

29.5 Anchors

Anchors in regular expressions allow you to match patterns at specific positions within the input string. In R, you can use various anchors in your regular expressions to match the beginning, end, or specific positions within the input text.

29.5.1 Beginning and End Anchors

The beginning anchor ^ and end anchor $ are used to match patterns at the beginning or end of the input string, respectively. Example

string <- "The quick brown fox jumps over the lazy dog. The fox is brown."
pattern <- "^the"
str_view(string, regex(pattern, ignore_case = TRUE))

## [1] │ <The> quick brown fox jumps over the lazy dog. The fox is brown.

In the above example, we match the word the only at the beginning of the string. The other occurrences, including the The that begins the second sentence, are not matched.
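
The end anchor $ works the same way at the other end of the string. The following base R sketch uses grepl() to test both anchors, and counts the unanchored matches for comparison. The variable names are chosen for illustration.

```r
string <- "The quick brown fox jumps over the lazy dog. The fox is brown."

# ^ ties the match to the start of the string
starts_with_the <- grepl("^the", string, ignore.case = TRUE)
starts_with_fox <- grepl("^fox", string)

# $ ties the match to the end of the string; the final dot is escaped
ends_with_brown <- grepl("brown\\.$", string)

# Without an anchor, every occurrence of "the" is found
n_the <- lengths(regmatches(string, gregexpr("the", string, ignore.case = TRUE)))

print(c(starts_with_the, starts_with_fox, ends_with_brown))
print(n_the)
```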

29.5.2 Word Boundary Anchors

The word boundary anchor \\b is used to match patterns at the beginning or end of a word within the input string. Example

string <- 'Apple and pineapple, both are good for health'
pattern <- '\\bapple\\b'
str_view(string, regex(pattern, ignore_case = TRUE))

## [1] │ <Apple> and pineapple, both are good for health

In the above example, although the string apple is also contained in the word pineapple, the word boundaries limit our search to whole words only.
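
To see what the boundaries change, the base R sketch below extracts the matches with and without \\b, and then uses the complement \\B to find only the apple that sits inside another word. It assumes perl = TRUE for the boundary patterns. The variable names are illustrative.

```r
string <- "Apple and pineapple, both are good for health"

# Without boundaries, the apple inside pineapple is matched too
anywhere <- regmatches(string,
                       gregexpr("apple", string, ignore.case = TRUE))[[1]]

# \\b on both sides keeps whole words only
whole_word <- regmatches(string,
                         gregexpr("\\bapple\\b", string,
                                  ignore.case = TRUE, perl = TRUE))[[1]]

# \\B requires that there is no word boundary at that position
inside_word <- regmatches(string,
                          gregexpr("\\Bapple", string,
                                   ignore.case = TRUE, perl = TRUE))[[1]]

print(anywhere)
print(whole_word)
print(inside_word)
```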

29.6 Capture Groups

A capture group is a way to group a part of a regular expression and capture it as a separate sub-string. This can be useful when you want to extract or replace a specific part of a string. In R, capture groups are denoted by parentheses (). Anything inside the parentheses is captured and can be referenced later in the regular expression or in the replacement string.

One use of a capture group is to refer back to it within the same match using a back reference: \1 refers to the text matched by the first pair of parentheses, \2 to the second, and so on. As with other escapes, these are written as \\1 and \\2 inside an R string.

Example-1

my_fruits <- c('apple', 'banana', 'coconut', 'berry', 'cucumber', 'date')
# search for a repeated character
pattern <- '(.)\\1'
str_view(my_fruits, regex(pattern), match = NA)

## [1] │ a<pp>le
## [2] │ banana
## [3] │ coconut
## [4] │ be<rr>y
## [5] │ cucumber
## [6] │ date

Example-2

# search for a repeated pair of characters
pattern <- '(..)\\1'
str_view(my_fruits, regex(pattern), match = NA)

## [1] │ apple
## [2] │ b<anan>a
## [3] │ <coco>nut
## [4] │ berry
## [5] │ <cucu>mber
## [6] │ date
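
The same back references can be used to filter or extract rather than just view. This is a minimal base R sketch, assuming perl = TRUE. The names has_double, doubled and repeated_pairs are chosen for illustration.

```r
my_fruits <- c("apple", "banana", "coconut", "berry", "cucumber", "date")

# TRUE where some character is immediately followed by itself
has_double <- grepl("(.)\\1", my_fruits, perl = TRUE)
doubled <- my_fruits[has_double]

# regexpr() finds the first match per element; regmatches() returns
# the matched text for the elements that have a match
repeated_pairs <- regmatches(my_fruits,
                             regexpr("(..)\\1", my_fruits, perl = TRUE))

print(doubled)
print(repeated_pairs)
```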

Another use of capture groups is to replace the matched pattern with something else. This is best understood through the following example-

# We have names in last_name, first_name format
names <- c('Hanks, Tom', 'Affleck, Ben', 'Damon, Matt')
str_view(names)

## [1] │ Hanks, Tom
## [2] │ Affleck, Ben
## [3] │ Damon, Matt

# Using this regex, we can convert these to first_name last_name format
str_replace_all(names, '(\\w+),\\s+(\\w+)', '\\2 \\1')

## [1] "Tom Hanks"   "Ben Affleck" "Matt Damon"
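
Capture groups can also be pulled out as separate pieces instead of being rearranged. One way to sketch this in base R is regexec() together with regmatches(), which returns the full match followed by each group. The vector is renamed actor_names here, a hypothetical name chosen to avoid masking the base function names().

```r
# Hypothetical vector name, chosen to avoid masking base::names()
actor_names <- c("Hanks, Tom", "Affleck, Ben", "Damon, Matt")
pattern <- "(\\w+),\\s+(\\w+)"

# For each string: the full match, then group 1, then group 2
parts <- regmatches(actor_names, regexec(pattern, actor_names))

last_name  <- sapply(parts, `[`, 2)
first_name <- sapply(parts, `[`, 3)

# The same swap as above, written with base R's sub()
swapped <- sub(pattern, "\\2 \\1", actor_names)

print(parts[[1]])
print(first_name)
print(last_name)
print(swapped)
```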

29.7 Lookarounds

Look-ahead and look-behinds are zero-width assertions in regex. They are used to match a pattern only if it is followed or preceded by another pattern, respectively. The pattern in the look-ahead or look-behind is not included in the match.

29.7.1 Lookahead

A look-ahead is used to match a pattern only if it is followed by another pattern. Positive Lookaheads are written as (?=...), where ... is the pattern that must follow the match.

For example, the regex pattern hello(?= world) matches “hello” only if it is followed by “ world”. It matches the hello in “hello world” but nothing in “hello there world” or “hello”.

Example

string <- c("hello world", "hello there world", "hello")
str_view(string, "hello(?= world)", match = NA)

## [1] │ <hello> world
## [2] │ hello there world
## [3] │ hello

# Note that "world" is not included in the match

29.7.2 Lookbehind

A look-behind is used to match a pattern only if it is preceded by another pattern. Look-behinds are written as (?<=...), where ... is the pattern that must precede the match.

For example, the regex pattern (?<=hello )world matches “world” only if it is immediately preceded by “hello ”. It matches the world in “hello world” but nothing in “world hello” or “hello there world”.

Example

string <- c("hello world", "world hello", "hello there world")
str_view(string, "(?<=hello )world", match = NA)

## [1] │ hello <world>
## [2] │ world hello
## [3] │ hello there world

29.7.3 Negative Lookahead and Lookbehinds

Negative look-ahead and negative look-behinds are used to match a pattern only if it is not followed or preceded by another pattern, respectively. Negative look-ahead and look-behinds are written as (?!...) and (?<!...), respectively.

For example, the regex pattern hello(?! world) matches “hello” only if it is not followed by “ world”. It matches the hello in “hello there” but nothing in “hello world” or “hello world there”.

Example-

string <- c("hello there", "hello world", "hello world there")
str_view(string, "hello(?! world)", match = NA)

## [1] │ <hello> there
## [2] │ hello world
## [3] │ hello world there

And the regex pattern (?<!hello )world matches “world” only if it is not preceded by “hello”. It matches “world hello” and “hello there world” but not “hello world”.

string <- c("hello world", "world hello", "hello there world")
str_view(string, "(?<!hello )world", match = NA)

## [1] │ hello world
## [2] │ <world> hello
## [3] │ hello there <world>
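
In base R, lookarounds are available when perl = TRUE is set. The sketch below repeats the checks with grepl(). Because every element of the first vector contains "world", the positive and negative look-behind split the vector into two complementary groups. The variable names are illustrative.

```r
string <- c("hello world", "world hello", "hello there world")

# Lookarounds need perl = TRUE in base R
after_hello     <- grepl("(?<=hello )world", string, perl = TRUE)
not_after_hello <- grepl("(?<!hello )world", string, perl = TRUE)

# Negative look-ahead on the strings from the earlier example
string2 <- c("hello there", "hello world", "hello world there")
not_before_world <- grepl("hello(?! world)", string2, perl = TRUE)

print(after_hello)
print(not_after_hello)
print(not_before_world)
```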

The difference between look-ahead and look-behind may seem subtle, but it becomes clear when strings need to be replaced or extracted.

Examples-

string <- "I have 10 apples, 6 pineapples and 5 bananas"

# look-behind to match "apples" preceded by a digit and a space
pattern1 <- "(?<=\\d\\s)apples"  

# look-ahead to match count of apples
pattern2 <- "\\d+(?=\\sapple)"  

str_view(string = string, pattern = pattern1, match = NA)

## [1] │ I have 10 <apples>, 6 pineapples and 5 bananas

str_view(string = string, pattern = pattern2, match = NA)

## [1] │ I have <10> apples, 6 pineapples and 5 bananas
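
Because the look-ahead is not part of the match, a replacement touches only the text in front of it. The base R sketch below, assuming perl = TRUE, contrasts a replacement with and without the look-ahead. The replacement text "ten" and the variable names are made up for illustration.

```r
string <- "I have 10 apples, 6 pineapples and 5 bananas"

# With the look-ahead only the number is matched, so only it is replaced
with_lookahead <- sub("\\d+(?=\\sapples)", "ten", string, perl = TRUE)

# Without it, the space and the word are part of the match and are replaced too
without_lookahead <- sub("\\d+\\sapples", "ten", string)

# Counts in front of any word that ends in "apples"
apple_counts <- regmatches(string,
                           gregexpr("\\d+(?=\\s\\w*apples)", string,
                                    perl = TRUE))[[1]]

print(with_lookahead)
print(without_lookahead)
print(apple_counts)
```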

29.8 Comments

29.8.1 Comments within regex

We can add a comment inside a regex pattern with the construct (?#...). Everything between (?# and the closing parenthesis is ignored by the regex engine. This can be useful for documenting parts of a pattern. Example-

str_view(c("xyz","abc"), "x(?#this is a comment)", match = NA)

## [1] │ <x>yz
## [2] │ abc

29.8.2 Verbose Mode (multi-line comments)

In regular expressions, verbose mode is a feature that allows us to write more readable and maintainable patterns by adding comments and white-space without affecting their behavior. In this mode, white-space in the pattern is ignored and any text following a # on a line is treated as a comment. To enable verbose mode, we can put the (?x) modifier at the beginning of the pattern or, in stringr, set the argument comments = TRUE in regex().

Example - Using this regex we can extract words that have a vowel in the third position.

string <- "The quick brown fox jumps over the lazy dog"
pattern <- "(?x)      # Enable verbose mode
            \\b       # Match word boundary
            \\w{2}    # Match the first two word characters
            [aeiou]   # Match a vowel
            \\w*      # Match optional word characters
            \\b       # Match word boundary"
str_view(string, pattern, match = NA)

## [1] │ <The> <quick> <brown> fox jumps <over> <the> lazy dog
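
One consequence of verbose mode is that a space in the pattern no longer matches a space in the text. The base R sketch below assumes perl = TRUE with the (?x) modifier. It extracts the same words as above, and then shows that a literal space has to be escaped as \\ (a backslash followed by a space). The variable names are illustrative.

```r
string <- "The quick brown fox jumps over the lazy dog"

pattern <- "(?x)         # verbose mode
            \\b \\w{2}   # two word characters at the start of a word
            [aeiou]      # a vowel in the third position
            \\w* \\b     # the rest of the word"
third_vowel <- regmatches(string, gregexpr(pattern, string, perl = TRUE))[[1]]

# In verbose mode an unescaped space is ignored: this looks for "lazydog"
unescaped <- grepl("(?x) lazy dog", string, perl = TRUE)

# An escaped space is matched literally: this looks for "lazy dog"
escaped <- grepl("(?x) lazy \\ dog", string, perl = TRUE)

print(third_vowel)
print(c(unescaped, escaped))
```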
