29 Regex - A quick introduction
A regular expression, or regex for short, is a sequence of characters that defines a search pattern; in other words, it is a pattern that describes a set of strings. It is a powerful tool for writing pattern-matching code that searches and manipulates text. Regex is available in many programming languages, including R.
Regex patterns are made up of a combination of regular characters and special characters. Regular characters include letters, digits, and punctuation marks. Special characters have a specific meaning in regex and are used to represent patterns of characters.
Regex patterns can be used for a variety of purposes, including:
- Searching for specific strings in text
- Extracting specific parts of a string
- Replacing parts of a string with other text
- Validating input from users
In R, we can use the `grep` and `gsub` functions to search for and manipulate text using regex.
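As a minimal sketch of these two base R functions (the vector `fruits` is made-up example data, not from the text): `grep` reports which elements contain a match, and `gsub` replaces every match.

```r
# Hypothetical example data
fruits <- c("apple", "banana", "cherry")

# grep() returns the indices of the elements that contain a match
idx <- grep("an", fruits)

# gsub() replaces every occurrence of the pattern in each element
replaced <- gsub("a", "A", fruits)

print(idx)       # 2
print(replaced)  # "Apple" "bAnAnA" "cherry"
```

Most of the remaining examples in this chapter use stringr functions such as `str_view` rather than these base functions.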
29.1 Basic Regex - Literal Characters
Every literal character is itself a regex that matches itself. Thus, `a` matches the third character in the text `Charles`. Literal characters are case sensitive.
Example-1
library(stringr)
ex_text <- "This is an example text"
# Match literal `x`
str_view(ex_text, "x")
## [1] │ This is an e<x>ample te<x>t
# Match Upper case literal "X"
str_view(ex_text, "X", match = NA)
## [1] │ This is an example text
29.1.1 Case sensitivity
Because literals are case sensitive and we do not always know the exact case in advance, we can match case-insensitively with the stringr function `regex`, which has an argument `ignore_case` (note the snake case). Behind the scenes, every pattern passed to a stringr function is wrapped in `regex()` with `ignore_case` defaulting to `FALSE`. Thus, the code in the example above is equivalent to the following-
## [1] │ This is an e<x>ample te<x>t
## [1] │ This is an example text
Thus, to match literals (or other regex patterns) case-insensitively, we can set the argument `ignore_case` to `TRUE`, like this-
## [1] │ This is an e<x>ample te<x>t
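The calls that produce the outputs in this subsection are not shown here. A plausible sketch, assuming `ex_text` from Example-1, is given below; `str_detect` and `str_extract_all` are used alongside `str_view` only so that the results can be inspected as plain values.

```r
library(stringr)

ex_text <- "This is an example text"

# Explicit form of the default, case-sensitive behaviour
sensitive <- str_detect(ex_text, regex("X", ignore_case = FALSE))

# Case-insensitive matching: "X" now also matches "x"
insensitive <- str_extract_all(ex_text, regex("X", ignore_case = TRUE))[[1]]

print(sensitive)    # FALSE
print(insensitive)  # "x" "x"

# The same pattern, highlighted in the string
str_view(ex_text, regex("X", ignore_case = TRUE))
```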
29.2 Metacharacters
29.2.1 Character sets
It is not always feasible to spell out every literal character. We can instead match any one character from a given set of options by putting the options in square brackets. So, `[abc]` matches either `a`, `b`, or `c`.
Example-
## [1] │ Appl<e>
## [2] │ Or<a>ng<e>
## [1] │ <A>ppl<e>
## [2] │ <O>r<a>ng<e>
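The code for this example is not shown. Judging from the output, it was probably close to the following sketch, in which the vector name `ex_vec` and the exact patterns are assumptions.

```r
library(stringr)

# Assumed input, reconstructed from the output above
ex_vec <- c("Apple", "Orange")

# Lower-case vowels only
lower_vowels <- str_extract_all(ex_vec, "[aeiou]")

# Vowels in either case, by listing both cases inside the set
any_vowels <- str_extract_all(ex_vec, "[aeiouAEIOU]")

print(lower_vowels)
print(any_vowels)
```

Passing `regex("[aeiou]", ignore_case = TRUE)` would be an alternative to listing both cases in the set.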
To match a range of characters or digits, separate the endpoints with a hyphen inside the square brackets. So, `[a-n]` matches any one character in the range `[abcdefghijklmn]`.
Example-
ex_text <- "The quick brown fox jumps over the lazy dog"
# Match a, b or c in lower case
str_view(ex_text, regex("[a-c]"))
## [1] │ The qui<c>k <b>rown fox jumps over the l<a>zy dog
Example-2
## [1] │ <grey>
## [3] │ <gray>
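The code for Example-2 is also missing. A sketch that yields the same pattern of matches is shown below; the input vector (in particular its second element) and the pattern `gr[ae]y` are assumptions, since only the output survives.

```r
library(stringr)

# Hypothetical input: elements 1 and 3 are taken from the output above,
# element 2 is a placeholder for the non-matching value
ex_vec2 <- c("grey", "black", "gray")

# "gr", then either "a" or "e", then "y"
matched <- str_extract(ex_vec2, "gr[ae]y")

print(matched)  # "grey" NA "gray"
```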
We can also use the pre-built character classes listed below. These are used inside a bracket expression, e.g. `[[:digit:]]`.
- `[:punct:]` punctuation.
- `[:alpha:]` letters.
- `[:lower:]` lowercase letters.
- `[:upper:]` uppercase letters.
- `[:digit:]` digits.
- `[:xdigit:]` hex digits.
- `[:alnum:]` letters and numbers.
- `[:cntrl:]` control characters.
- `[:graph:]` letters, numbers, and punctuation.
- `[:print:]` letters, numbers, punctuation, and white-space.
- `[:space:]` space characters (basically equivalent to `\\s`).
- `[:blank:]` space and tab.
Example-
## [1] │ One apple
## [2] │ <2> Oranges
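A sketch of how such a class is used in practice, assuming the input vector implied by the output above (the name `ex_vec_cls` is made up). Note the double brackets: the class name goes inside a bracket expression.

```r
library(stringr)

# Assumed input, reconstructed from the output above
ex_vec_cls <- c("One apple", "2 Oranges")

# First digit in each element, NA where there is none
first_digit <- str_extract(ex_vec_cls, "[[:digit:]]")

# Number of white-space characters in each element
n_space <- str_count(ex_vec_cls, "[[:space:]]")

print(first_digit)  # NA "2"
print(n_space)      # 1 1
```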
29.2.2 Negation of character sets/classes
Typing a caret `^` immediately after the opening square bracket negates the character class. Example-
## [3] │ <gray>
So, in this case (a pattern such as `gr[^c-f]y`), `gr` followed by any one character other than `c` to `f`, followed in turn by `y`, will be matched; this matches `gray` but not `grey`. Thus, putting a caret `^` at the start of a character class, before the characters or classes to match, matches any character except those listed.
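A minimal sketch of this negated range; the input vector is hypothetical (only its third element is visible in the output above), and the pattern is inferred from the description.

```r
library(stringr)

# Hypothetical input vector
ex_vec2 <- c("grey", "black", "gray")

# "gr", then any one character NOT in the range c-f, then "y"
negated <- str_detect(ex_vec2, "gr[^c-f]y")

print(negated)  # FALSE FALSE TRUE
```

Here `grey` fails because `e` lies inside the excluded range, while the `a` in `gray` lies outside it.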
29.2.3 Non-printable characters/ Meta characters (short-hand character classes)
We can use special character sequences to put non-printable characters in our regular expressions. E.g. `\t` matches a tab character. But since `\` is also the escape character in R strings, we need to escape it too. So to match a tab character we write `\\t` in our pattern. Similarly, the regex that matches a new line (line feed) is `\\n`. The regex for other meta characters is listed below-
- `\\s` matches a white-space character. Its complement `\\S` matches any character except white-space.
- `\\w` matches any word character (a letter, digit, or underscore). Its complement `\\W` matches any character that is not a word character.
- `\\d` matches any digit. Its complement `\\D` matches any character except a digit.
- `\\b` matches a word boundary, i.e. a position rather than a character. Its complement `\\B` matches any position that is not a word boundary.
- `.` matches any character (by default, any character except a line break). To match a literal dot, we have to escape it; thus `\\.` matches a dot character.
See these examples-
ex_vec3 <- c("One apple", "2 oranges & 3 bananas.")
# match word character
str_view(ex_vec3, "\\w", match = NA)
## [1] │ <O><n><e> <a><p><p><l><e>
## [2] │ <2> <o><r><a><n><g><e><s> & <3> <b><a><n><a><n><a><s>.
# match any character followed by a dot character
str_view(ex_vec3, ".\\.", match = NA)
## [1] │ One apple
## [2] │ 2 oranges & 3 banana<s.>
# Note both character and dot will be matched
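The double backslash can be confusing at first. The sketch below, which reuses the second string from `ex_vec3`, shows that the regex engine only sees a single backslash, and then applies two of the short-hand classes.

```r
library(stringr)

x <- "2 oranges & 3 bananas."

# The R string "\\d" holds two characters, a backslash and a "d";
# writeLines() prints what the regex engine actually receives
writeLines("\\d")

# All digits in the string
digits <- str_extract_all(x, "\\d")[[1]]

# Number of white-space characters
n_ws <- str_count(x, "\\s")

print(digits)  # "2" "3"
print(n_ws)    # 4
```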
29.3 Quantifiers
What if we want to match more than one character through regex? Say we want to check whether a given string or string vector contains two consecutive vowels. One method is to use the character class twice, i.e. `[aeiou][aeiou]`. But this goes against DRY (Don't Repeat Yourself), a common principle of programming. To solve this, we have quantifiers.
- `+` 1 or more occurrences
- `*` 0 or more
- `?` 0 or 1
- `{}` a specified number of occurrences
  - `{n}` exactly n
  - `{n,}` n or more
  - `{n,m}` between n and m
Thus, we may match two consecutive vowels using `[aeiou]{2}`. See this example-
## [1] │ Apple
## [2] │ Banana
## [3] │ pin<ea>pple
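The call behind this output is not shown. A sketch, assuming the input vector implied by the output (the name `ex_fruits` is made up), together with two of the other quantifiers on small hypothetical inputs:

```r
library(stringr)

# Assumed input, reconstructed from the output above
ex_fruits <- c("Apple", "Banana", "pineapple")

# Exactly two consecutive lower-case vowels
two_vowels <- str_extract(ex_fruits, "[aeiou]{2}")

# "?" makes the preceding character optional
optional_u <- str_detect(c("color", "colour"), "colou?r")

# "+" is greedy: it takes as many repetitions as it can
run_of_o <- str_extract("bookkeeper", "o+")

print(two_vowels)  # NA NA "ea"
print(optional_u)  # TRUE TRUE
print(run_of_o)    # "oo"
```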
29.4 Alternation
Alternation in regular expressions allows you to match one pattern or another, whichever matches first in the input string. The pipe symbol `|` is used to separate the alternative patterns.
29.4.0.0.1 Basic Alternation
Let’s start with a basic example to illustrate how alternation works:
string <- "I have an apple and a banana"
pattern <- "apple|banana"
str_extract(string, pattern)
## [1] "apple"
29.4.0.0.2 Order of Precedence
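When using alternation, it's important to keep in mind how a match is selected. The engine scans the input from left to right and returns the match that starts earliest in the string. The order of the alternatives only matters when more than one of them can match at the same starting position; in that case the leftmost alternative that matches is selected and the remaining ones are not considered. Here's an example to illustrate this: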
string <- "I have a pineapple and an apple"
str_extract(string, pattern = "apple|pineapple")
## [1] "pineapple"
In this example, the string `string` contains the words “apple” and “pineapple”. We want to extract the first occurrence of either “apple” or “pineapple” from this text using a regular expression pattern that utilizes alternation. The pattern `apple|pineapple` means “match ‘apple’ OR ‘pineapple’”. Although `apple` is listed first in the pattern, the word “pineapple” starts earlier in the input string than the standalone “apple”, so `str_extract()` returns “pineapple”.
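The order of the alternatives does matter when two of them can match at the same starting position. A small sketch with a hypothetical input string:

```r
library(stringr)

# Both alternatives can match at the first character of this string
x <- "pineapples"

# The leftmost alternative that matches wins, even if a longer one follows
short_first <- str_extract(x, "pine|pineapple")

# Listing the longer alternative first lets it match instead
long_first <- str_extract(x, "pineapple|pine")

print(short_first)  # "pine"
print(long_first)   # "pineapple"
```

A common rule of thumb that follows from this is to list longer alternatives before their prefixes.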
29.4.0.0.3 Grouping Alternatives
We can also use parentheses to group alternative patterns together. This can be useful for specifying more complex patterns. Example:
string <- "Apple and pineapples are good for health"
pattern <- "(apple|banana|cherry) (and|or) (pineapple|kiwi|mango)"
str_view(string, regex(pattern, ignore_case = TRUE))
## [1] │ <Apple and pineapple>s are good for health
In the above example, we have used `stringr::regex()` to set the flag that ignores case while matching.
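To see which alternative was chosen inside each pair of parentheses, `str_match()` can be used on the same string and pattern; it returns the full match followed by one column per group. A minimal sketch:

```r
library(stringr)

string <- "Apple and pineapples are good for health"
pattern <- "(apple|banana|cherry) (and|or) (pineapple|kiwi|mango)"

# Column 1 is the whole match; columns 2 to 4 are the three groups
m <- str_match(string, regex(pattern, ignore_case = TRUE))

print(m[1, 1])  # "Apple and pineapple"
print(m[1, 2])  # "Apple"
print(m[1, 3])  # "and"
print(m[1, 4])  # "pineapple"
```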
29.5 Anchors
Anchors in regular expressions allow you to match patterns at specific positions within the input string. In R, you can use various anchors in your regular expressions to match the beginning, end, or specific positions within the input text.
29.5.1 Beginning and End Anchors
The beginning anchor `^` and end anchor `$` are used to match patterns at the beginning or end of the input string, respectively. Example-
string <- "The quick brown fox jumps over the lazy dog. The fox is brown."
pattern <- "^the"
str_view(string, regex(pattern, ignore_case = TRUE))
## [1] │ <The> quick brown fox jumps over the lazy dog. The fox is brown.
In the above example, we match the word `the` only where it occurs at the beginning of the string; the later occurrences are not matched.
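A short sketch on the same string that contrasts the anchored and unanchored patterns and also exercises the end anchor `$`:

```r
library(stringr)

string <- "The quick brown fox jumps over the lazy dog. The fox is brown."

# Without an anchor, every "the" is counted (ignoring case)
n_all <- str_count(string, regex("the", ignore_case = TRUE))

# With "^", only the occurrence at the start of the string is counted
n_start <- str_count(string, regex("^the", ignore_case = TRUE))

# "$" ties the match to the end of the string: last word plus the final dot
last_word <- str_extract(string, "\\w+\\.$")

print(n_all)      # 3
print(n_start)    # 1
print(last_word)  # "brown."
```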
29.5.2 Word Boundary Anchors
The word boundary anchor `\\b` is used to match patterns at the beginning or end of a word within the input string. Example-
string <- 'Apple and pineapple, both are good for health'
pattern <- '\\bapple\\b'
str_view(string, regex(pattern, ignore_case = TRUE))
## [1] │ <Apple> and pineapple, both are good for health
In the above example, though the string `apple` is also contained in another word, `pineapple`, we are limiting our search to whole words only.
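The effect of each boundary can be checked by counting matches on the same string. A sketch (the helper `ci` is a made-up convenience wrapper, not part of stringr):

```r
library(stringr)

string <- "Apple and pineapple, both are good for health"

# Hypothetical helper: wrap a pattern so that it ignores case
ci <- function(p) regex(p, ignore_case = TRUE)

n_plain    <- str_count(string, ci("apple"))          # no boundaries
n_leading  <- str_count(string, ci("\\bapple"))       # boundary before only
n_trailing <- str_count(string, ci("apple\\b"))       # boundary after only
n_whole    <- str_count(string, ci("\\bapple\\b"))    # whole word

print(c(n_plain, n_leading, n_trailing, n_whole))  # 2 1 2 1
```

The trailing boundary alone does not exclude `pineapple`, because the comma after it also forms a word boundary.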
29.6 Capture Groups
A capture group is a way to group a part of a regular expression and capture it as a separate sub-string. This can be useful when you want to extract or replace a specific part of a string. In R, capture groups are denoted by parentheses `()`. Anything inside the parentheses is captured and can be referenced later in the regular expression or in the replacement string.
One use of a capturing group is to refer back to it within the same pattern using a back-reference: `\1` refers to the match contained in the first pair of parentheses, `\2` to the second, and so on. In an R string these are written as `\\1` and `\\2`.
Example-1
my_fruits <- c('apple', 'banana', 'coconut', 'berry', 'cucumber', 'date')
# search for a repeated character
pattern <- '(.)\\1'
str_view(my_fruits, regex(pattern), match = NA)
## [1] │ a<pp>le
## [2] │ banana
## [3] │ coconut
## [4] │ be<rr>y
## [5] │ cucumber
## [6] │ date
Example-2
# search for a repeated pair of characters
pattern <- '(..)\\1'
str_view(my_fruits, regex(pattern), match = NA)
## [1] │ apple
## [2] │ b<anan>a
## [3] │ <coco>nut
## [4] │ berry
## [5] │ <cucu>mber
## [6] │ date
Another use of capturing groups is in replacement, when we want to rearrange or reuse parts of the match. This is best understood through the following example-
# We have names in last_name, first_name format
names <- c('Hanks, Tom', 'Affleck, Ben', 'Damon, Matt')
str_view(names)
## [1] │ Hanks, Tom
## [2] │ Affleck, Ben
## [3] │ Damon, Matt
# Using this regex, we can convert these to first_name last_name format
str_replace_all(names, '(\\w+),\\s+(\\w+)', '\\2 \\1')
## [1] "Tom Hanks" "Ben Affleck" "Matt Damon"
29.7 Lookarounds
Look-ahead and look-behinds are zero-width assertions in regex. They are used to match a pattern only if it is followed or preceded by another pattern, respectively. The pattern in the look-ahead or look-behind is not included in the match.
29.7.1 Lookahead
A look-ahead is used to match a pattern only if it is followed by another pattern. Positive look-aheads are written as `(?=...)`, where `...` is the pattern that must follow the match.
For example, the regex pattern `hello(?= world)` matches “hello” only if it is followed by “ world”. It matches the “hello” in “hello world” but nothing in “hello there world” or “hello”.
Example
string <- c("hello world", "hello there world", "hello")
str_view(string, "hello(?= world)", match = NA)
## [1] │ <hello> world
## [2] │ hello there world
## [3] │ hello
# Note that "world" is not included in the match
29.7.2 Lookbehind
A look-behind is used to match a pattern only if it is preceded by another pattern. Positive look-behinds are written as `(?<=...)`, where `...` is the pattern that must precede the match.
For example, the regex pattern `(?<=hello )world` matches “world” only if it is immediately preceded by “hello ”. It matches the “world” in “hello world” but nothing in “world hello” or “hello there world”.
Example
string <- c("hello world", "world hello", "hello there world")
str_view(string, "(?<=hello )world", match = NA)
## [1] │ hello <world>
## [2] │ world hello
## [3] │ hello there world
29.7.3 Negative Lookahead and Lookbehinds
Negative look-aheads and negative look-behinds are used to match a pattern only if it is not followed or preceded by another pattern, respectively. They are written as `(?!...)` and `(?<!...)`, respectively.
For example, the regex pattern `hello(?! world)` matches “hello” only if it is not followed by “ world”. It matches the “hello” in “hello there” but nothing in “hello world” or “hello world there”.
Example-
string <- c("hello there", "hello world", "hello world there")
str_view(string, "hello(?! world)", match = NA)
## [1] │ <hello> there
## [2] │ hello world
## [3] │ hello world there
And the regex pattern `(?<!hello )world` matches “world” only if it is not immediately preceded by “hello ”. It matches the “world” in “world hello” and “hello there world” but not in “hello world”.
string <- c("hello world", "world hello", "hello there world")
str_view(string, "(?<!hello )world", match = NA)
## [1] │ hello world
## [2] │ <world> hello
## [3] │ hello there <world>
While the difference between look-ahead and look-behind may seem subtle, it becomes clear when string replacement or extraction is required.
Examples-
string <- "I have 10 apples, 6 pineapples and 5 bananas"
# look-behind to match "apples" preceded by a digit and a space
pattern1 <- "(?<=\\d\\s)apples"
# look-ahead to match count of apples
pattern2 <- "\\d+(?=\\sapple)"
str_view(string = string, pattern = pattern1, match = NA)
## [1] │ I have 10 <apples>, 6 pineapples and 5 bananas
str_view(string = string, pattern = pattern2, match = NA)
## [1] │ I have <10> apples, 6 pineapples and 5 bananas
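Because the look-around part is not included in the match, the same two patterns can be used directly for extraction and replacement. A sketch (the replacement word "oranges" is an arbitrary choice for illustration):

```r
library(stringr)

string <- "I have 10 apples, 6 pineapples and 5 bananas"
pattern1 <- "(?<=\\d\\s)apples"   # "apples" preceded by a digit and a space
pattern2 <- "\\d+(?=\\sapple)"    # digits followed by a space and "apple"

# Only the number is extracted; the look-ahead text is left out
apple_count <- str_extract(string, pattern2)

# Only "apples" is replaced; the digit and the space before it are untouched
swapped <- str_replace(string, pattern1, "oranges")

print(apple_count)  # "10"
print(swapped)      # "I have 10 oranges, 6 pineapples and 5 bananas"
```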
29.8 Comments
29.8.1 Comments within regex
We can use the `#` character to add comments within a regex pattern. When comments (free-spacing) mode is enabled, any text following the `#` symbol on a line is ignored by the regex engine and treated as a comment. This can be useful for documenting your regex patterns or temporarily disabling parts of a pattern for testing or debugging.
29.8.2 Verbose Mode (multi-line comments)
In regular expressions, verbose mode is a feature that allows you to write more readable and maintainable regex patterns by adding comments and white-space without affecting their behavior. To enable verbose mode, we can put the `(?x)` modifier at the beginning of the regex pattern or, in stringr, pass `comments = TRUE` to `regex()`.
Example: a verbose pattern can be used to extract words that contain a vowel in the third position.
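The original example is not shown here. The following is a minimal sketch of such a pattern, assuming a made-up vector `words` and using the `comments = TRUE` argument of `stringr::regex()`; the exact pattern in the source may differ.

```r
library(stringr)

# Hypothetical input
words <- c("apple", "grape", "cherry", "plum", "kiwi")

# In comments mode, white-space and everything after "#" on a line are ignored
pattern <- regex("
  ^            # start of the string
  [a-z]{2}     # any two lower-case letters
  [aeiou]      # a vowel in the third position
  [a-z]*       # the rest of the word
  $            # end of the string
", comments = TRUE)

vowel_third <- str_subset(words, pattern)
print(vowel_third)  # "grape" "cherry" "plum"

# The inline modifier (?x) switches on the same mode
inline_ok <- str_detect("grape", "(?x) ^ [a-z]{2} [aeiou]")
print(inline_ok)    # TRUE
```

In this mode a literal space in the pattern is ignored, so a space that must be matched has to be written as `\\s` or escaped.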