30 Regex in human readable format using `rebus`

Regular expressions, as explained in the previous chapter, are very powerful. However, they are often difficult to interpret. The rebus package in R lets us build complex regular expressions from human-readable building blocks. So instead of writing [a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4} and later trying to decipher what this expression actually means, we can use a more human-readable format:

one_or_more(char_class(ASCII_ALNUM %R% "._%+-")) %R%
  "@" %R%
  one_or_more(char_class(ASCII_ALNUM %R% ".-")) %R%
  DOT %R%
  ascii_alpha(2, 4)

Many of us will have guessed by now that both of these are regular expressions for detecting email addresses in text strings. rebus provides functions such as char_class() and one_or_more() to make building regular expressions easier. So let’s dive in and learn the package.

First of all, let’s load the library. Alongside it, let’s also load stringr so that we can use its function str_view() to understand the examples. Readers may note that rebus also contains a function regex(), which conflicts with stringr::regex(). To avoid that conflict we may use the conflicted package, which lets us declare the preferred package whenever two packages export the same name.

library(rebus)
library(stringr)
library(conflicted)
conflict_prefer("regex", "stringr")
conflicts_prefer(rebus::or)
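
With the packages loaded, one way to convince ourselves that the readable version from the introduction behaves like the hand-written pattern is to try both on a few strings. The following is only a minimal sketch: the object names email_rx, raw_rx and samples, as well as the sample strings, are made up for illustration.

```r
library(rebus)
library(stringr)

# Readable version of the email pattern from the introduction
email_rx <- one_or_more(char_class(ASCII_ALNUM %R% "._%+-")) %R%
  "@" %R%
  one_or_more(char_class(ASCII_ALNUM %R% ".-")) %R%
  DOT %R%
  ascii_alpha(2, 4)

# The hand-written pattern it is meant to replace
raw_rx <- "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,4}"

# Hypothetical inputs
samples <- c("alice@example.com", "not an email", "bob.smith@mail.example.org")

# Both patterns should flag the same elements
rebus_hits <- str_detect(samples, email_rx)
raw_hits   <- str_detect(samples, raw_rx)
rebus_hits
raw_hits
```

Printing email_rx at the console shows the regular expression that rebus has assembled, which is a convenient way to inspect any pattern built in this chapter.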

30.1 Operators for concatenation and alternation

To concatenate two regular expressions we may use either of the operators %R% or %c%, which, like a pipe operator, join the LHS and RHS together into one regular expression.

For alternation we may use either the operator %|% or the function or() from the package; both join the alternatives with a pipe (|), and or() additionally wraps them in a group. Example-1:

my_colors <- c("red", "grey", "blue", "gray")
str_view(my_colors, "gr" %R% or("a" , "e") %R% "y")

## [2] │ <grey>
## [4] │ <gray>

# Note operator precedence
# This actually means either "gra" or "ey"
str_view(my_colors, "gr" %R% "a" %|% "e" %R% "y")

## [2] │ gr<ey>
## [4] │ <gra>y
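
Because %R% and %|% are ordinary R infix operators of the %op% form, they share the same precedence and are evaluated from left to right, which is why the second pattern above splits into "gra" and "ey". As a small sketch, one way to keep the alternation local without or() is to wrap it in group(), a function covered later in this chapter. The variable name grey_or_gray is only illustrative.

```r
library(rebus)
library(stringr)

my_colors <- c("red", "grey", "blue", "gray")

# group() confines the alternation to "a" or "e"
grey_or_gray <- "gr" %R% group("a" %|% "e") %R% "y"

# Only the two spellings of the colour should be detected
hits <- str_detect(my_colors, grey_or_gray)
my_colors[hits]
```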

30.2 Literal Characters, special characters and case sensitivity

Literal characters, as we learnt earlier, may be given in the string as they are. Literals are case sensitive too. So if we want a case-insensitive match, the function case_insensitive() takes care of the requirement.

Example-2:

ex_text <- "This is an example text."
# Literals are case sensitive, so an upper-case X finds no match
str_view(ex_text, "X", match = NA)

## [1] │ This is an example text.

str_view(ex_text, case_insensitive("X"), match = NA)

## [1] │ This is an e<x>ample te<x>t.

To match special characters which may have special meaning in regular expressions, we have constants here. In rebus, the constants are usually available in UPPER CASE and equivalent functions are available in lower case. The following special character constants are available -

BACKSLASH

## <regex> \\

CARET

## <regex> \^

DOLLAR

## <regex> \$

DOT

## <regex> \.

PIPE

## <regex> \|

QUESTION

## <regex> \?

STAR

## <regex> \*

PLUS

## <regex> \+

OPEN_PAREN

## <regex> \(

CLOSE_PAREN

## <regex> \)

OPEN_BRACKET

## <regex> \[

CLOSE_BRACKET

## <regex> \]

OPEN_BRACE

## <regex> \{

Example-3:

str_view(ex_text, DOT)

## [1] │ This is an example text<.>
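
The constants matter most when the text itself contains regex metacharacters. The sketch below uses a made-up string (formula_text) and illustrative pattern names to show that the constants match the literal characters, whereas an unescaped caret is read as an anchor and finds nothing.

```r
library(rebus)
library(stringr)

# Hypothetical text containing parentheses and a caret
formula_text <- "Let f(x) = x^2 + 1."

# Literal "f(x)" and literal "x^2", built from the escaped constants
call_rx  <- "f" %R% OPEN_PAREN %R% "x" %R% CLOSE_PAREN
power_rx <- "x" %R% CARET %R% "2"

found_call  <- str_extract(formula_text, call_rx)
found_power <- str_extract(formula_text, power_rx)

# Without escaping, "^" is an anchor, so this pattern cannot match
unescaped_hit <- str_detect(formula_text, "x^2")

found_call
found_power
unescaped_hit
```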

30.3 Metacharacters

30.3.1 Character classes

To group characters together in a class to match any of them, we may use function char_class() in rebus package. See example-4:

ex_vec <- c("Apple", "Orange", "Myrrh")
# matches a vowel
str_view(ex_vec, char_class("aeiou"))

## [1] │ Appl<e>
## [2] │ Or<a>ng<e>

# matches a vowel irrespective of case
str_view(ex_vec, case_insensitive(char_class("aeiou")))

## [1] │ <A>ppl<e>
## [2] │ <O>r<a>ng<e>

To match a range of characters/numbers we separate these by hyphen in square brackets (in normal regex building). So, char_class("a-d") will match a character from range [abcd].

Example-5:

ex_text <- "The quick brown fox jumps over the lazy dog."
# Match a, b, c or d in lower case
str_view(ex_text, char_class("a-d"))

## [1] │ The qui<c>k <b>rown fox jumps over the l<a>zy <d>og.

For negated character classes we again have an intuitively named function, negated_char_class(), in rebus, which we can use as per our requirement.

Example-6:

ex_text <- "The quick brown fox jumps over the lazy dog."
# Match every character that is not a lower-case vowel
str_view(ex_text, negated_char_class("aeiou"))

## [1] │ <T><h>e< ><q>ui<c><k>< ><b><r>o<w><n>< ><f>o<x>< ><j>u<m><p><s>< >o<v>e<r>< ><t><h>e< ><l>a<z><y>< ><d>o<g><.>

# Note that upper case and dot character have also been matched.

30.3.2 Built-in Character classes

We can also use pre-built character classes available in rebus, as listed below.

ALPHA

## <regex> [:alpha:]

ALNUM

## <regex> [:alnum:]

BLANK

## <regex> [:blank:]

DIGIT

## <regex> [:digit:]

LOWER

## <regex> [:lower:]

PRINT

## <regex> [:print:]

PUNCT

## <regex> [:punct:]

SPACE

## <regex> [:space:]

UPPER

## <regex> [:upper:]

HEX_DIGIT

## <regex> [:xdigit:]

ANY_CHAR

## <regex> .

GRAPHEME

## <regex> \X

NEWLINE

## <regex> \R

DGT

## <regex> \d

WRD

## <regex> \w

SPC

## <regex> \s

NOT_DGT

## <regex> \D

NOT_WRD

## <regex> \W

NOT_SPC  # Equivalent to "\\S"

## <regex> \S

ASCII_DIGIT

## <regex> 0-9

ASCII_LOWER

## <regex> a-z

ASCII_UPPER

## <regex> A-Z

ASCII_ALPHA

## <regex> a-zA-Z

ASCII_ALNUM

## <regex> a-zA-Z0-9

See another example.

Example-7:

ex_text <- "The quick brown fox jumps over the lazy dog."
# Match TAB or SPACE Characters
str_view(ex_text, BLANK)

## [1] │ The< >quick< >brown< >fox< >jumps< >over< >the< >lazy< >dog.

# Match all UPPER CASE characters
str_view(ex_text, UPPER)

## [1] │ <T>he quick brown fox jumps over the lazy dog.

Besides the aforementioned class constants, we have lower-case equivalent functions for these character classes, which allow more flexible regex building.

alnum(lo, hi, char_class = TRUE)
alpha(lo, hi, char_class = TRUE)
blank(lo, hi, char_class = TRUE)
cntrl(lo, hi, char_class = TRUE)
digit(lo, hi, char_class = TRUE)
graph(lo, hi, char_class = TRUE)
lower(lo, hi, char_class = TRUE)
printable(lo, hi, char_class = TRUE)
punct(lo, hi, char_class = TRUE)
space(lo, hi, char_class = TRUE)
upper(lo, hi, char_class = TRUE)
hex_digit(lo, hi, char_class = TRUE)
any_char(lo, hi)
grapheme(lo, hi)
newline(lo, hi)
dgt(lo, hi, char_class = FALSE)
wrd(lo, hi, char_class = FALSE)
spc(lo, hi, char_class = FALSE)
not_dgt(lo, hi, char_class = FALSE)
not_wrd(lo, hi, char_class = FALSE)
not_spc(lo, hi, char_class = FALSE)
ascii_digit(lo, hi, char_class = TRUE)
ascii_lower(lo, hi, char_class = TRUE)
ascii_upper(lo, hi, char_class = TRUE)
ascii_alpha(lo, hi, char_class = TRUE)
ascii_alnum(lo, hi, char_class = TRUE)
char_range(lo, hi, char_class = lo < hi)
number_range(lo, hi, allow_leading_zeroes = FALSE, capture = FALSE)

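In the above functions, lo and hi accept positive integers that act as quantifiers (the minimum and maximum number of repetitions), and the char_class argument accepts a logical value. When only lo is supplied, the class is matched exactly lo times, as the examples below show.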

Example-8:

ip_add <- "My IP address is 255.1.2.50"
str_view(ip_add, digit(1, 3))

## [1] │ My IP address is <255>.<1>.<2>.<50>

str_view(ip_add, digit(3))

## [1] │ My IP address is <255>.1.2.50

str_view(ip_add, not_dgt(5))

## [1] │ <My IP>< addr><ess i>s 255.1.2.50

# Note that this matches nothing, as the text has no two consecutive spaces
str_view(ip_add, space(2))
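
These class functions combine naturally with the special-character constants. As a rough sketch, the whole IP address from Example-8 could be pulled out as shown below. The names octet and ip_rx are illustrative, and the pattern only checks the shape of the address, not that each part lies between 0 and 255.

```r
library(rebus)
library(stringr)

ip_add <- "My IP address is 255.1.2.50"

# One to three digits; this does not check the 0-255 range
octet <- digit(1, 3)

# Four such parts separated by literal dots
ip_rx <- octet %R% DOT %R% octet %R% DOT %R% octet %R% DOT %R% octet

ip_found <- str_extract(ip_add, ip_rx)
ip_found
```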

30.3.3 Word Boundaries

To match a word boundary (or its negation) we have BOUNDARY (and NOT_BOUNDARY) in rebus. The function whole_word(x), on the other hand, wraps the regex in word boundary tokens to match a whole word. See the following example.

Example-9:

ex_text <- "The thermometer, they were searching is placed in a leather box."
# Note three matches
str_view(ex_text, "the")

## [1] │ The <the>rmometer, <the>y were searching is placed in a lea<the>r box.

# There is no match: the only whole word is "The", with a capital T
str_view(ex_text, whole_word("the"), match = NA)

## [1] │ The thermometer, they were searching is placed in a leather box.

str_view(ex_text, case_insensitive(whole_word("the")), match = NA)

## [1] │ <The> thermometer, they were searching is placed in a leather box.

str_view(ex_text, BOUNDARY %R% case_insensitive("the"))

## [1] │ <The> <the>rmometer, <the>y were searching is placed in a leather box.

str_view(ex_text, NOT_BOUNDARY %R% "the")

## [1] │ The thermometer, they were searching is placed in a lea<the>r box.
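
The difference between a substring match and a whole-word match is easy to see by counting matches. The sentence and variable names in this sketch are made up for illustration.

```r
library(rebus)
library(stringr)

# Hypothetical sentence with "cat" both as a word and inside a word
sentence <- "The cat sat near the catalog; a cat!"

# Plain literal: also counts the "cat" inside "catalog"
n_substring <- str_count(sentence, "cat")

# whole_word(): counts only the stand-alone word
n_whole_word <- str_count(sentence, whole_word("cat"))

n_substring   # 3
n_whole_word  # 2
```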

30.4 Quantifiers

We learnt the following quantifiers in regular expressions.

+ 1 or more occurrences
* 0 or more
? 0 or 1
{} specified numbers
- {n} exactly n
- {n,} n or more
- {n,m} between n and m

We have meaningfully named functions for each of the above quantifiers in rebus.

one_or_more(x, char_class = NA)
zero_or_more(x, char_class = NA)
optional(x, char_class = NA)
repeated(x, lo, hi, lazy = FALSE, char_class = NA)
- where lo and hi represent n and m, respectively.

Additionally, we may notice the argument char_class = NA, which accepts a logical value indicating whether x should be wrapped in a character class.

We may match two consecutive vowels using repeated("aeiou", 2, char_class = TRUE). See this example.

Example-10:

ex_vec <- c("1 apple", "2 bananas", "3 pineapples")
# match two consecutive vowels
str_view(ex_vec, repeated("aeiou", 2, char_class = TRUE))

## [3] │ 3 pin<ea>pples

# match a number followed by apple with an optional space
str_view(ex_vec, one_or_more(DIGIT) %R% optional(BLANK) %R% "apple")

## [1] │ <1 apple>
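
Two further quantifier sketches may help here: optional() for a character that may or may not be present, and repeated() with both lo and hi. The word lists and object names are illustrative only.

```r
library(rebus)
library(stringr)

# optional(): British and American spellings, but not a doubled "u"
spellings <- c("color", "colour", "colouur")
colour_rx <- "colo" %R% optional("u") %R% "r"
colour_hits <- str_detect(spellings, colour_rx)
colour_hits

# repeated() with lo = 2 and hi = 3: a run of two to three vowels
vowel_run <- repeated("aeiou", 2, 3, char_class = TRUE)
first_run <- str_extract("beautiful", vowel_run)
first_run
```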

30.5 Anchors

Anchors in regular expressions allow us to match patterns at specific positions within the input string. In rebus, we have constants START and END to match the beginning or end positions within the input text, respectively.

Example-11:

string <- "The quick brown fox jumps over the lazy dog. The fox is brown."
str_view(string, START %R% case_insensitive("the"))

## [1] │ <The> quick brown fox jumps over the lazy dog. The fox is brown.

In the above example, we are matching the word "the" only where it occurs at the very beginning of the string; the second "The", which starts the next sentence, is not matched. There is one more function, exactly(x), in rebus which makes the regular expression match the whole string from start to end; it is in fact a shortcut for START %R% x %R% END.

Example-12:

ex_vec <- c("apple", "banana", "cherry", "pineapple")
str_view(ex_vec, exactly("apple"))

## [1] │ <apple>
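
exactly() is handy for validating input, where a partial match is not good enough. The sketch below assumes a hypothetical rule that a code must consist of exactly six digits; the vector pins and the other names are made up for illustration.

```r
library(rebus)
library(stringr)

# Hypothetical inputs: only the first has exactly six digits
pins <- c("110001", "1100011", "11000a")

# Unanchored: six digits anywhere in the string are enough
loose_hits <- str_detect(pins, digit(6))

# Anchored with exactly(): the whole string must be six digits
strict_hits <- str_detect(pins, exactly(digit(6)))

loose_hits
strict_hits
```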

30.6 Capture Groups

We have seen that a capture group is a way to group a part of a regular expression and capture it as a separate substring. This is useful when we want to extract or replace a specific part of a string. In R, capture groups are denoted by parentheses (). Anything inside the parentheses is captured and can be referenced later in the regular expression or in the replacement string. We also learnt that a captured group can be referred back to within a match using a backreference: \1 refers to the match contained in the first pair of parentheses, \2 to that in the second, and so on.

In rebus, there are two functions, capture(x) and group(x), for grouping a regex. capture(x) creates a capturing group, which is what match and replacement functions refer back to, while group(x) creates a non-capturing group and is mostly used to delimit alternations.

Example-13:

## capture(x)
my_fruits <- c('apple', 'banana', 'coconut', 'berry', 'cucumber', 'date')
str_remove_all(my_fruits, capture(char_class("aeiou")))

## [1] "ppl"   "bnn"   "ccnt"  "brry"  "ccmbr" "dt"

## group()
my_toppings <- group("olive" %|% "mushroom" %|% "tomato")
pizza <- "We have olive, mushroom, chicken and capsicum pizza in our menu."
# Extract my favourite topping from available pizza menu.
str_extract_all(pizza, pattern = my_toppings)

## [[1]]
## [1] "olive"    "mushroom"
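
To see how the two functions differ in practice, we may compare what str_match() returns for each. This is a small sketch with illustrative object names: str_match() returns one column for the full match plus one column per capturing group.

```r
library(rebus)
library(stringr)

# Same alternation, wrapped once with capture() and once with group()
with_capture <- str_match("grey", "gr" %R% capture("a" %|% "e") %R% "y")
with_group   <- str_match("grey", "gr" %R% group("a" %|% "e") %R% "y")

# capture(): full match plus the captured letter
ncol(with_capture)
with_capture[1, 2]

# group(): full match only, nothing is captured
ncol(with_group)
```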

Backreferences, whether used within a pattern or in a replacement string, are denoted by the constants REF1 to REF9.

Example-14:

my_fruits <- c('apple', 'banana', 'coconut', 'berry', 'cucumber', 'date')
# search for a character repeated twice in a row
str_view(my_fruits, capture(ANY_CHAR) %R% REF1 , match = NA)

## [1] │ a<pp>le
## [2] │ banana
## [3] │ coconut
## [4] │ be<rr>y
## [5] │ cucumber
## [6] │ date

As in the previous chapter, we can use capture groups to replace the pattern with something else.

Example-15:

# We have names in last_name, first_name format
names <- c('Hanks, Tom', 'Affleck, Ben', 'Damon, Matt')
str_view(names)

## [1] │ Hanks, Tom
## [2] │ Affleck, Ben
## [3] │ Damon, Matt

# Pattern to capture first name and last name
pat <- capture(whole_word(one_or_more(ALPHA))) %R% "," %R% optional(SPACE) %R% capture(whole_word(one_or_more(ALPHA)))
repl <- REF2 %R% " " %R% REF1
# Using this regex, we can convert these to first_name last_name format
str_replace_all(names, pat, repl)

## [1] "Tom Hanks"   "Ben Affleck" "Matt Damon"
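
The same idea extends to more than two groups. As a further sketch, dates written as year-month-day can be rearranged to day/month/year with three capture groups. The sample dates and the names date_rx and dmy_repl are made up for illustration.

```r
library(rebus)
library(stringr)

# Hypothetical dates in year-month-day order
iso_dates <- c("2024-01-31", "1999-12-05")

# Three capture groups: year, month, day
date_rx <- capture(digit(4)) %R% "-" %R%
  capture(digit(2)) %R% "-" %R%
  capture(digit(2))

# Replacement refers back to the groups in reverse order
dmy_repl <- REF3 %R% "/" %R% REF2 %R% "/" %R% REF1

dmy_dates <- str_replace(iso_dates, date_rx, dmy_repl)
dmy_dates
```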

30.7 Lookarounds

As we had learnt, lookaheads and lookbehinds are zero-width assertions in regex. They are used to match a pattern only if it is followed or preceded by another pattern, respectively. The pattern in the lookahead or lookbehind is not included in the match.

We have intuitively named functions to deal with all four lookarounds:

lookahead(x)
negative_lookahead(x)
lookbehind(x)
negative_lookbehind(x)

Example-16: Find character q not followed by u

countries <- c("mozambique", "qatar", "iraq")
# With lookahead
str_view(countries, "q" %R% negative_lookahead("u"))

## [2] │ <q>atar
## [3] │ ira<q>

# Without lookahead, a character after q is required and consumed - notice the difference
str_view(countries, "q" %R% negated_char_class("u"))

## [2] │ <qa>tar

Example-17:

# Lookahead
string <- c("hello world", "hello there world", "hello")
str_view(string, "hello" %R% lookahead(optional(SPACE) %R% "world"), match = NA)

## [1] │ <hello> world
## [2] │ hello there world
## [3] │ hello

# Note that "world" is not included in the match

# Lookbehind
str_view(string, lookbehind("hello" %R% optional(SPACE)) %R% "world", match = NA)

## [1] │ hello <world>
## [2] │ hello there world
## [3] │ hello

# Negative lookahead
str_view(string, "hello" %R% negative_lookahead(optional(SPACE) %R% "world"), match = NA)

## [1] │ hello world
## [2] │ <hello> there world
## [3] │ <hello>

# Negative lookbehind
str_view(string, negative_lookbehind("hello" %R% optional(SPACE)) %R% "world", match = NA)

## [1] │ hello world
## [2] │ hello there <world>
## [3] │ hello

More examples.

Example-18:

string <- "I have 10 apples, 6 pineapples and 5 bananas"

# lookahead to match count of apples
pattern1 <- one_or_more(DIGIT) %R% lookahead(optional(SPACE) %R% "apple") 
# How many apples?
str_view(string = string, pattern = pattern1, match = NA)

## [1] │ I have <10> apples, 6 pineapples and 5 bananas
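
A lookbehind works the same way in the other direction. In this sketch (the string invoice and the name amount_rx are hypothetical) we extract only the number that is preceded by a dollar sign, without including the sign itself in the match.

```r
library(rebus)
library(stringr)

# Hypothetical text with one priced amount and one plain number
invoice <- "Price: $45, discount 10"

# Digits that are immediately preceded by a literal dollar sign
amount_rx <- lookbehind(DOLLAR) %R% one_or_more(DIGIT)

amounts <- str_extract_all(invoice, amount_rx)[[1]]
amounts
```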

30.8 Some useful regex functions

30.8.1 Matching valid dates

# Individual date-time components
DTSEP             # optional selected punctuation or space

## <regex> [-/.:,\ ]?

CENTURY           # exactly two digits

## <regex> [0-9]{2}

YEAR              # one to four digits

## <regex> [0-9]{1,4}

YEAR2             # exactly two digits

## <regex> [0-9]{2}

YEAR4             # exactly four digits

## <regex> [0-9]{4}

MONTH             # number from 1 to 12, leading zero

## <regex> (?:0[1-9]|1[0-2])

WEEK_OF_YEAR      # number from 0 to 53, leading zero

## <regex> (?:[0-4][0-9]|5[0-3])

DAY               # number from 1 to 31, leading zero

## <regex> (?:0[1-9]|[12][0-9]|3[01])

DAY_SINGLE        # leading space

## <regex> (?: [1-9]|[12][0-9]|3[01])

HOUR24            # 24 hour clock, leading zero

## <regex> (?:[01][0-9]|2[0-3])

HOUR12            # 12 hour clock, leading zero

## <regex> (?:0[1-9]|1[0-2])

HOUR24_SINGLE     # 24 hour clock, leading space

## <regex> (?:[ 1][0-9]|2[0-3])

HOUR12_SINGLE     # 12 hour clock, leading space

## <regex> (?: [1-9]|1[0-2])

MINUTE            # number from 0 to 59, leading zero

## <regex> [0-5][0-9]

SECOND            # number from 0 to 61 (leap seconds), leading zero

## <regex> (?:[0-5][0-9]|6[01])

FRACTIONAL_SECOND # a second, an optional decimal point and up to 6 digits

## <regex> (?:[0-5][0-9]|6[01])(?:[.,][0-9]{1,6})?

AM_PM             # am/AM or pm/PM (all lower or all upper case)

## <regex> (?:am|AM|pm|PM)

TIMEZONE_OFFSET   # optional plus or minus, then four digits

## <regex> [-+]?[0-9]{4}

TIMEZONE          # Any value returned by OlsonNames()

## <regex> (?:Africa/Abidjan|Africa/Accra|Africa/Addis_Ababa|...|W-SU|WET|Zulu)
## (output abridged: the full pattern is one long alternation of every name returned by OlsonNames())

# ISO 8601 formats
ISO_DATE          # %Y-%m-%d

## <regex> [0-9]{4}-(?:0[1-9]|1[0-2])-(?:0[1-9]|[12][0-9]|3[01])

ISO_TIME          # %H:%M:%S

## <regex> (?:[01][0-9]|2[0-3]):[0-5][0-9]:(?:[0-5][0-9]|6[01])

ISO_DATETIME      # ISO_DATE followed by ISO_TIME, separated by space or "T".

## <regex> [0-9]{4}-(?:0[1-9]|1[0-2])-(?:0[1-9]|[12][0-9]|3[01])(?:[ T](?:[01][0-9]|2[0-3]):[0-5][0-9]:(?:[0-5][0-9]|6[01]))?

# Compound forms, separated by DTSEP
YMD

## <regex> [0-9]{1,4}[-/.:,\ ]?(?:0[1-9]|1[0-2])[-/.:,\ ]?(?:0[1-9]|[12][0-9]|3[01])

YDM

## <regex> [0-9]{1,4}[-/.:,\ ]?(?:0[1-9]|[12][0-9]|3[01])[-/.:,\ ]?(?:0[1-9]|1[0-2])

MYD

## <regex> (?:0[1-9]|1[0-2])[-/.:,\ ]?[0-9]{1,4}[-/.:,\ ]?(?:0[1-9]|[12][0-9]|3[01])

MDY

## <regex> (?:0[1-9]|1[0-2])[-/.:,\ ]?(?:0[1-9]|[12][0-9]|3[01])[-/.:,\ ]?[0-9]{1,4}

DYM

## <regex> (?:0[1-9]|[12][0-9]|3[01])[-/.:,\ ]?[0-9]{1,4}[-/.:,\ ]?(?:0[1-9]|1[0-2])

DMY

## <regex> (?:0[1-9]|[12][0-9]|3[01])[-/.:,\ ]?(?:0[1-9]|1[0-2])[-/.:,\ ]?[0-9]{1,4}

HMS

## <regex> (?:[01][0-9]|2[0-3])[-/.:,\ ]?[0-5][0-9][-/.:,\ ]?(?:[0-5][0-9]|6[01])

HM

## <regex> (?:[01][0-9]|2[0-3])[-/.:,\ ]?[0-5][0-9]

MS

## <regex> [0-5][0-9][-/.:,\ ]?(?:[0-5][0-9]|6[01])

Example-19:

# We have some dates - both valid and invalid
some_dates <- c("2000-13-01", "2025-08-09", "2000-01-32", "2000-00-01", "2000-01-00", "2020-05-20")

str_view(some_dates, ISO_DATE)

## [2] │ <2025-08-09>
## [6] │ <2020-05-20>

# Similarly some time formats
some_times <- c("24:00:00", "23:60:59", "23:59:62", "23 59 59", "23:55:55", "00:00:00")
str_view(some_times, ISO_TIME)

## [5] │ <23:55:55>
## [6] │ <00:00:00>
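
One caveat follows directly from the pattern printed for ISO_DATE above: it checks the month and the day separately, so a date such as 31 February still passes. A possible way to guard against this, sketched below with made-up inputs and object names, is to follow the regex check with a base R as.Date() conversion, which returns NA for dates that do not exist.

```r
library(rebus)
library(stringr)

# Hypothetical candidates: a real date, an impossible date, a bad month
candidates <- c("2023-02-28", "2023-02-31", "2023-13-01")

# Shape check only: month 01-12 and day 01-31, tested independently
looks_like_date <- str_detect(candidates, exactly(ISO_DATE))

# Calendar check: as.Date() gives NA when the date does not exist
is_real_date <- !is.na(as.Date(candidates, format = "%Y-%m-%d"))

looks_like_date
is_real_date
```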

30.9 Roman numerals

To match Roman numerals we have a constant ROMAN as well as a function roman(lo, hi) in rebus.

Example-20:

# Some Roman numerals, both valid and invalid
some_numbers <- c("MMMDCCCXLVIII", "MMMCMDCCCXLVIIV", "MCD", "XIL", "LIX", "XL")
# Find valid Roman numerals
str_view(some_numbers, exactly(roman()))

## [1] │ <MMMDCCCXLVIII>
## [3] │ <MCD>
## [5] │ <LIX>
## [6] │ <XL>
