35 Finding string similarity
Comparison of two (or more) numeric fields is an easy job in the sense that we can use multiple statistical methods available to measure comparison between these. On the other hand, comparing strings in any way, shape or form is not a trivial task. Despite this complexity, comparing text strings is a common and fundamental task in many text-processing algorithms. Basic objective of all string similarity algorithms are to quantify the similarity between two text strings in terms of string metrics.
The fuzzy matching problems are to input two strings and return a score quantifying the likelihood that they are expressions of the same entity. So (Geeta
and Gita
) should get a high score but not (Apple
and Microsoft
). Over several decades, various algorithms for fuzzy string matching have emerged. They have varying strengths and weaknesses. These fall into two broad categories: lexical matching
and phonetic matching
.
35.1 Lexical matching
Lexical matching algorithms match two strings based on some model of errors. Typically they are meant to match strings that differ due to spelling or typing errors. Consider Atharv
and ahtarv
. A lexical matching algorithm would pick up that ht
is a transposition of th
. Such transposition errors are common. Given this, and that the rest of the two strings match exactly and are long enough, we should score this match as high.
Normally, algorithms to find lexical matching, can be classified into ‘edit distance based’ or ‘token based’.
35.1.1 Levenshtein algorithm
It is named after Vladimir Levenshtein, who considered this distance in 1965. The Levenshtein distance
between two words is the minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one word into the other. Levenshtein distance may also be referred to as edit distance, although it may also denote a larger family of distance metrics. It is closely related to pairwise string alignments.
For the two words helo
and hello
, it is obvious that there is a missing character "l"
. Thus to transform the word helo
to hello
all we need to do is insert that character. The distance, in this case, is 1
because there is only one edit needed.
35.1.2 Hamming distance
This distance is computed by overlaying one string over another and finding the places where the strings vary. Note, classical implementation was meant to handle strings of same length. Some implementations may bypass this by adding a padding at prefix or suffix. Nevertheless, the logic is to find the total number of places one string is different from the other.
35.1.3 Jaro-Winkler
This algorithms gives high scores to two strings if,
- they contain same characters, but within a certain distance from one another, and
- the order of the matching characters is same.
To be exact, the distance of finding similar character is 1 less than half of length of longest string. So if longest strings has length of 5, a character at the start of the string 1 must be found before or on ((5/2)–1) ~ 2nd position in the string 2 to be considered valid match. Because of this, the algorithm is directional and gives high score if matching is from the beginning of the strings.
35.1.4 Q-Gram
Q-Grams is based on the difference between occurences of Q
consecutive characters in two strings. To illustrate take a case of Q=3 (this special case is also called trigrams). For atharv
and its possible typo ahtarv
the trigrams will be
- For atharv
{ath tha har arv}
- for ahtarv
{aht hta tar arv}
We can see that a total of 7
unique trigrams have been formed and out of these only 1
is similar. Thus, 3-gram similarility would be 1/7=14%
. We can see that this algorithm is not very effective for transpositions.
35.2 Phonetic matching
Phonetic matching algorithms match strings based on how similar they sound. Consider Geeta
and Gita.
They sound similar enough that one person might spell as Geetha
or Geeta
, another as Gita.
As in this case, one is not necessarily a misspelling of the other. just sounds similar.
35.2.1 Soundex
Created by Robert Russel and Margaret King Odell in 1918, this algorithm intended to match names and surnames based on the basic rules of English pronunciation, hence, similar names get the same value.
35.2.2 Metaphone
Developed by Lawrence Philips in 1990, the Metaphone is also more accurate compared with the Soundex
method as it takes into consideration the groups of letters. The disadvantage shows up when we apply it to reconcile the strings that are not in English, as it is based on the rules of English pronunciation.
35.2.3 Double Metaphone
Following Metaphone
, Philips also designed the Double Metaphone. As its name suggests, it returns two codes, so you have more chances to match the items, however, at the same time, it means a higher probability of an error. According to the algorithm, there are three matching levels:
-
primary key to the primary key = strongest match
, -
secondary key to the primary key = normal match
, -
secondary key against the secondary key = weakest match
.
35.3 Examples
In R, we can use stringdist
package to calculate many of the above mentioned distances. The function is vectorised. The synatx is
stringdist(
a,
b,
method = c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw",
"soundex"),
useBytes = FALSE,
weight = c(d = 1, i = 1, s = 1, t = 1),
q = 1,
p = 0,
bt = 0,
nthread = getOption("sd_num_thread")
)
where -
-
a
andb
are two strings/vectors for which similarity/distance is to be measured. -
method
to be used. Default is-
osa
for Optimal String Alignment. Other methods are- -
lv
for Levenstein distance, -
dl
for Damerau-Levenshtein -
hamming
for Hamming distance -
lcs
for Longest Common Substring -
qgram
for Q-Grams -
cosine
for cosine -
jaccard
for Jaccard’s algorithm -
jw
for Jaro-Winkler -
soundex
for Soundex
-
- Other arguments are needed on the basis of algorithm chosen.
To calculate ‘metaphone’ index we can use phonics
package and for ‘Double Metaphone’ we can use PGRdup
package in R.
Example - Suppose we have a set of two names.
nameset1 <- c('Geeta', 'Susheel', 'Ram', 'Dr. Suchitra')
nameset2 <- c('Gita', 'Sushil', 'Rama', 'Suchitra')
Note most of these distances/similarity indices are cases sensitive, and therefore we have to use these methods with a bit cleaning first. We can convert cases of all strings to lower-case to eliminate these (if) unwanted errors.
library(stringdist)
suppressPackageStartupMessages(library(dplyr))
data.frame(
nameset1 = tolower(nameset1),
nameset2 = tolower(nameset2)
) %>%
mutate(lv_dist = stringdist(nameset1, nameset2, method = 'lv'),
jw_dist = stringdist(nameset1, nameset2, method = 'jw'),
qgram_3 = stringdist(nameset1, nameset2, method = 'qgram', q=3))
## nameset1 nameset2 lv_dist jw_dist qgram_3
## 1 geeta gita 2 0.21666667 5
## 2 susheel sushil 2 0.15079365 5
## 3 ram rama 1 0.08333333 1
## 4 dr. suchitra suchitra 4 0.25694444 4
Creating Metaphone and Double Metaphone
library(phonics)
data.frame(
nameset1 = tolower(nameset1),
nameset2 = tolower(nameset2)
) %>%
mutate(metaphone_1 = metaphone(nameset1),
metaphone_2 = metaphone(nameset2))
## nameset1 nameset2 metaphone_1 metaphone_2
## 1 geeta gita JT JT
## 2 susheel sushil SXL SXL
## 3 ram rama RM RM
## 4 dr. suchitra suchitra <NA> SXTR
Note that we cannot calculate metaphone
for special characters even for spaces.
Double metaphone is not vectorised. So we have to use apply
family of functions here.
suppressPackageStartupMessages(library(PGRdup))
library(purrr)
data.frame(
nameset1 = tolower(nameset1),
nameset2 = tolower(nameset2)
) %>%
mutate(DMP_1_1 = map_chr(nameset1, ~DoubleMetaphone(.x)[[1]]),
DMP_1_2 = map_chr(nameset1, ~DoubleMetaphone(.x)[[2]]),
DMP_2_1 = map_chr(nameset2, ~DoubleMetaphone(.x)[[1]]),
DMP_2_2 = map_chr(nameset2, ~DoubleMetaphone(.x)[[2]]))
## nameset1 nameset2 DMP_1_1 DMP_1_2 DMP_2_1 DMP_2_2
## 1 geeta gita JT KT JT KT
## 2 susheel sushil SXL SXL SXL SXL
## 3 ram rama RM RM RM RM
## 4 dr. suchitra suchitra TRSX TRSK SXTR SKTR