35 Finding string similarity

Comparison of two (or more) numeric fields is an easy job in the sense that we can use multiple statistical methods available to measure comparison between these. On the other hand, comparing strings in any way, shape or form is not a trivial task. Despite this complexity, comparing text strings is a common and fundamental task in many text-processing algorithms. Basic objective of all string similarity algorithms are to quantify the similarity between two text strings in terms of string metrics.

The fuzzy matching problems are to input two strings and return a score quantifying the likelihood that they are expressions of the same entity. So (Geeta and Gita) should get a high score but not (Apple and Microsoft). Over several decades, various algorithms for fuzzy string matching have emerged. They have varying strengths and weaknesses. These fall into two broad categories: lexical matching and phonetic matching.

35.1 Lexical matching

Lexical matching algorithms match two strings based on some model of errors. Typically they are meant to match strings that differ due to spelling or typing errors. Consider Atharv and ahtarv. A lexical matching algorithm would pick up that ht is a transposition of th. Such transposition errors are common. Given this, and that the rest of the two strings match exactly and are long enough, we should score this match as high.

Normally, algorithms to find lexical matching, can be classified into ‘edit distance based’ or ‘token based’.

35.1.1 Levenshtein algorithm

It is named after Vladimir Levenshtein, who considered this distance in 1965. The Levenshtein distance between two words is the minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one word into the other. Levenshtein distance may also be referred to as edit distance, although it may also denote a larger family of distance metrics. It is closely related to pairwise string alignments.

For the two words helo and hello, it is obvious that there is a missing character "l". Thus to transform the word helo to hello all we need to do is insert that character. The distance, in this case, is 1 because there is only one edit needed.

35.1.2 Hamming distance

This distance is computed by overlaying one string over another and finding the places where the strings vary. Note, classical implementation was meant to handle strings of same length. Some implementations may bypass this by adding a padding at prefix or suffix. Nevertheless, the logic is to find the total number of places one string is different from the other.

35.1.3 Jaro-Winkler

This algorithms gives high scores to two strings if,

they contain same characters, but within a certain distance from one another, and
the order of the matching characters is same.

To be exact, the distance of finding similar character is 1 less than half of length of longest string. So if longest strings has length of 5, a character at the start of the string 1 must be found before or on ((5/2)–1) ~ 2nd position in the string 2 to be considered valid match. Because of this, the algorithm is directional and gives high score if matching is from the beginning of the strings.

35.1.4 Q-Gram

Q-Grams is based on the difference between occurences of Q consecutive characters in two strings. To illustrate take a case of Q=3 (this special case is also called trigrams). For atharv and its possible typo ahtarv the trigrams will be

For atharv {ath tha har arv}
for ahtarv {aht hta tar arv}

We can see that a total of 7 unique trigrams have been formed and out of these only 1 is similar. Thus, 3-gram similarility would be 1/7=14%. We can see that this algorithm is not very effective for transpositions.

35.2 Phonetic matching

Phonetic matching algorithms match strings based on how similar they sound. Consider Geeta and Gita. They sound similar enough that one person might spell as Geetha or Geeta, another as Gita. As in this case, one is not necessarily a misspelling of the other. just sounds similar.

35.2.1 Soundex

Created by Robert Russel and Margaret King Odell in 1918, this algorithm intended to match names and surnames based on the basic rules of English pronunciation, hence, similar names get the same value.

35.2.2 Metaphone

Developed by Lawrence Philips in 1990, the Metaphone is also more accurate compared with the Soundex method as it takes into consideration the groups of letters. The disadvantage shows up when we apply it to reconcile the strings that are not in English, as it is based on the rules of English pronunciation.

35.2.3 Double Metaphone

Following Metaphone, Philips also designed the Double Metaphone. As its name suggests, it returns two codes, so you have more chances to match the items, however, at the same time, it means a higher probability of an error. According to the algorithm, there are three matching levels:

primary key to the primary key = strongest match,
secondary key to the primary key = normal match,
secondary key against the secondary key = weakest match.

35.2.4 Metaphone 3

Philips further refined the double metaphone algorithm to produce better results. The algorithm (Metaphone 3) is however, proprietary and is not open-source.

35.3 Examples

In R, we can use stringdist package to calculate many of the above mentioned distances. The function is vectorised. The synatx is

stringdist(
  a,
  b,
  method = c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw",
    "soundex"),
  useBytes = FALSE,
  weight = c(d = 1, i = 1, s = 1, t = 1),
  q = 1,
  p = 0,
  bt = 0,
  nthread = getOption("sd_num_thread")
)

where -

a and b are two strings/vectors for which similarity/distance is to be measured.
method to be used. Default is
- osa for Optimal String Alignment. Other methods are-
- lv for Levenstein distance,
- dl for Damerau-Levenshtein
- hamming for Hamming distance
- lcs for Longest Common Substring
- qgram for Q-Grams
- cosine for cosine
- jaccard for Jaccard’s algorithm
- jw for Jaro-Winkler
- soundex for Soundex
Other arguments are needed on the basis of algorithm chosen.

To calculate ‘metaphone’ index we can use phonics package and for ‘Double Metaphone’ we can use PGRdup package in R.

Example - Suppose we have a set of two names.

nameset1 <- c('Geeta', 'Susheel', 'Ram', 'Dr. Suchitra')
nameset2 <- c('Gita', 'Sushil', 'Rama', 'Suchitra')

Note most of these distances/similarity indices are cases sensitive, and therefore we have to use these methods with a bit cleaning first. We can convert cases of all strings to lower-case to eliminate these (if) unwanted errors.

library(stringdist)
suppressPackageStartupMessages(library(dplyr))

data.frame(
  nameset1 = tolower(nameset1),
  nameset2 = tolower(nameset2)
) %>% 
  mutate(lv_dist = stringdist(nameset1, nameset2, method = 'lv'),
         jw_dist = stringdist(nameset1, nameset2, method = 'jw'),
         qgram_3 = stringdist(nameset1, nameset2, method = 'qgram', q=3))

##       nameset1 nameset2 lv_dist    jw_dist qgram_3
## 1        geeta     gita       2 0.21666667       5
## 2      susheel   sushil       2 0.15079365       5
## 3          ram     rama       1 0.08333333       1
## 4 dr. suchitra suchitra       4 0.25694444       4

Creating Metaphone and Double Metaphone

library(phonics)

data.frame(
  nameset1 = tolower(nameset1),
  nameset2 = tolower(nameset2)
) %>% 
  mutate(metaphone_1 = metaphone(nameset1),
         metaphone_2 = metaphone(nameset2))

##       nameset1 nameset2 metaphone_1 metaphone_2
## 1        geeta     gita          JT          JT
## 2      susheel   sushil         SXL         SXL
## 3          ram     rama          RM          RM
## 4 dr. suchitra suchitra        <NA>        SXTR

Note that we cannot calculate metaphone for special characters even for spaces.

Double metaphone is not vectorised. So we have to use apply family of functions here.

suppressPackageStartupMessages(library(PGRdup))
library(purrr)

data.frame(
  nameset1 = tolower(nameset1),
  nameset2 = tolower(nameset2)
) %>% 
  mutate(DMP_1_1 = map_chr(nameset1, ~DoubleMetaphone(.x)[[1]]),
         DMP_1_2 = map_chr(nameset1, ~DoubleMetaphone(.x)[[2]]),
         DMP_2_1 = map_chr(nameset2, ~DoubleMetaphone(.x)[[1]]),
         DMP_2_2 = map_chr(nameset2, ~DoubleMetaphone(.x)[[2]]))

##       nameset1 nameset2 DMP_1_1 DMP_1_2 DMP_2_1 DMP_2_2
## 1        geeta     gita      JT      KT      JT      KT
## 2      susheel   sushil     SXL     SXL     SXL     SXL
## 3          ram     rama      RM      RM      RM      RM
## 4 dr. suchitra suchitra    TRSX    TRSK    SXTR    SKTR

34 Visualising Text analytics through Wordcloud, etc.

Part-VIII: Geo computation in R