%pip install levenshtein
Collecting levenshtein
  Downloading Levenshtein-0.22.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (172 kB)
Collecting rapidfuzz<4.0.0,>=2.3.0 (from levenshtein)
  Downloading rapidfuzz-3.3.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.1 MB)
Installing collected packages: rapidfuzz, levenshtein
Successfully installed levenshtein-0.22.0 rapidfuzz-3.3.1

Note: you may need to restart the kernel to use updated packages.

Sequence-based similarity#

Consider search engines that have indexes of text content. Entering a misspelled search term will match misspellings in the database, but modern search engines often suggest similar queries with a larger number of hits in the database. How do they do it? Search engines (like elasticsearch) generate a wide range of possible misspellings of the query and present search results (or spelling suggestions) informed by queries for a comprehensive set of deletion-, substitution-, insertion-, and transposition-modified words.
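To make the idea concrete, here is a sketch of generating every string one edit away from a query word. The helper name edits1 is ours, not any particular search engine's API; it simply enumerates deletions, transpositions, substitutions, and insertions over the lowercase alphabet.

import string

def edits1(word):
    # All strings at edit distance one from `word`: deletions, transpositions,
    # substitutions, and insertions over the lowercase ASCII alphabet.
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    substitutions = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + substitutions + inserts)

len(edits1("word"))   # a few hundred candidate queries for a four-letter word

Each candidate can then be looked up in the index, and the variants with many hits become suggestions.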

Edit distance (also known as Levenshtein distance) is another approach to fuzzy matching, inspired by the process of changing one input into another. It is more sensitive than Jaccard similarity scoring but requires more computation (that is, it takes more time to run when other factors are held constant).

Hamming distance#

If all of the strings of interest are the same length, say, five-letter telegraph abbreviations, it is simple to count the number of differences letter by letter: ENBET differs in one position from ERBET and differs in two positions from TRBET.

This counting-of-differences distance metric goes by the name "Hamming distance", but it is only suited to vocabularies of aligned codewords that are all the same length.

Hamming distance would make sense, for instance, for comparing dates in ISO-8601 format represented as strings, like "2020-03-15" and "2023-09-21". A difference in a single character gives Hamming distance one; a transposition of two adjacent digits gives Hamming distance two.
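As a minimal sketch (the function name hamming is ours, not from a library), counting positional differences is a single pass over two equal-length strings:

def hamming(a, b):
    # Count the positions at which two equal-length strings differ.
    if len(a) != len(b):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(x != y for x, y in zip(a, b))

hamming("ENBET", "ERBET"), hamming("ENBET", "TRBET")
(1, 2)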

Levenshtein (edit) distance#

If you think about the process of editing text, you can modify text by adding text, deleting text, or, if you are so inclined, by deleting and adding in about the same place. By counting the number of steps needed to transform string1 into string2 using these three operations (deletion, insertion, and substitution), we can define a number that describes how similar the two strings are. There are algorithms that will find the minimum number of edits needed to transform one string into another (similar) string, and this minimum-edit score has many uses in data munging.

The Levenshtein distance counts the differences between two strings by describing the minimum number of steps needed to transform one string into the other, where the permitted steps are

  1. substitution (replace a character with a different one),

  2. insertion (add a character), and

  3. deletion (remove a character).

This is the number of single-character edits needed to transform one string into another.

Levenshtein distance is fairly easy to illustrate by providing an example alignment:

The first string can be changed into the second with a three-character deletion (n't), a three-character substitution (she for you), and eight characters of insertions (w and e to go).

The edit distance between the strings is the sum of the number of changes.
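The usual way to find this minimum is dynamic programming (the Wagner–Fischer algorithm). The sketch below is a simplified implementation of our own, not the code the Levenshtein package uses internally, but it computes the same quantity by filling a table of prefix-to-prefix distances:

def edit_distance(a, b):
    # dp[i][j] holds the edit distance between a[:i] and b[:j].
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                                # delete all of a[:i]
    for j in range(n + 1):
        dp[0][j] = j                                # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution (or match)
    return dp[m][n]

edit_distance("kitten", "sitting")
3

Filling the table touches every pair of prefix lengths, which is why edit distance costs roughly the product of the two string lengths.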

Search engines which index text can sometimes give reasonable results for inexact search queries, but not without some engineering. Entering a misspelled search term will match misspellings in the database and will fail to find exact matches on the correct spelling. As described at the start of this section, engines like elasticsearch handle this by generating a comprehensive set of deletion-, substitution-, insertion-, and transposition-modified variants of the query and using them to inform the search results (or spelling suggestions). Yes, you have to check the search index hundreds of times, but if a search term one or two letters away has hundreds of times the hits, you would do well to suggest it.

import Levenshtein
from Levenshtein import distance

The Levenshtein.distance function finds the minimum number of steps. For two completely different words, all the letters must be replaced:

distance("one", "six")
3

While for a word that is contained within another, the missing letters must all be added:

distance("seven", "seventeen")
4
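Recent versions of the Levenshtein package also provide an editops function that reports which operations an optimal alignment uses, as a list of (operation, source position, destination position) tuples (the exact output format can vary between versions):

from Levenshtein import editops

# Four insertions turn "seven" into "seventeen".
editops("seven", "seventeen")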

Distances between number words#

Here we calculate the Levenshtein distance between the words for the first twenty counting numbers in English:

numbers2 = ["one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten",
           "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen", "seventeen",
           "eighteen", "nineteen", "twenty"]
d = np.zeros((20,20))
# Calculate distance between all pairs of strings in numbers2
for i in range(len(numbers2)):
    for j in range(len(numbers2)):
        d[i][j] = distance( numbers2[i], numbers2[j])
plt.imshow(d)
plt.colorbar()
plt.title("Levenshtein distance between number words")
plt.yticks(np.arange(len(numbers2)), numbers2, fontsize=6);
plt.xticks(np.arange(len(numbers2)), numbers2, fontsize=6, rotation=90);
[Figure: heatmap of Levenshtein distances between number words]

This looks different from the set-based number-word similarity graph we constructed earlier. First, we are measuring distance, not similarity, so the diagonal has distance 0 instead of similarity 1. Second, the distances are integers whose range depends on the lengths of the strings being compared (the distance can be as large as the length of the longer string), rather than values between 0 and 1.

Since this depends on a property of the strings that is not interesting, we can normalize: below we divide each distance by the sum of the two string lengths (a convenient upper bound) and subtract the result from 1 to get a similarity score between 0 and 1.

Edit distance is, in general, more expensive to compute than Jaccard similarity or Hamming distance: there are three possible operations to consider at each position, and the standard dynamic-programming algorithm takes time proportional to the product of the two string lengths.

d2 = np.zeros((20,20))
# Calculate distance between all pairs of strings in numbers2
for i in range(len(numbers2)):
    for j in range(len(numbers2)):
        d2[i][j] = distance( numbers2[i], numbers2[j]) / (len(numbers2[i]) + len(numbers2[j])) 
plt.imshow(1-d2)
plt.colorbar()
plt.title("Levenshtein similarity between number words")
plt.yticks(np.arange(len(numbers2)), numbers2, fontsize=6);
plt.xticks(np.arange(len(numbers2)), numbers2, fontsize=6, rotation=90);
[Figure: heatmap of normalized Levenshtein similarity between number words]

Spelling correction#

One straightforward application of fuzzy matching is spelling correction; after checking each token in a document to see if it is in the dictionary, the words which are not in the dictionary are presented to the user along with suggestions of dictionary words that are similar to the unmatched word.

The fuzzy matching can either start from the misspelled word and generate all possible misspellings out to a certain edit distance, or it can start from the words in the dictionary and find those with the fewest needed edits. Think for a minute about which of these might be cheaper (in terms of the number of comparisons required).
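To check your reasoning, here is a back-of-the-envelope count, assuming a 26-letter alphabet and the roughly 355,000-word dictionary loaded below:

# Rough count of edit-distance-1 variants of a word of length n,
# using the three Levenshtein operations over a 26-letter alphabet:
n = 4                                     # e.g. the misspelling "wprd"
deletions = n
substitutions = 25 * n                    # replace each character with a different letter
insertions = 26 * (n + 1)
variants_at_distance_1 = deletions + substitutions + insertions
dictionary_size = 354_984                 # one distance computation per dictionary word
variants_at_distance_1, dictionary_size
(234, 354984)

Even going out to distance two (roughly the square of the first count) can be cheaper than computing the distance to every word in a large dictionary, especially since each generated variant needs only a dictionary lookup rather than a full distance computation.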

import pandas as pd

words = pd.read_csv("../../data/SINGLE.txt", header=None)
words
0
0 &c
1 'd
2 'em
3 'll
4 'm
... ...
354979 zymurgy
354980 zythem
354981 zythum
354982 zyzzyva
354983 zyzzyvas

354984 rows × 1 columns

# Start with a special-purpose function that only finds the distance
# to one word of interest:
word = "wprd"   # This is a misspelling
def dist_to_word(w):
    return distance(w, word)
dist_to_word("worm"), dist_to_word("wierd"), dist_to_word("weird")
(2, 2, 2)
# Check that d(word, word) == 0
dist_to_word(word) == 0
True

Check that we can find all the distances in one step:

# Note: %xmode Minimal shortens the error message
%xmode Minimal
Exception reporting mode: Minimal
words[0].apply(dist_to_word)
TypeError: object of type 'float' has no len()

Well, that didn’t work. It seems some of the words in the dictionary aren’t encoded as strings. Let us find the rows that are broken:

def test_str(s):
    return type(s) == str
words[0].apply(test_str)
0         True
1         True
2         True
3         True
4         True
          ... 
354979    True
354980    True
354981    True
354982    True
354983    True
Name: 0, Length: 354984, dtype: bool
badindexes = np.where(words[0].apply(test_str)==False)
badindexes
(array([186010, 186271, 198557]),)

Three rows have non-string values in them.

words.loc[badindexes]
0
186010 NaN
186271 NaN
198557 NaN

The non-string rows are all encoded as NaN (Not-a-Number, a special floating-point value that acts like a number). Examining the dataframe does not tell us about the source of the problem. Let’s try specifying a data type to read_csv and hope for the best:

words = pd.read_csv("../../data/SINGLE.txt", header=None, dtype="str")
words.loc[badindexes]
0
186010 NaN
186271 NaN
198557 NaN

That didn’t work either; the bad values are still bad. The documentation for read_csv (https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) describes a parameter called na_filter that can be turned off.

words = pd.read_csv("../../data/SINGLE.txt", header=None, na_filter=False)

And now we test for bad values:

np.where(words[0].apply(test_str)==False)
(array([], dtype=int64),)

We find none. What were those bad values, now?

words.loc[badindexes]
0
186010 n/a
186271 nan
198557 null

Huh. Worth remembering for the future: by default, read_csv turns certain strings (such as "n/a", "nan", and "null") into number-like NaN values.
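A minimal demonstration of this default behaviour, using a small in-memory CSV instead of the dictionary file:

import io

sample = "nan\nnull\nn/a\nzebra\n"
# With the default na_filter=True, the first three strings become NaN:
pd.read_csv(io.StringIO(sample), header=None)[0].tolist()
[nan, nan, nan, 'zebra']
# With na_filter=False they stay as literal strings:
pd.read_csv(io.StringIO(sample), header=None, na_filter=False)[0].tolist()
['nan', 'null', 'n/a', 'zebra']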

Now let us find the distances between our word and all the other words:

# Wrap dist_to_word in a function so we can compute the distance from a
# given word to every row in the dictionary:
def distance_to_all(word):
    def dist_to_word(w):
        return distance(w, word)
    distances = words[0].apply(dist_to_word)
    return distances
distance_to_all("wprd")
0         4
1         3
2         4
3         4
4         4
         ..
354979    6
354980    6
354981    6
354982    7
354983    8
Name: 0, Length: 354984, dtype: int64

Let’s make a new dataframe with the words and the distances to “wprd”:

wprd_dist = words.copy()
wprd_dist["distance"] = distance_to_all("wprd")
wprd_dist
0 distance
0 &c 4
1 'd 3
2 'em 4
3 'll 4
4 'm 4
... ... ...
354979 zymurgy 6
354980 zythem 6
354981 zythum 6
354982 zyzzyva 7
354983 zyzzyvas 8

354984 rows × 2 columns

Let us examine the histogram of distances; every word in the dictionary is counted in it.

wprd_dist.distance.value_counts()
distance
8     53564
7     50487
9     47505
6     40228
10    38703
11    29049
5     24864
12    21142
13    14115
4     13233
14     8953
15     5167
16     2858
3      2296
17     1419
18      668
19      335
2       147
20      133
21       72
22       20
23       11
24        4
25        3
1         3
27        2
28        1
29        1
26        1
Name: count, dtype: int64

This ordering of the histogram (by count) is not ideal for understanding; let us sort by distance instead:

wprd_dist.distance.value_counts().sort_index()
distance
1         3
2       147
3      2296
4     13233
5     24864
6     40228
7     50487
8     53564
9     47505
10    38703
11    29049
12    21142
13    14115
14     8953
15     5167
16     2858
17     1419
18      668
19      335
20      133
21       72
22       20
23       11
24        4
25        3
26        1
27        2
28        1
29        1
Name: count, dtype: int64

From the table, which gives the values of edit distance as the index and the number of dictionary words at each distance from our target word as values, we see there are three dictionary words at distance 1, and 147 words at edit distance 2.

wprd_dist.query("distance < 2")
0 distance
346643 ward 1
350396 wird 1
351455 word 1
wprd_dist.query("distance == 2")
0 distance
38 3rd 2
15624 aped 2
16021 apod 2
16436 appd 2
16905 apr 2
... ... ...
352228 wur 2
352283 wynd 2
352930 yard 2
353296 yerd 2
353408 yird 2

147 rows × 2 columns

This certainly gives us a starting point (or at least a partly ordered list) of plausible words.
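Putting the pieces together, a small helper of our own (not a library function; it reuses the words dataframe and the distance function from above) returns the k closest dictionary words as candidate corrections:

def suggest(misspelling, k=5):
    # Distance from the misspelling to every dictionary word;
    # the k smallest distances give the suggestions.
    d = words[0].apply(lambda w: distance(w, misspelling))
    return words[0].loc[d.nsmallest(k).index]

suggest("wprd")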

Diffs#

An alignment-based representation of the changes to a document is often an effective way of seeing what has changed; algorithms that compare two documents, find a minimum set of edits, and display those edits with highlighting are helpful when examining changes to documents or code.
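Python's standard-library difflib module provides this kind of alignment-based comparison; a minimal example comparing two small documents line by line:

import difflib

before = ["Edit distance counts insertions,", "deletions and substitutions."]
after = ["Edit distance counts insertions,", "deletions, and substitutions", "between two strings."]
# Prints a unified diff, marking removed lines with '-' and added lines with '+'.
print("\n".join(difflib.unified_diff(before, after, fromfile="before", tofile="after", lineterm="")))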