Fuzzy matching#

Will Trimble

Data analysis often consists of separating some valued property of data called “signal” from similar properties of the data that are not invested with value, called “noise.” Processes that make different items in the real world have different representations in datasets abound, and there are many cases where the ability to match similar but not exactly identical symbols will allow better answers than if our procedure required exact matches. Manual data entry, corrupt data, missing values, and symbols which cannot be unambiguously resolved can all stand in the way of answering data questions. This section describes some algorithmic approaches to detecting when two strings or database rows are “close” to each other.

Exact matching is easy and fast for computers, but finding inexact matches requires some more effort.

A typical fuzzy-matching example is the task of locating a customer in a customer management database.

A customer calls to inquire about an order, but, as occasionally happens, some of the data in the database (confirmation number, customer’s name, address) has been entered incorrectly. If the system cannot find an exact match, there should be a procedure to present possible imperfect matches to the customer service rep in hopes of finding the right one. It would be nice if the imperfect matches can be presented in an order of decreasing probability-of-being-correct, to minimize the number of records the agent needs to check.

Some mixture of Levenshtein distance, Hamming distance, and phonetic distances on different fields can be used to make a tool that more efficiently locates slightly-corrupt records. Matching records that are incomplete or have been corrupted is imperfect and can make both the mistakes of failing to match things which should be matched and matching things which should not be matched; these errors are the price of more complete joining of data.

Another application of fuzzy matching is search. Consider a database of documents, perhaps newspaper articles or legal filings. A search engine looking for mentions of a word or phrase will fail to find misspellings in the data if it uses exact matching, and will fail to find relevant documents if the search term is misspelled. Finding most of the instances of the phrase you are searching for may be suitable for some tasks, and may ruin others.

In biological sequence analysis, the molecular clock hypothesis allows sequence similarity to be interpreted as evolutionary relatedness; doing anything useful with this requires a set of procedures for aligning homologous sequences of information-containing biological molecules to infer the mutations that have accrued since (the molecule’s) common ancestor.

And a more mundane example: when editing a document, it is useful to be able to highlight which parts of the document have been changed between two versions of the document. Large parts of the document may be identical; it is useful to have an algorithm that identifies regions of similarity and difference and data structures that can communicate the differences in a succinct way.

Here we describe a few of the tools that can be used to evaluate imprecise matches.