Canonicalization

Many fields have explicit rules for standardizing how data is represented. This procedure, called canonicalization or normalization, removes differences that do not matter, so that tests for exact matches succeed even when details of the representation differ.

Name normalization

Consider a list of names.

names = [ "THOMPSON, EMILY",
          "THOMPSON,EMILY",
          "THOMPSON,   EMILY ",
          "Thompson, Emily ",
          "Thompson, Emily A."]

We cannot tell how many distinct Emily Thompsons this list represents, but differences in spacing and capitalization are usually not meaningful.

The string methods .upper(), .lower(), and .strip() are useful for removing them.

If we wanted to make all of these the same, we could write a function to clean up capitalization and spacing differences:

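def clean(s):
    # Uppercase, trim the ends, then split into surname and given name(s).
    fields = s.upper().strip().split(',')
    # Trim each part and rejoin with a single ", " separator.
    return fields[0].strip() + ", " + fields[1].strip()

for name in names:
    print(clean(name))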
THOMPSON, EMILY
THOMPSON, EMILY
THOMPSON, EMILY
THOMPSON, EMILY
THOMPSON, EMILY A.

The “messy” list of five different Emilies is now a list of only two (possibly) different names.
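
To put a number on that reduction, one option is to collect the strings into sets before and after cleaning. The sketch below repeats the list and the clean function from above so that it runs on its own; the variable names raw_distinct and clean_distinct are only illustrative.

```python
def clean(s):
    # Same normalization as above: uppercase, split on the comma, trim each part.
    fields = s.upper().strip().split(",")
    return fields[0].strip() + ", " + fields[1].strip()

names = ["THOMPSON, EMILY",
         "THOMPSON,EMILY",
         "THOMPSON,   EMILY ",
         "Thompson, Emily ",
         "Thompson, Emily A."]

# A set keeps a single copy of each distinct string.
raw_distinct = set(names)
clean_distinct = {clean(name) for name in names}

print(len(raw_distinct), "distinct raw strings")
print(len(clean_distinct), "distinct cleaned strings")
```

Whether the two remaining strings refer to the same person is a question that normalization alone cannot answer.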

URI normalization

As an example of how engineers use normalization, consider Uniform Resource Identifier (URI) normalization. URIs are strings that identify digital resources; URLs, which also say where a resource can be found, are a subset of URIs.

URI normalization is described in the standards document RFC 3986. The standard lists algorithmic modifications that make it possible to recognize when two differently written URIs are equivalent.

These rules permit a web server to store only one copy of a page even if browsers ask for it with dozens of synonymous addresses.

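Duplicate slashes in the path are commonly collapsed (a widely used convention rather than a rule in RFC 3986 itself, since some servers treat the two forms differently):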

http://example.com/foo//bar.html → http://example.com/foo/bar.html

Dot-segments (. and ..), the relative directory navigation symbols, should be resolved and removed:

http://example.com/foo/../bar.html → http://example.com/bar.html

Certain ASCII characters (letters, digits, hyphen, period, underscore, and tilde) never require percent-encoding in URI strings, and their encoded forms should be decoded:

http://example.com/%7Efoo → http://example.com/~foo

Although the standard allows multiple, synonymous addresses to be constructed, it would waste resources if it were not possible to boil disparate addresses down to a canonical form.
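
The three rewrites shown above can be sketched in Python with the standard library. This is a rough illustration, not a complete RFC 3986 normalizer: the helper names decode_unreserved, remove_dot_segments, and normalize_uri are made up for this example, only the path component is touched, and the path is assumed to start with "/" (as it does for an http URI with a host). Collapsing repeated slashes is included because it appears in the examples, though some servers treat the two forms as different resources.

```python
import re
import string
from urllib.parse import urlsplit, urlunsplit

# Characters that never need percent-encoding in a URI.
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")

def decode_unreserved(path):
    """Decode %XX escapes only when they stand for an unreserved character."""
    def replace(match):
        char = chr(int(match.group(1), 16))
        return char if char in UNRESERVED else match.group(0)
    return re.sub(r"%([0-9A-Fa-f]{2})", replace, path)

def remove_dot_segments(path):
    """Resolve "." and ".." segments in an absolute path."""
    segments = path.split("/")
    output = []
    for i, segment in enumerate(segments):
        if segment == "..":
            # Step up one level, but never above the root.
            if len(output) > 1:
                output.pop()
        elif segment != ".":
            output.append(segment)
            continue
        # A trailing "." or ".." leaves a directory, so keep the final slash.
        if i == len(segments) - 1:
            output.append("")
    return "/".join(output)

def normalize_uri(uri):
    parts = urlsplit(uri)
    path = decode_unreserved(parts.path)
    path = re.sub(r"/{2,}", "/", path)   # collapse repeated slashes
    path = remove_dot_segments(path)
    return urlunsplit(parts._replace(path=path))

for uri in ["http://example.com/foo//bar.html",
            "http://example.com/foo/../bar.html",
            "http://example.com/%7Efoo"]:
    print(uri, "->", normalize_uri(uri))
```

Escapes for reserved characters such as %2F are deliberately left alone, because decoding them would change how the path splits into segments.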

Cleaning up our data

Differences in the data representation that are superficial and unimportant can sometimes stand in the way of analysis. As a first step, we should explore our data to find what kinds of extraneous differences in encodings are present, and then make a copy of the data that has been “cleaned” of the differences that we were able to identify.
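
One way to start that exploration is to tally which superficial features occur in the raw strings before writing any cleaning code. The sketch below does this for the name list from earlier; the describe function and its labels are hypothetical and cover only the differences already seen in this section.

```python
from collections import Counter

names = ["THOMPSON, EMILY",
         "THOMPSON,EMILY",
         "THOMPSON,   EMILY ",
         "Thompson, Emily ",
         "Thompson, Emily A."]

def describe(s):
    """Return labels for the superficial features present in one string."""
    labels = []
    if s != s.strip():
        labels.append("leading/trailing whitespace")
    if "  " in s:
        labels.append("repeated spaces")
    if s != s.upper():
        labels.append("not all uppercase")
    if "," in s and ", " not in s:
        labels.append("no space after comma")
    return labels

# Count how many strings show each feature.
profile = Counter(label for name in names for label in describe(name))
for label, count in profile.most_common():
    print(count, label)
```

Each label that shows up in the tally points to one step the cleaning function needs to perform.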

The practice of collecting unusual (or pathological) examples of data can help you make your code work better.
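
As a small illustration, the clean function from the name example can be run against a handful of awkward inputs. The inputs below are invented for this sketch and are not taken from any real dataset.

```python
def clean(s):
    # Same normalization as in the name example.
    fields = s.upper().strip().split(",")
    return fields[0].strip() + ", " + fields[1].strip()

# Made-up awkward cases, each probing one assumption inside clean().
awkward = ["Thompson",               # no comma at all
           "",                       # empty string
           "Thompson, Emily, Jr.",   # more than one comma
           "Thompson,  Emily  A."]   # repeated spaces inside a field

results = {}
for raw in awkward:
    try:
        results[raw] = clean(raw)
    except IndexError:
        # clean() assumes there is at least one comma.
        results[raw] = None
    print(repr(raw), "->", repr(results[raw]))
```

The first two inputs raise an IndexError, the third silently loses everything after the second comma, and the fourth keeps its interior double space. Each is a case the original function does not handle.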