Reduced alphabet similarity#

Since exact matches are cheap, one might wonder if there is a way to use a lossy encoding on the data and test for exact matches of an inexact encoding. This is the way of restricted alphabets.

Soundex#

The Soundex algorithm is a procedure for encoding family names developed by the U.S. Census bureau. It is described in patents 1,261,167 1 and 1,435,6632. The Works Progress Administration hired thousands of people to create indexes to the 1920, 1900, and 1880 U.S. Decennial censuses during the 1930s, copying some of the information onto cards that were sorted by Soundex code and transferred to microfilm.3

Soundex constructs a four-character symbol (one initial letter and three numbers) for each name. This symbol represents only some of the essential letters and sounds in the name. This dramatically reduces the number of (distinct) objects in the index, giving names with similar sounds (or misspellings) the same symbol.

The algorithm#

The algorithm is as follows:

  • The first letter of the name is retained as the first letter of the code.

  • Double letters and letters with identical numeric codes are combined.

  • If two consonants are separated by a vowel, both consonants are encoded; if two consonants are separated by a H or W and have the same Soundex number, they are combined.

  • The first three consonants in the name are replaced by numbers, excepting final S and Z; if there are fewer than three, add zeros.

If we work through the algorithm, we can encode a handful of names:

CAMPBELL: The first letter is C,

  • second consonant is M which encodes as 5

  • the next consonants P and B both encode to the same code (1), so are combined.

  • the third consonant to encode is L (4)

  • So CAMPBELL encodes as C514.

NICOLAE: The first letter is N,

  • the second consonant is C which codes as 2,

  • the next consonant is L which codes as 4

  • and we have run out of consonants, so we code NICOLAE as N24 or N240.

Installing a library to calculate Soundex#

Santhosh Thottingal has written an implementation of Soundex that we can use since it’s licensed under the GPL: https://github.com/libindic/soundex

%pip install soundex;
WARNING: The directory '/home/jovyan/.cache/pip' or its parent directory is not owned or is not writable by the current user. The cache has been disabled. Check the permissions and owner of that directory. If executing pip with sudo, you should use sudo's -H flag.

Collecting soundex
  Downloading soundex-1.1.3.tar.gz (9.1 kB)
  Preparing metadata (setup.py) ... ?25l-
 \
 |
 done
?25hCollecting silpa_common>=0.3 (from soundex)
  Downloading silpa_common-0.3.tar.gz (9.4 kB)
  Preparing metadata (setup.py) ... ?25l-
 \
 |
 done
?25hBuilding wheels for collected packages: soundex, silpa_common
  Building wheel for soundex (setup.py) ... ?25l-
 \
 |
 done
?25h  Created wheel for soundex: filename=soundex-1.1.3-py3-none-any.whl size=8875 sha256=9686c4b2ff90915f16c1ddcfbb38b8028a06b924a2826423f15642a55c6a472e
  Stored in directory: /tmp/pip-ephem-wheel-cache-rzpbgyr_/wheels/65/12/78/4d9a7d5a0522543b94535da7ed2b0fdbc4424add35be211f9e
  Building wheel for silpa_common (setup.py) ... ?25l-
 \
 |
 done
?25h  Created wheel for silpa_common: filename=silpa_common-0.3-py3-none-any.whl size=8465 sha256=613087f8b441ae4ec43d84252c776039f78bc8e2e283cb3a0dd652ab039b9ee8
  Stored in directory: /tmp/pip-ephem-wheel-cache-rzpbgyr_/wheels/1b/d2/97/c3c282b834d294b58cafd9e3ea4326e15fad90d6018098cddc
Successfully built soundex silpa_common
Installing collected packages: silpa_common, soundex
Successfully installed silpa_common-0.3 soundex-1.1.3
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv

Note: you may need to restart the kernel to use updated packages.
from soundex import Soundex
soundex = Soundex().soundex

And we can check that our answers above match Thottingal’s implementation:

soundex("CAMPBELL") , soundex("NICOLAE") 
('C514', 'N24')

We get the same answer for these two names.

Applying Soundex to a list of names#

To get a sense of what Soundex does to names, let us calculate the Soundex codes for the 162k most common surnames from the 2010 US Decennial Census4. Each of these names occurred more than 100 times in the 2010 US Decennial census, and together these names represent about 90% of the people enumerated.

surnames2010 = pd.read_csv("../../data/Names_2010Census.csv", na_filter=False)
surnames2010
name rank count prop100k cum_prop100k pctwhite pctblack pctapi pctaian pct2prace pcthispanic
0 SMITH 1 2442977 828.19 828.19 70.9 23.11 0.5 0.89 2.19 2.4
1 JOHNSON 2 1932812 655.24 1483.42 58.97 34.63 0.54 0.94 2.56 2.36
2 WILLIAMS 3 1625252 550.97 2034.39 45.75 47.68 0.46 0.82 2.81 2.49
3 BROWN 4 1437026 487.16 2521.56 57.95 35.6 0.51 0.87 2.55 2.52
4 JONES 5 1425470 483.24 3004.80 55.19 38.48 0.44 1 2.61 2.29
... ... ... ... ... ... ... ... ... ... ... ...
162249 DIETZMANN 160975 100 0.03 90062.93 96 0 0 (S) 0 (S)
162250 DOKAS 160975 100 0.03 90062.96 94 (S) 0 0 (S) (S)
162251 DONLEA 160975 100 0.03 90062.99 94 0 0 0 0 6
162252 DORIOTT 160975 100 0.03 90063.03 89 0 (S) 0 5 (S)
162253 ALL OTHER NAMES 0 29312001 9936.97 9936.97 66.65 8.53 7.97 0.86 2.32 13.67

162254 rows × 11 columns

Now that we have a list of 160k names, we can apply Soundex to each one. The Soundex codes will group the names even more broadly; we will be able to see the collections of identically-coded names once we calculate them.

surnames2010.name.apply(soundex)
0            S53
1           J525
2           W452
3            B65
4            J52
           ...  
162249      D325
162250        D2
162251       D54
162252       D63
162253    A43652
Name: name, Length: 162254, dtype: object

These codes are of variable length, and I was expecting codes no longer than four characters. I can define a function to apply Soundex but return a code that only has the first four characters:

def soundex4(name):
    return soundex(name, length=4)[0:4]
codes = surnames2010.name.apply(soundex4)
# Put Soundex codes into a new column
surnames2010["soundex"] = codes
surnames2010[["name", "soundex", "count"]]
name soundex count
0 SMITH S53 2442977
1 JOHNSON J525 1932812
2 WILLIAMS W452 1625252
3 BROWN B65 1437026
4 JONES J52 1425470
... ... ... ...
162249 DIETZMANN D325 100
162250 DOKAS D2 100
162251 DONLEA D54 100
162252 DORIOTT D63 100
162253 ALL OTHER NAMES A436 29312001

162254 rows × 3 columns

Now I have a datastructure that allows me to look up the codes for 162k names.

It seems likely that all the names are distinct, but we can check:

len(surnames2010.name.value_counts()), len(surnames2010)
(162254, 162254)

Yes, they are all distinct.

So, how many Soundex codes are there for 162k names?

len(surnames2010.soundex.value_counts())
4162

There are only 4,000 codes. This was the purpose of the indexing system; to put similarly-sounding names in the same file despite differences in spelling.

What are the most common codes?

surnames2010.soundex.value_counts()
soundex
M2      921
B2      818
B62     777
R2      768
L2      733
       ... 
U312      1
I436      1
Q341      1
T136      1
C464      1
Name: count, Length: 4162, dtype: int64

Note there are more than 900 names on this list with the code M2. What are these names?

surnames2010.loc[surnames2010.soundex=='M2']
name rank count prop100k cum_prop100k pctwhite pctblack pctapi pctaian pct2prace pcthispanic soundex
277 MCCOY 278 110744 37.54 26593.57 66.88 26.71 0.46 0.89 2.43 2.64 M2
299 MEJIA 300 104057 35.28 27391.54 3.37 0.45 1.33 0.14 0.2 94.51 M2
422 MOSS 423 76908 26.07 31052.62 69.73 24.71 0.48 0.68 2.12 2.28 M2
438 MCGEE 439 74542 25.27 31462.07 62.25 31.53 0.45 0.73 2.72 2.32 M2
467 MACK 468 71056 24.09 32176.67 44.76 49.06 0.7 0.53 2.59 2.37 M2
... ... ... ... ... ... ... ... ... ... ... ... ...
161463 MIGOYA 160975 100 0.03 90036.28 15 0 0 0 0 85 M2
161467 MIHAS 160975 100 0.03 90036.42 91 0 (S) 0 (S) (S) M2
161736 MAKAS 160975 100 0.03 90045.54 95 0 (S) 0 (S) (S) M2
161785 MAZAIKA 160975 100 0.03 90047.20 100 0 0 0 0 0 M2
161786 MAZEWSKI 160975 100 0.03 90047.23 100 0 0 0 0 0 M2

921 rows × 12 columns

It now may be clear that while it is easy to calculate Soundex from the name, the reverse procedure, determining the name from the Soundex code, is not possible.

Here is the table of Soundex codes and the number of names (out of our list of 162k):

surnames2010.soundex.value_counts()
soundex
M2      921
B2      818
B62     777
R2      768
L2      733
       ... 
U312      1
I436      1
Q341      1
T136      1
C464      1
Name: count, Length: 4162, dtype: int64

From this, we can examine a few of the names corresponding to Soundex codes that are unique in our list of 162k names:

(surnames2010.loc[surnames2010.soundex=='Q523'][["name", "soundex", "count"]], 
surnames2010.loc[surnames2010.soundex=='I164'][["name", "soundex", "count"]],
surnames2010.loc[surnames2010.soundex=='P136'][["name", "soundex", "count"]],
surnames2010.loc[surnames2010.soundex=='C464'][["name", "soundex", "count"]])
(            name soundex  count
 85498  QUANSTROM    Q523    219,
            name soundex  count
 85280  IBARROLA    I164    220,
              name soundex  count
 85205  PAPAPIETRO    P136    220,
              name soundex  count
 162125  COLARELLI    C464    100)

But we still can’t recover names from Soundex codes, even apparently unique ones, because our list of names was not comprehensive; the 2010 surnames dataset did not include 6 million distinct names that occurred fewer than 100 times. Many of our codes with “unique” names will lose their status when we add more names.

Soundex isn’t the end#

There are other phonetic encoding systems, including Double Metaphone and New York State Identification and Intelligence System Phonetic Code (NYIIS) that aim to identify similar-sounding names and words, and were designed with deficiencies in the Soundex encoding in mind.

A phonetic name-matching system is used by the US Transportation Security Administration for enforcing the U.S. No-Fly List.5. Matching on lossy phonetic encodings increases the number of travelers identified by the system as meriting closer scrutiny by requiring only similar-sounding names to names on the list.

Murphy alphabets#

(Murphy 2000 10.1093/protein/13.3.149)

In biological sequence analysis, alignments between biological molecules inferred to have common ancestry are a central problem. There are technologies that can turn information containing polymers into data representing the genetic information they contain.


1

Russell, Robert C. “Improvements in Indexes.” US Patent 1,261,167, granted 1918

2

Russell, Robert C. “Improvements in Indexes.” US Patent 1,435,663, granted 1922

3

Prechtel-Kluskens, Claire. The WPA Census Soundexing Projects Prologue Magazine. Spring 2002, Vol. 34, No. 1
https://www.archives.gov/publications/prologue/2002/spring/soundex-projects.html

4

Comentz, Joshua. “Frequently Occurring Surnames in the 2010 Census” U.S. Census bureau technical report. October 2016 https://www.census.gov/topics/population/genealogy/data/2010_surnames.html

5

Goo, Sara Kehaulani. “Faulty No-Fly System Detailed.” Washington Post. 2004-10-09