Reduced alphabet similarity
Contents
Reduced alphabet similarity#
Since exact matches are cheap, one might wonder if there is a way to use a lossy encoding on the data and test for exact matches of an inexact encoding. This is the way of restricted alphabets.
Soundex#
The Soundex algorithm is a procedure for encoding family names developed by the U.S. Census bureau. It is described in patents 1,261,167 1 and 1,435,6632. The Works Progress Administration hired thousands of people to create indexes to the 1920, 1900, and 1880 U.S. Decennial censuses during the 1930s, copying some of the information onto cards that were sorted by Soundex code and transferred to microfilm.3
Soundex constructs a four-character symbol (one initial letter and three numbers) for each name. This symbol represents only some of the essential letters and sounds in the name. This dramatically reduces the number of (distinct) objects in the index, giving names with similar sounds (or misspellings) the same symbol.
The algorithm#
The algorithm is as follows:
The first letter of the name is retained as the first letter of the code.
Double letters and letters with identical numeric codes are combined.
If two consonants are separated by a vowel, both consonants are encoded; if two consonants are separated by a H or W and have the same Soundex number, they are combined.
The first three consonants in the name are replaced by numbers, excepting final S and Z; if there are fewer than three, add zeros.
If we work through the algorithm, we can encode a handful of names:
CAMPBELL: The first letter is C,
second consonant is M which encodes as 5
the next consonants P and B both encode to the same code (1), so are combined.
the third consonant to encode is L (4)
So CAMPBELL encodes as C514.
NICOLAE: The first letter is N,
the second consonant is C which codes as 2,
the next consonant is L which codes as 4
and we have run out of consonants, so we code NICOLAE as N24 or N240.
Installing a library to calculate Soundex#
Santhosh Thottingal has written an implementation of Soundex that we can use since it’s licensed under the GPL: https://github.com/libindic/soundex
%pip install soundex;
WARNING: The directory '/home/jovyan/.cache/pip' or its parent directory is not owned or is not writable by the current user. The cache has been disabled. Check the permissions and owner of that directory. If executing pip with sudo, you should use sudo's -H flag.
Collecting soundex
Downloading soundex-1.1.3.tar.gz (9.1 kB)
Preparing metadata (setup.py) ... ?25l-
\
|
done
?25hCollecting silpa_common>=0.3 (from soundex)
Downloading silpa_common-0.3.tar.gz (9.4 kB)
Preparing metadata (setup.py) ... ?25l-
\
|
done
?25hBuilding wheels for collected packages: soundex, silpa_common
Building wheel for soundex (setup.py) ... ?25l-
\
|
done
?25h Created wheel for soundex: filename=soundex-1.1.3-py3-none-any.whl size=8875 sha256=9686c4b2ff90915f16c1ddcfbb38b8028a06b924a2826423f15642a55c6a472e
Stored in directory: /tmp/pip-ephem-wheel-cache-rzpbgyr_/wheels/65/12/78/4d9a7d5a0522543b94535da7ed2b0fdbc4424add35be211f9e
Building wheel for silpa_common (setup.py) ... ?25l-
\
|
done
?25h Created wheel for silpa_common: filename=silpa_common-0.3-py3-none-any.whl size=8465 sha256=613087f8b441ae4ec43d84252c776039f78bc8e2e283cb3a0dd652ab039b9ee8
Stored in directory: /tmp/pip-ephem-wheel-cache-rzpbgyr_/wheels/1b/d2/97/c3c282b834d294b58cafd9e3ea4326e15fad90d6018098cddc
Successfully built soundex silpa_common
Installing collected packages: silpa_common, soundex
Successfully installed silpa_common-0.3 soundex-1.1.3
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
Note: you may need to restart the kernel to use updated packages.
from soundex import Soundex
soundex = Soundex().soundex
And we can check that our answers above match Thottingal’s implementation:
soundex("CAMPBELL") , soundex("NICOLAE")
('C514', 'N24')
We get the same answer for these two names.
Applying Soundex to a list of names#
To get a sense of what Soundex does to names, let us calculate the Soundex codes for the 162k most common surnames from the 2010 US Decennial Census4. Each of these names occurred more than 100 times in the 2010 US Decennial census, and together these names represent about 90% of the people enumerated.
surnames2010 = pd.read_csv("../../data/Names_2010Census.csv", na_filter=False)
surnames2010
name | rank | count | prop100k | cum_prop100k | pctwhite | pctblack | pctapi | pctaian | pct2prace | pcthispanic | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | SMITH | 1 | 2442977 | 828.19 | 828.19 | 70.9 | 23.11 | 0.5 | 0.89 | 2.19 | 2.4 |
1 | JOHNSON | 2 | 1932812 | 655.24 | 1483.42 | 58.97 | 34.63 | 0.54 | 0.94 | 2.56 | 2.36 |
2 | WILLIAMS | 3 | 1625252 | 550.97 | 2034.39 | 45.75 | 47.68 | 0.46 | 0.82 | 2.81 | 2.49 |
3 | BROWN | 4 | 1437026 | 487.16 | 2521.56 | 57.95 | 35.6 | 0.51 | 0.87 | 2.55 | 2.52 |
4 | JONES | 5 | 1425470 | 483.24 | 3004.80 | 55.19 | 38.48 | 0.44 | 1 | 2.61 | 2.29 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
162249 | DIETZMANN | 160975 | 100 | 0.03 | 90062.93 | 96 | 0 | 0 | (S) | 0 | (S) |
162250 | DOKAS | 160975 | 100 | 0.03 | 90062.96 | 94 | (S) | 0 | 0 | (S) | (S) |
162251 | DONLEA | 160975 | 100 | 0.03 | 90062.99 | 94 | 0 | 0 | 0 | 0 | 6 |
162252 | DORIOTT | 160975 | 100 | 0.03 | 90063.03 | 89 | 0 | (S) | 0 | 5 | (S) |
162253 | ALL OTHER NAMES | 0 | 29312001 | 9936.97 | 9936.97 | 66.65 | 8.53 | 7.97 | 0.86 | 2.32 | 13.67 |
162254 rows × 11 columns
Now that we have a list of 160k names, we can apply Soundex to each one. The Soundex codes will group the names even more broadly; we will be able to see the collections of identically-coded names once we calculate them.
surnames2010.name.apply(soundex)
0 S53
1 J525
2 W452
3 B65
4 J52
...
162249 D325
162250 D2
162251 D54
162252 D63
162253 A43652
Name: name, Length: 162254, dtype: object
These codes are of variable length, and I was expecting codes no longer than four characters. I can define a function to apply Soundex but return a code that only has the first four characters:
def soundex4(name):
return soundex(name, length=4)[0:4]
codes = surnames2010.name.apply(soundex4)
# Put Soundex codes into a new column
surnames2010["soundex"] = codes
surnames2010[["name", "soundex", "count"]]
name | soundex | count | |
---|---|---|---|
0 | SMITH | S53 | 2442977 |
1 | JOHNSON | J525 | 1932812 |
2 | WILLIAMS | W452 | 1625252 |
3 | BROWN | B65 | 1437026 |
4 | JONES | J52 | 1425470 |
... | ... | ... | ... |
162249 | DIETZMANN | D325 | 100 |
162250 | DOKAS | D2 | 100 |
162251 | DONLEA | D54 | 100 |
162252 | DORIOTT | D63 | 100 |
162253 | ALL OTHER NAMES | A436 | 29312001 |
162254 rows × 3 columns
Now I have a datastructure that allows me to look up the codes for 162k names.
It seems likely that all the names are distinct, but we can check:
len(surnames2010.name.value_counts()), len(surnames2010)
(162254, 162254)
Yes, they are all distinct.
So, how many Soundex codes are there for 162k names?
len(surnames2010.soundex.value_counts())
4162
There are only 4,000 codes. This was the purpose of the indexing system; to put similarly-sounding names in the same file despite differences in spelling.
What are the most common codes?
surnames2010.soundex.value_counts()
soundex
M2 921
B2 818
B62 777
R2 768
L2 733
...
U312 1
I436 1
Q341 1
T136 1
C464 1
Name: count, Length: 4162, dtype: int64
Note there are more than 900 names on this list with the code M2. What are these names?
surnames2010.loc[surnames2010.soundex=='M2']
name | rank | count | prop100k | cum_prop100k | pctwhite | pctblack | pctapi | pctaian | pct2prace | pcthispanic | soundex | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
277 | MCCOY | 278 | 110744 | 37.54 | 26593.57 | 66.88 | 26.71 | 0.46 | 0.89 | 2.43 | 2.64 | M2 |
299 | MEJIA | 300 | 104057 | 35.28 | 27391.54 | 3.37 | 0.45 | 1.33 | 0.14 | 0.2 | 94.51 | M2 |
422 | MOSS | 423 | 76908 | 26.07 | 31052.62 | 69.73 | 24.71 | 0.48 | 0.68 | 2.12 | 2.28 | M2 |
438 | MCGEE | 439 | 74542 | 25.27 | 31462.07 | 62.25 | 31.53 | 0.45 | 0.73 | 2.72 | 2.32 | M2 |
467 | MACK | 468 | 71056 | 24.09 | 32176.67 | 44.76 | 49.06 | 0.7 | 0.53 | 2.59 | 2.37 | M2 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
161463 | MIGOYA | 160975 | 100 | 0.03 | 90036.28 | 15 | 0 | 0 | 0 | 0 | 85 | M2 |
161467 | MIHAS | 160975 | 100 | 0.03 | 90036.42 | 91 | 0 | (S) | 0 | (S) | (S) | M2 |
161736 | MAKAS | 160975 | 100 | 0.03 | 90045.54 | 95 | 0 | (S) | 0 | (S) | (S) | M2 |
161785 | MAZAIKA | 160975 | 100 | 0.03 | 90047.20 | 100 | 0 | 0 | 0 | 0 | 0 | M2 |
161786 | MAZEWSKI | 160975 | 100 | 0.03 | 90047.23 | 100 | 0 | 0 | 0 | 0 | 0 | M2 |
921 rows × 12 columns
It now may be clear that while it is easy to calculate Soundex from the name, the reverse procedure, determining the name from the Soundex code, is not possible.
Here is the table of Soundex codes and the number of names (out of our list of 162k):
surnames2010.soundex.value_counts()
soundex
M2 921
B2 818
B62 777
R2 768
L2 733
...
U312 1
I436 1
Q341 1
T136 1
C464 1
Name: count, Length: 4162, dtype: int64
From this, we can examine a few of the names corresponding to Soundex codes that are unique in our list of 162k names:
(surnames2010.loc[surnames2010.soundex=='Q523'][["name", "soundex", "count"]],
surnames2010.loc[surnames2010.soundex=='I164'][["name", "soundex", "count"]],
surnames2010.loc[surnames2010.soundex=='P136'][["name", "soundex", "count"]],
surnames2010.loc[surnames2010.soundex=='C464'][["name", "soundex", "count"]])
( name soundex count
85498 QUANSTROM Q523 219,
name soundex count
85280 IBARROLA I164 220,
name soundex count
85205 PAPAPIETRO P136 220,
name soundex count
162125 COLARELLI C464 100)
But we still can’t recover names from Soundex codes, even apparently unique ones, because our list of names was not comprehensive; the 2010 surnames dataset did not include 6 million distinct names that occurred fewer than 100 times. Many of our codes with “unique” names will lose their status when we add more names.
Soundex isn’t the end#
There are other phonetic encoding systems, including Double Metaphone and New York State Identification and Intelligence System Phonetic Code (NYIIS) that aim to identify similar-sounding names and words, and were designed with deficiencies in the Soundex encoding in mind.
A phonetic name-matching system is used by the US Transportation Security Administration for enforcing the U.S. No-Fly List.5. Matching on lossy phonetic encodings increases the number of travelers identified by the system as meriting closer scrutiny by requiring only similar-sounding names to names on the list.
Murphy alphabets#
(Murphy 2000 10.1093/protein/13.3.149)
In biological sequence analysis, alignments between biological molecules inferred to have common ancestry are a central problem. There are technologies that can turn information containing polymers into data representing the genetic information they contain.
- 1
Russell, Robert C. “Improvements in Indexes.” US Patent 1,261,167, granted 1918
- 2
Russell, Robert C. “Improvements in Indexes.” US Patent 1,435,663, granted 1922
- 3
Prechtel-Kluskens, Claire. The WPA Census Soundexing Projects Prologue Magazine. Spring 2002, Vol. 34, No. 1
https://www.archives.gov/publications/prologue/2002/spring/soundex-projects.html- 4
Comentz, Joshua. “Frequently Occurring Surnames in the 2010 Census” U.S. Census bureau technical report. October 2016 https://www.census.gov/topics/population/genealogy/data/2010_surnames.html
- 5
Goo, Sara Kehaulani. “Faulty No-Fly System Detailed.” Washington Post. 2004-10-09