Reduced alphabet similarity

Reduced alphabet similarity#

Since exact matches are cheap, one might wonder if there is a way to use a lossy encoding on the data and test for exact matches of an inexact encoding. This is the way of restricted alphabets.

Soundex#

The Soundex algorithm is a procedure for encoding family names developed by the U.S. Census bureau. It is described in patents 1,261,167 [1] and 1,435,663[2]. The Works Progress Administration hired thousands of people to create indexes to the 1920, 1900, and 1880 U.S. Decennial censuses during the 1930s, copying some of the information onto cards that were sorted by Soundex code and transferred to microfilm.[3]

Soundex constructs a four-character symbol (one initial letter and three numbers) for each name. This symbol represents only some of the essential letters and sounds in the name. This dramatically reduces the number of (distinct) objects in the index, giving names with similar sounds (or misspellings) the same symbol.

The algorithm#

The algorithm is as follows:

The first letter of the name is retained as the first letter of the code.
Double letters and letters with identical numeric codes are combined.
If two consonants are separated by a vowel, both consonants are encoded; if two consonants are separated by a H or W and have the same Soundex number, they are combined.
The first three consonants in the name are replaced by numbers, excepting final S and Z; if there are fewer than three, add zeros.

If we work through the algorithm, we can encode a handful of names:

CAMPBELL: The first letter is C,

second consonant is M which encodes as 5
the next consonants P and B both encode to the same code (1), so are combined.
the third consonant to encode is L (4)
So CAMPBELL encodes as C514.

NICOLAE: The first letter is N,

the second consonant is C which codes as 2,
the next consonant is L which codes as 4
and we have run out of consonants, so we code NICOLAE as N24 or N240.

Installing a library to calculate Soundex#

Santhosh Thottingal has written an implementation of Soundex that we can use since it’s licensed under the GPL: libindic/soundex

%pip install soundex;

Collecting soundex
  Downloading soundex-1.1.3.tar.gz (9.1 kB)

  Preparing metadata (setup.py) ... ?25l-

 done
?25hCollecting silpa_common>=0.3 (from soundex)
  Downloading silpa_common-0.3.tar.gz (9.4 kB)

  Preparing metadata (setup.py) ... ?25l-

 done
?25hBuilding wheels for collected packages: soundex, silpa_common

  Building wheel for soundex (setup.py) ... ?25l-

 done
?25h  Created wheel for soundex: filename=soundex-1.1.3-py3-none-any.whl size=8928 sha256=3c08b5425c9223bb9ce9305b4b7f17e99a36f302d1f9e2a5d241b408e7540eb4
  Stored in directory: /home/jovyan/.cache/pip/wheels/f2/3f/e6/7abb4f928e7684b9f65509ebcd241cad91f260e35d1b9c0bc2

  Building wheel for silpa_common (setup.py) ... ?25l-

 done
?25h  Created wheel for silpa_common: filename=silpa_common-0.3-py3-none-any.whl size=8523 sha256=aa85b6ca178bf4d96f14e25da302b897a0ee39ee9a9432949f37ba36eaaa3c44
  Stored in directory: /home/jovyan/.cache/pip/wheels/cf/d2/8d/3e359d137f6114538f7b9c6b06c1855a12050de481790f839c
Successfully built soundex silpa_common

Installing collected packages: silpa_common, soundex
Successfully installed silpa_common-0.3 soundex-1.1.3

Note: you may need to restart the kernel to use updated packages.

from soundex import Soundex
soundex = Soundex().soundex

And we can check that our answers above match Thottingal’s implementation:

soundex("CAMPBELL") , soundex("NICOLAE") 

('C514', 'N24')

We get the same answer for these two names.

Applying Soundex to a list of names#

To get a sense of what Soundex does to names, let us calculate the Soundex codes for the 162k most common surnames from the 2010 US Decennial Census[4]. Each of these names occurred more than 100 times in the 2010 US Decennial census, and together these names represent about 90% of the people enumerated.

surnames2010 = pd.read_csv("../../data/Names_2010Census.csv", na_filter=False)
surnames2010

	name	rank	count	prop100k	cum_prop100k	pctwhite	pctblack	pctapi	pctaian	pct2prace	pcthispanic
0	SMITH	1	2442977	828.19	828.19	70.9	23.11	0.5	0.89	2.19	2.4
1	JOHNSON	2	1932812	655.24	1483.42	58.97	34.63	0.54	0.94	2.56	2.36
2	WILLIAMS	3	1625252	550.97	2034.39	45.75	47.68	0.46	0.82	2.81	2.49
3	BROWN	4	1437026	487.16	2521.56	57.95	35.6	0.51	0.87	2.55	2.52
4	JONES	5	1425470	483.24	3004.80	55.19	38.48	0.44	1	2.61	2.29
...	...	...	...	...	...	...	...	...	...	...	...
162249	DIETZMANN	160975	100	0.03	90062.93	96	0	0	(S)	0	(S)
162250	DOKAS	160975	100	0.03	90062.96	94	(S)	0	0	(S)	(S)
162251	DONLEA	160975	100	0.03	90062.99	94	0	0	0	0	6
162252	DORIOTT	160975	100	0.03	90063.03	89	0	(S)	0	5	(S)
162253	ALL OTHER NAMES	0	29312001	9936.97	9936.97	66.65	8.53	7.97	0.86	2.32	13.67

162254 rows × 11 columns

Now that we have a list of 160k names, we can apply Soundex to each one. The Soundex codes will group the names even more broadly; we will be able to see the collections of identically-coded names once we calculate them.

surnames2010.name.apply(soundex)

0            S53
1           J525
2           W452
3            B65
4            J52
           ...  
162249      D325
162250        D2
162251       D54
162252       D63
162253    A43652
Name: name, Length: 162254, dtype: object

These codes are of variable length, and I was expecting codes no longer than four characters. I can define a function to apply Soundex but return a code that only has the first four characters:

def soundex4(name):
    return soundex(name, length=4)[0:4]

codes = surnames2010.name.apply(soundex4)

# Put Soundex codes into a new column
surnames2010["soundex"] = codes

surnames2010[["name", "soundex", "count"]]

	name	soundex	count
0	SMITH	S53	2442977
1	JOHNSON	J525	1932812
2	WILLIAMS	W452	1625252
3	BROWN	B65	1437026
4	JONES	J52	1425470
...	...	...	...
162249	DIETZMANN	D325	100
162250	DOKAS	D2	100
162251	DONLEA	D54	100
162252	DORIOTT	D63	100
162253	ALL OTHER NAMES	A436	29312001

162254 rows × 3 columns

Now I have a datastructure that allows me to look up the codes for 162k names.

It seems likely that all the names are distinct, but we can check:

len(surnames2010.name.value_counts()), len(surnames2010)

(162254, 162254)

Yes, they are all distinct.

So, how many Soundex codes are there for 162k names?

len(surnames2010.soundex.value_counts())

There are only 4,000 codes. This was the purpose of the indexing system; to put similarly-sounding names in the same file despite differences in spelling.

What are the most common codes?

surnames2010.soundex.value_counts()

soundex
M2      921
B2      818
B62     777
R2      768
L2      733
       ... 
Y341      1
Q435      1
L354      1
Q364      1
O314      1
Name: count, Length: 4162, dtype: int64

Note there are more than 900 names on this list with the code M2. What are these names?

surnames2010.loc[surnames2010.soundex=='M2']

	name	rank	count	prop100k	cum_prop100k	pctwhite	pctblack	pctapi	pctaian	pct2prace	pcthispanic	soundex
277	MCCOY	278	110744	37.54	26593.57	66.88	26.71	0.46	0.89	2.43	2.64	M2
299	MEJIA	300	104057	35.28	27391.54	3.37	0.45	1.33	0.14	0.2	94.51	M2
422	MOSS	423	76908	26.07	31052.62	69.73	24.71	0.48	0.68	2.12	2.28	M2
438	MCGEE	439	74542	25.27	31462.07	62.25	31.53	0.45	0.73	2.72	2.32	M2
467	MACK	468	71056	24.09	32176.67	44.76	49.06	0.7	0.53	2.59	2.37	M2
...	...	...	...	...	...	...	...	...	...	...	...	...
161463	MIGOYA	160975	100	0.03	90036.28	15	0	0	0	0	85	M2
161467	MIHAS	160975	100	0.03	90036.42	91	0	(S)	0	(S)	(S)	M2
161736	MAKAS	160975	100	0.03	90045.54	95	0	(S)	0	(S)	(S)	M2
161785	MAZAIKA	160975	100	0.03	90047.20	100	0	0	0	0	0	M2
161786	MAZEWSKI	160975	100	0.03	90047.23	100	0	0	0	0	0	M2

921 rows × 12 columns

It now may be clear that while it is easy to calculate Soundex from the name, the reverse procedure, determining the name from the Soundex code, is not possible.

Here is the table of Soundex codes and the number of names (out of our list of 162k):

surnames2010.soundex.value_counts()

soundex
M2      921
B2      818
B62     777
R2      768
L2      733
       ... 
Y341      1
Q435      1
L354      1
Q364      1
O314      1
Name: count, Length: 4162, dtype: int64

From this, we can examine a few of the names corresponding to Soundex codes that are unique in our list of 162k names:

(surnames2010.loc[surnames2010.soundex=='Q523'][["name", "soundex", "count"]], 
surnames2010.loc[surnames2010.soundex=='I164'][["name", "soundex", "count"]],
surnames2010.loc[surnames2010.soundex=='P136'][["name", "soundex", "count"]],
surnames2010.loc[surnames2010.soundex=='C464'][["name", "soundex", "count"]])

(            name soundex  count
 85498  QUANSTROM    Q523    219,
            name soundex  count
 85280  IBARROLA    I164    220,
              name soundex  count
 85205  PAPAPIETRO    P136    220,
              name soundex  count
 162125  COLARELLI    C464    100)

But we still can’t recover names from Soundex codes, even apparently unique ones, because our list of names was not comprehensive; the 2010 surnames dataset did not include 6 million distinct names that occurred fewer than 100 times. Many of our codes with “unique” names will lose their status when we add more names.

Soundex isn’t the end#

There are other phonetic encoding systems, including Double Metaphone and New York State Identification and Intelligence System Phonetic Code (NYIIS) that aim to identify similar-sounding names and words, and were designed with deficiencies in the Soundex encoding in mind.

A phonetic name-matching system is used by the US Transportation Security Administration for enforcing the U.S. No-Fly List.[5]. Matching on lossy phonetic encodings increases the number of travelers identified by the system as meriting closer scrutiny by requiring only similar-sounding names to names on the list.

Murphy alphabets#

(Murphy 2000 10.1093/protein/13.3.149)

In biological sequence analysis, alignments between biological molecules inferred to have common ancestry are a central problem. There are technologies that can turn information containing polymers into data representing the genetic information they contain.