Edit distance example: Building inspection reports#

Next we will be looking at a dataset with free-form human-entered text that has a field with differences that are not meaningful. Meaningless differences between records that should be identical make it more challenging to find parts of the dataset that are similar (and should be treated the same).

This dataset is from the City of Chicago data portal, and documents building inspections in the City of Chicago Department of Buildings. City of Chicago Data Portal. https://data.cityofchicago.org/Buildings/Building-Violations/22u3-xenr Accessed September 20, 2023, filtered for violations from the first half of January 2022.

violations_df = pd.read_csv("../../data/ChicagoBuildingViolationsEarlyJanuary2023.csv")

If your environment does not have the levenshtein package installed, see if you can install it.

%pip install levenshtein
#!conda install -c conda-forge python-levenshtein
WARNING: The directory '/home/jovyan/.cache/pip' or its parent directory is not owned or is not writable by the current user. The cache has been disabled. Check the permissions and owner of that directory. If executing pip with sudo, you should use sudo's -H flag.

Requirement already satisfied: levenshtein in /opt/conda/lib/python3.11/site-packages (0.22.0)
Requirement already satisfied: rapidfuzz<4.0.0,>=2.3.0 in /opt/conda/lib/python3.11/site-packages (from levenshtein) (3.3.1)
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv

Note: you may need to restart the kernel to use updated packages.
violations_df
ID VIOLATION LAST MODIFIED DATE VIOLATION DATE VIOLATION CODE VIOLATION STATUS VIOLATION STATUS DATE VIOLATION DESCRIPTION VIOLATION LOCATION VIOLATION INSPECTOR COMMENTS VIOLATION ORDINANCE ... STREET NAME STREET TYPE PROPERTY GROUP SSA LATITUDE LONGITUDE LOCATION YEAR MONTH DAY
0 6729404 06/27/2023 01/10/2022 EV1111 COMPLIED 06/27/2023 MAINTAIN OR REPAIR HYDRO ELEVA NaN CITY FIRE RECALL TEST Failed to maintain hydraulic elevator equipmen... ... OAKDALE AVE 8066 8.0 41.935438 -87.642694 (41.93543799801343, -87.64269400121434) 2022 1 10
1 6726715 09/06/2022 01/03/2022 CN070024 COMPLIED 09/06/2022 REPAIR PORCH SYSTEM OTHER : :OTHER371335 FRONT PORCH - ROTTED WOOD CEILING, MISSING ALU... Failed to repair or replace defective or missi... ... STATE ST 535078 NaN 41.675337 -87.622431 (41.67533708701544, -87.6224312766971) 2022 1 3
2 6726705 06/29/2023 01/03/2022 CN071024 COMPLIED 06/29/2023 REPAIR STOOP OTHER : :OTHER FRONT STOOP, SIDE WALLS - WASHED OUT MORTAR ( ... Failed to maintain stoop in safe condition and... ... ESCANABA AVE 435511 NaN 41.749876 -87.553927 (41.749876029131826, -87.55392704663606) 2022 1 3
3 6731812 07/14/2023 01/13/2022 CN062024 COMPLIED 07/14/2023 PARAPET OTHER : :OTHER SOUTHWEST FRONT GARAGE PERAPET - WASHED OUT MO... Failed to maintain parapet wall in good repair... ... 103RD ST 24594 NaN 41.706930 -87.633353 (41.70693036655445, -87.6333526539513) 2022 1 13
4 6730529 06/13/2022 01/12/2022 CN136026 COMPLIED 06/13/2022 INSECTS OTHER : :OTHER BED BUG INFESTATION THROUGHOUT. - PROPERTY MUS... Exterminate insects and keep dwelling insect-f... ... LINCOLN AVE 12595 NaN 41.983469 -87.695449 (41.98346922607376, -87.69544875484024) 2022 1 12
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1827 6730801 06/22/2023 01/12/2022 NC2020 COMPLIED 06/22/2023 PERMIT REQUIRED OTHER : CANOPY REMOVED PER PERMIT #101003272 REAR YARD-OVERHEAD CANOPY ROOF INSTALLED BETWE... Failed to obtain building permit for erecting,... ... MAJOR AVE 123856 NaN 41.932068 -87.768851 (41.93206815540178, -87.76885130741938) 2022 1 12
1828 6730802 06/22/2023 01/12/2022 NC2071 COMPLIED 06/22/2023 WORK NOT ALLOWED OTHER : :OTHER REAR YARD-6-'4" HIGH VINYL FENCE INSTALLED AT ... Remove work performed without permit and resto... ... MAJOR AVE 123856 NaN 41.932068 -87.768851 (41.93206815540178, -87.76885130741938) 2022 1 12
1829 6730803 06/22/2023 01/12/2022 NC2020 COMPLIED 06/22/2023 PERMIT REQUIRED OTHER : :OTHER REAR YARD-OVERHEAD CANOPY ROOF INSTALLED BETWE... Failed to obtain building permit for erecting,... ... MAJOR AVE 123856 NaN 41.932068 -87.768851 (41.93206815540178, -87.76885130741938) 2022 1 12
1830 6735084 06/22/2023 01/04/2022 CN132016 OPEN NaN HEAT UNIT ADEQUATELY NaN LIVING ROOM 59 DEGREES, BEDROOM 60 DEGREES. GA... Heat dwelling unit adequately from September 1... ... WENTWORTH AVE 20105 NaN 41.766092 -87.629642 (41.76609247175032, -87.62964151748226) 2022 1 4
1831 6735085 06/22/2023 01/04/2022 CN132046 OPEN NaN SAFE WORKING CONDITIONS NaN FURNANCE OUT OF SERVICE AT TIME AT TIME OF INS... Provide and maintain every facility, piece of ... ... WENTWORTH AVE 20105 NaN 41.766092 -87.629642 (41.76609247175032, -87.62964151748226) 2022 1 4

1832 rows × 29 columns

Let’s poke around the data just a little:

violations_df["VIOLATION DATE"].value_counts().sort_index()
VIOLATION DATE
01/03/2022    205
01/04/2022    147
01/05/2022    160
01/06/2022    107
01/07/2022    229
01/08/2022    121
01/09/2022     27
01/10/2022    153
01/11/2022    127
01/12/2022    180
01/13/2022    119
01/14/2022    208
01/15/2022     49
Name: count, dtype: int64

These dates are from the first two weeks of 2022.

Building inspectors don’t record many violations on Sundays.

Let us look at the VIOLATION DESCRIPTION field:

violations_df["VIOLATION DESCRIPTION"].value_counts()
VIOLATION DESCRIPTION
MAINTAIN OR REPAIR ELECT ELEVA    222
MAINTAIN OR REPAIR HYDRO ELEVA    123
HEAT UNIT ADEQUATELY               89
SAFE WORKING CONDITIONS            64
ARRANGE PREMISE INSPECTION         57
                                 ... 
FIRE EXTNGSHR REQ, NONRESDNTL       1
ARRANGE FOR INSPECTION              1
ACCESS TO ROOF                      1
CEILING HEIGHT - 7'                 1
NO CLOSET/STORAGE UNDER STAIR       1
Name: count, Length: 233, dtype: int64
violations_df["VIOLATION INSPECTOR COMMENTS"].value_counts()
VIOLATION INSPECTOR COMMENTS
INSPECTION                                                                                                                                                                                                                                                             7
STOP ALL WORK ON PREMISES UNTIL APPROVED PLANS AND BUILDING PERMIT ARE OBTAINED 13-12-080 13-32-035                                                                                                                                                                    5
SCHEDULE ELEVATOR INSPECTION WITH CITY OF CHICAGO ELEVATOR BUREAU                                                                                                                                                                                                      4
1ST FLOOR/RESIDENTIAL ENTRY DOOR - GAPS AT BOTTOM OF DOOR ALLOWING RATS TO ENTER BUILDING                                                                                                                                                                              4
INTERIOR INSPECTION REQUIRED TO DETERMINE COMPLETE SCOPE OF WORK FOR NEW PERMIT APPLICATION.                                                                                                                                                                           4
                                                                                                                                                                                                                                                                      ..
NO ENTRY TO INSP FOR CONDITIONS AND DETECTORS . WENT ON HEAT ALSO NO ENTRY                                                                                                                                                                                             1
EAST ELEVATION, 3 STORY OPEN WOOD PORCH SYSTEM - ALL STAIR RISER BOARDS MISSING. SOME RUSTED JOIST HANGERS AT ALL FLOORS. SOME PICKETS LOOSE. SOME NAILED CONNECTIONS RUSTED AND PULLING.                                                                              1
916 APT 2 / BEDROOM RADIATOR - COLD TO TOUCH. 914 APT 1 / BEDROOM RADIATOR -  ICE COLD. RADIATORS -  NEED SERVICING, HEAT DOES NOT GO THROUGH AT TIME OF INSPECTION. CB132046 - SAFE WORKING CONDITIONS (14X-3-305.1, 14X-8-804.1 THRU 14X-8-804.4 AND 14X-8-804.5)    1
2ND FL APT / SMOKE DETECTOR - MISSING                                                                                                                                                                                                                                  1
FURNANCE OUT OF SERVICE AT TIME AT TIME OF INSPECTION.                                                                                                                                                                                                                 1
Name: count, Length: 1580, dtype: int64

Ok, so 1580 different comments for 1832 rows. Are any values missing?

len(violations_df), violations_df["VIOLATION INSPECTOR COMMENTS"].isnull().sum()
(1832, 98)

Just counting, we have 1832 records, 98 of which (5.3%) are missing inspector comments. Of the remaining 1734 records with comments, there are 1579 different comments. This means there are 253 more records than distinct comments, so 13.8% of the comments are exact duplicates somewhere within the dataset.

Some pairs of records, however, have non-identical comments that probably should be identical. We can use the Levenshtein distance to identify similar but not exactly identical comments; we will make some decisions about how to merge records, and produce a dataset that has had the text of similar comments merged.

All-against all comment comparison#

First, we will compare all the distinct comments to all the other comments.
(This requires \(n^2\) comparisons, where n is the 1832 distinct comments in the VIOLLATION INSPECTOR COMMENTS field.) We can afford 3.3 million comparisons between strings.

import Levenshtein
from Levenshtein import distance
distinctcomments = violations_df["VIOLATION INSPECTOR COMMENTS"].value_counts().index
distinctcommentnumber = violations_df["VIOLATION INSPECTOR COMMENTS"].value_counts().values

for i in range(len(distinctcomments)):
    for j in range(len(distinctcomments)):
        d = distance(distinctcomments[i], distinctcomments[j])
        # Find all the pairs that have Levenshtein distance 1 
        if d == 1:  
            # And print out the number of times and each of the pair.
            print(distinctcommentnumber[i], distinctcomments[i])
            print(distinctcommentnumber[j], distinctcomments[j])
5 STOP ALL WORK ON PREMISES UNTIL APPROVED PLANS AND BUILDING PERMIT ARE OBTAINED 13-12-080 13-32-035
1 STOP ALL WORK ON PREMISES UNTIL APPROVED PLANS AND BUILDING PERMIT AE OBTAINED 13-12-080 13-32-035
4 CHIMNEY - MISSING CAP
1 CHIMNEY -  MISSING CAP
4 USING COOKING STOVE AS HEATING DEVICE.
1 USING COOKING STOVE AS HEATING DEVICE
3 WORKING DOOR RESTRICTOR REQUIRED.
1 WORKING DOOR RESTRICTOR REQIRED.
2 STOVE HEAT/SPACE HEATERS.
1 STOVE HEAT/SPACE HEATER.
2 EMPLOY LICENSED AND BONDED PLUMBING CONTRACTOR FOR PLUMBING WORK TO BE PERFORMED TO ADDRESS PLUMBING VIOLATIONS
1 EMPLOY LICENSED AND BONDED PLUMBING CONTRACTOR FOR PLUMBING WORK TO BE PERFORMED TO ADDRESS PLUMBING VIOLATIONS.
2 MISSING SMOKE DETECTORS.
1 MISSING SMOKE DETECTOR.
2 AT BASEMENT
1 AT BASEMENT.
2 A MAINTENANCE CONTROL PROGRAM IS REQUIRED TO BE ON SITE.
1 AMAINTENANCE CONTROL PROGRAM IS REQUIRED TO BE ON SITE.
1 PROVIDE ELECTRICAL PERMIT IN ORDER TO CORRECT ALL ELECTRICAL VIOLATIONS  (14A-4-404.6)
1 PROVIDE ELECTRICAL PERMIT IN ORDER TO CORRECT ALL ELECTRICAL VIOLATIONS (14A-4-404.6)
1 STOP ALL WORK ON PREMISES UNTIL A BUILDING PERMIT IS OBTAINED 14A-3-306-1 14A-4-411-1
1 STOP ALL WORK ON PREMISES UNTIL A BUILDING PERMIT IS OBTAINED 14A-3-306-4 14A-4-411-1
1 MISSING SMOKE DETECTOR.
2 MISSING SMOKE DETECTORS.
1 MISSING SMOKE DETECTOR.
1 MISSING SMOKE DETECTOR
1 EMPLOY LICENSED AND BONDED PLUMBING CONTRACTOR FOR PLUMBING WORK TO BE PERFORMED TO ADDRESS PLUMBING VIOLATIONS.
2 EMPLOY LICENSED AND BONDED PLUMBING CONTRACTOR FOR PLUMBING WORK TO BE PERFORMED TO ADDRESS PLUMBING VIOLATIONS
1 INTERIOR OF BUILDING /NO RESPONSE. UNVERIFIED DETECTORS AND CONDITIONS. UNVERIFIED OCCUPANCY.
1 INTERIOR OF BUILDING /NO RESPONSE. UNVERIFIED DETECTORS AND CONDITIONS. UNVERFIED OCCUPANCY.
1 CAT. 1 TEST OVERDUE.
1 CAT. 5 TEST OVERDUE.
1 CAT. 5 TEST OVERDUE.
1 CAT. 1 TEST OVERDUE.
1 WORKING DOOR RESTRICTOR REQIRED.
3 WORKING DOOR RESTRICTOR REQUIRED.
1 REPAIR EMERGENCY PHONES
1 REPAIR EMERGENCY PHONE.
1 SOUTH ELEVATION / WINDOW SILLS - OPEN MORTAR JOINTS.
1 SOUTH ELEVATION / WINDOW SILLS - OPEN MORTAR JOINT.
1 STOP ALL WORK ON PREMISES UNTIL APPROVED PLANS AND BUILDING PERMIT AE OBTAINED 13-12-080 13-32-035
5 STOP ALL WORK ON PREMISES UNTIL APPROVED PLANS AND BUILDING PERMIT ARE OBTAINED 13-12-080 13-32-035
1 CODE DATA TAG ON CONTROLLER
1 CODE DATA TAG ON CONTROLER
1 ELEVATORS #4 & #5: PERFORM CAT1 & CAT5 TEST. PROVIDE TEST SHEET AND TAGS ONCE COMPLETED.
1 ELEVATORS #4 & #5: PERFORM CAT! & CAT5 TEST. PROVIDE TEST SHEET AND TAGS ONCE COMPLETED.
1 STOP ALL WORK ON PREMISES UNTIL A BUILDING PERMIT IS OBTAINED 14A-3-306-4 14A-4-411-1
1 STOP ALL WORK ON PREMISES UNTIL A BUILDING PERMIT IS OBTAINED 14A-3-306-1 14A-4-411-1
1 PERFORM CAT1 SAFETY TEST LEAVE CITY OF CHICAGO DOCUMENTATION.TRACTION ELEVATOR.
1 PERFORM CAT5 SAFETY TEST LEAVE CITY OF CHICAGO DOCUMENTATION.TRACTION ELEVATOR.
1 PROVIDE MAINTENANCE CONTROL PROGRAM
1 PROVIDE MAINTENANCE CONTROL PROGRAM.
1 PROVIDE MAINTENANCE CONTROL PROGRAM.
1 PROVIDE MAINTENANCE CONTROL PROGRAM
1 STOVE HEAT/SPACE HEATER.
2 STOVE HEAT/SPACE HEATERS.
1 PERFORM CAT5 SAFETY TEST LEAVE CITY OF CHICAGO DOCUMENTATION.TRACTION ELEVATOR.
1 PERFORM CAT1 SAFETY TEST LEAVE CITY OF CHICAGO DOCUMENTATION.TRACTION ELEVATOR.
1 CHIMNEY -  MISSING CAP
4 CHIMNEY - MISSING CAP
1 AT BASEMENT.
2 AT BASEMENT
1 USING COOKING STOVE AS HEATING DEVICE
4 USING COOKING STOVE AS HEATING DEVICE.
1 REPAIR EMERGENCY PHONE.
1 REPAIR EMERGENCY PHONES
1 ELEVATORS #4 & #5: PERFORM CAT! & CAT5 TEST. PROVIDE TEST SHEET AND TAGS ONCE COMPLETED.
1 ELEVATORS #4 & #5: PERFORM CAT1 & CAT5 TEST. PROVIDE TEST SHEET AND TAGS ONCE COMPLETED.
1 CODE DATA TAG ON CONTROLER
1 CODE DATA TAG ON CONTROLLER
1 MISSING SMOKE DETECTOR
1 MISSING SMOKE DETECTOR.
1 PROVIDE ELECTRICAL PERMIT IN ORDER TO CORRECT ALL ELECTRICAL VIOLATIONS (14A-4-404.6)
1 PROVIDE ELECTRICAL PERMIT IN ORDER TO CORRECT ALL ELECTRICAL VIOLATIONS  (14A-4-404.6)
1 AMAINTENANCE CONTROL PROGRAM IS REQUIRED TO BE ON SITE.
2 A MAINTENANCE CONTROL PROGRAM IS REQUIRED TO BE ON SITE.
1 INTERIOR OF BUILDING /NO RESPONSE. UNVERIFIED DETECTORS AND CONDITIONS. UNVERFIED OCCUPANCY.
1 INTERIOR OF BUILDING /NO RESPONSE. UNVERIFIED DETECTORS AND CONDITIONS. UNVERIFIED OCCUPANCY.
1 SOUTH ELEVATION / WINDOW SILLS - OPEN MORTAR JOINT.
1 SOUTH ELEVATION / WINDOW SILLS - OPEN MORTAR JOINTS.

Note: This all-against all comparison computes both

\[ d(\textrm{comment}_i, \textrm{comment}_j) \]

and

\[ d(\textrm{comment}_j, \textrm{comment}_i) \]

which are always equal. So this double-counts all the pairs. Note also we aren’t storing the distances, just calculating all of them and taking action for some values of the distance.

Glancing at these pairs of comments, somme are misspellings (UNVERFIED, CAT!, AE) and some are differences in punctuation. Surely, CHIMNEY - MISSING CAP is not meaningfully different from CHIMNEY -  MISSING CAP with two spaces after the hyphen. These are meaningless differences.

A handful of the differences, however, may be meaningful. CAT1 and CAT5 are different inspection schedules for different types of elevators, and the two comments have a distance of only 1. Differences in addresses, directions, permit numbers and cardinal directions can have small edit distances but can change the meaning.

Creating a new field that merges comments that are more similar than a certain edit distance risks corrupting some of the data (SOUTH and NORTH have edit distance 2). We should always preserve the original data when “cleaning” in this way. But merging comments that mean the same thing will make it easier to find groups and understand the data by removing differences (in punctuation and spelling) that are in no way relevant to the inspections.

To merge the similar-but-not-identical comments, we will choose a threshold for merging comments, create a dictionary that maps old comments to new cleaned up comments, write a function that uses this dictionary to replace the comments that are on the list, apply this function, and finally check that we get what we expect.

First, let us decide on a threshold. What do comments look like at d=4?

for i in range(len(distinctcomments)):
    for j in range(len(distinctcomments)):
        d = distance(distinctcomments[i], distinctcomments[j])
        if d == 4:   # Find all the pairs that have Levenshtein distance 1 
            print(distinctcommentnumber[i], distinctcomments[i])
            print(distinctcommentnumber[j], distinctcomments[j])
2 INTERIOR OF BUILDING - ROACH INFESTATION
2 INTERIOR OF BUILDING - MICE INFESTATION
2 INTERIOR OF BUILDING - MICE INFESTATION
2 INTERIOR OF BUILDING - ROACH INFESTATION
2 STOP WORK UNTIL PERMIT IS OBTAINED SECTION 14A-4-411
1 STOP WORK UNTIL NEW PERMIT IS OBTAINED SECTION 14A-4-411
1 CHIMNEY - FRACTURES WITH LOOSE MORTAR
1 CHIMNEY - FRACTURES AND LOOSE MORTAR
1 CHIMNEY - FRACTURES AND LOOSE MORTAR
1 CHIMNEY - FRACTURES WITH LOOSE MORTAR
1 CHIMNEY - FRACTURES AND LOOSE MORTAR
1 CHIMNEY -  FRACTURES , LOOSE MORTAR
1 CHIMNEY - FRACTURES AND LOOSE MORTAR
1 CHIMNEY - FRACTURES ,LOOSE MORTAR
1 NORTH CORNICE - WASHED OUT MORTAR.
1 WEST CORNICE - WASHED OUT MORTAR.
1 WEST CORNICE - WASHED OUT MORTAR.
1 NORTH CORNICE - WASHED OUT MORTAR.
1 CHIMNEY -  FRACTURES , LOOSE MORTAR
1 CHIMNEY - FRACTURES AND LOOSE MORTAR
1 STOP WORK UNTIL NEW PERMIT IS OBTAINED SECTION 14A-4-411
2 STOP WORK UNTIL PERMIT IS OBTAINED SECTION 14A-4-411
1 CHIMNEY - FRACTURES ,LOOSE MORTAR
1 CHIMNEY - FRACTURES AND LOOSE MORTAR

These all look pretty solidly identical. The insertion of the word “NEW”, the replacement of ” ,” with ” AND “. We can merge these.

What about d=6?

for i in range(len(distinctcomments)):
    for j in range(len(distinctcomments)):
        d = distance(distinctcomments[i], distinctcomments[j])
        if d == 6:   # Find all the pairs that have Levenshtein distance 1 
            print(distinctcommentnumber[i], distinctcomments[i])
            print(distinctcommentnumber[j], distinctcomments[j])
3 WEST, EAST, NORTH, ELEVATIONS - ABANDONED VEHICLE IN REAR YARD, BROKEN AWNING ON FRONT PORCH; LOTS OF JUNK, GARBAGE, APPLIANCES, BAGS OF CONCRETE, SHOPPING CARTS, AUTO PARTS AND TIRES, TELEVISION, BROKEN DOORS, TRASH, DEBRIS, ETC. THROUGHOUT; EXCESSIVE C
2 WEST, EAST, NORTH, ELEVATIONS - ABANDONED VEHICLE IN REAR YARD, BROKEN AWNING ON FRONT PORCH; LOTS OF JUNK, GARBAGE, APPLIANCES, BAGS OF CONCRETE, SHOPPING CARTS, AUTO PARTS AND TIRES, TELEVISION, BROKEN DOORS, TRASH, DEBRIS, ETC. THROUGHOUT; EXCESSIVE CLUTTER
2 NO SMOKE DETECTORS.
2 MISSING SMOKE DETECTORS.
2 WEST, EAST, NORTH, ELEVATIONS - ABANDONED VEHICLE IN REAR YARD, BROKEN AWNING ON FRONT PORCH; LOTS OF JUNK, GARBAGE, APPLIANCES, BAGS OF CONCRETE, SHOPPING CARTS, AUTO PARTS AND TIRES, TELEVISION, BROKEN DOORS, TRASH, DEBRIS, ETC. THROUGHOUT; EXCESSIVE CLUTTER
3 WEST, EAST, NORTH, ELEVATIONS - ABANDONED VEHICLE IN REAR YARD, BROKEN AWNING ON FRONT PORCH; LOTS OF JUNK, GARBAGE, APPLIANCES, BAGS OF CONCRETE, SHOPPING CARTS, AUTO PARTS AND TIRES, TELEVISION, BROKEN DOORS, TRASH, DEBRIS, ETC. THROUGHOUT; EXCESSIVE C
2 MISSING SMOKE DETECTORS.
2 NO SMOKE DETECTORS.
1 PANELS
1 NO HEAT
1 PANELS
1 X
1 NO HEAT
1 PANELS
1 NO HEAT
1 FIRE TEST
1 MISSING CARBON MONOXIDE DETECTOR.
1 NO CARBON MONOXIDE DETECTOR.
1 NO CARBON MONOXIDE DETECTOR.
1 MISSING CARBON MONOXIDE DETECTOR.
1 NORTH AND SOUTH ELEVATIONS - WASHED OUT MORTAR AND SPALLING BRICKS.
1 SOUTH AND WEST ELEVATIONS - WASHED OUT MORTAR AND SPALLING BRICKS.
1 EMERGENCY PHONE ISN'T WORKING. CAR 1
1 EMERGENCY PHONE ISN'T WORKING. ALL CARS
1 SOUTH AND WEST ELEVATIONS - WASHED OUT MORTAR AND SPALLING BRICKS.
1 NORTH AND SOUTH ELEVATIONS - WASHED OUT MORTAR AND SPALLING BRICKS.
1 EAST ELEVATION / PARAPET - STEP CRACKS .
1 WEST ELEVATION / GARAGE  - STEP CRACKS.
1 REPAIR BOTH EMERGENCY PHONES
1 REPAIR EMERGENCY PHONE.
1 WORKING PIT LIGHTING AND AN ACCESSIBLE LIGHT SWITCH IS REQUIRED.
1 WORKING PIT LIGHTING AND ACCESSIBLE LIGHT SWITCH REQUIRED.
1 SECURE UNSUPPORTED CONDUITS IN BASEMENT AND WHERE REQUIRED IN TYPICAL SPACES AND AREAS THROUGHOUT PREMISES
1 SECURE UNSUPPORTED BOXES IN BASEMENT AND WHERE REQUIRED IN TYPICAL SPACES AND AREAS THROUGHOUT PREMISES
1 SECURE UNSUPPORTED BOXES IN BASEMENT AND WHERE REQUIRED IN TYPICAL SPACES AND AREAS THROUGHOUT PREMISES
1 SECURE UNSUPPORTED CONDUITS IN BASEMENT AND WHERE REQUIRED IN TYPICAL SPACES AND AREAS THROUGHOUT PREMISES
1 PROVIDE FIRE EXTINGUISHER INSIDE MACHINE ROOM.
1 PROVIDE FIRE EXTINGUSHER IN MACHINE ROOM
1 FIRE TEST
1 NO HEAT
1 WORKING PIT LIGHTING AND ACCESSIBLE LIGHT SWITCH REQUIRED.
1 WORKING PIT LIGHTING AND AN ACCESSIBLE LIGHT SWITCH IS REQUIRED.
1 INTERIOR OF BUILDING /NO RESPONSE. UNVERIFIED DETECTORS AND CONDITIONS. UNVERFIED COMPLAINT OF REAR PORCH SLANTED.
1 INTERIOR OF BUILDING /NO RESPONSE. UNVERIFIED DETECTORS AND CONDITIONS. UNVERFIED COMPLAINT OF REAR PORCH UNSAFE.
1 INTERIOR OF BUILDING /NO RESPONSE. UNVERIFIED DETECTORS AND CONDITIONS. UNVERFIED COMPLAINT OF REAR PORCH UNSAFE.
1 INTERIOR OF BUILDING /NO RESPONSE. UNVERIFIED DETECTORS AND CONDITIONS. UNVERFIED COMPLAINT OF REAR PORCH SLANTED.
1 WEST ELEVATION / GARAGE  - STEP CRACKS.
1 EAST ELEVATION / PARAPET - STEP CRACKS .
1 REPAIR EMERGENCY PHONE.
1 REPAIR BOTH EMERGENCY PHONES
1 PROVIDE FIRE EXTINGUSHER IN MACHINE ROOM
1 PROVIDE FIRE EXTINGUISHER INSIDE MACHINE ROOM.
1 X
1 PANELS
1 EMERGENCY PHONE ISN'T WORKING. ALL CARS
1 EMERGENCY PHONE ISN'T WORKING. CAR 1

This is clearly too much. NO HEAT should not be merged with PANELS or FIRE TEST.

What about d=5?

for i in range(len(distinctcomments)):
    for j in range(len(distinctcomments)):
        d = distance(distinctcomments[i], distinctcomments[j])
        if d == 5:   # Find all the pairs that have Levenshtein distance 1 
            print(distinctcommentnumber[i], distinctcomments[i])
            print(distinctcommentnumber[j], distinctcomments[j])
2 SOUTH ELEVATION / DOWNSPOUT - MISSING.
2 WEST ELEVATION / DOWNSPOUT - MISSING
2 REAR EXTERIOR STAIRS,GRADE TO BASEMENT-MISSING HAND RAIL.
2 FRONT EXTERIOR STAIRS-GRADE TO BASEMENT-MISSING HAND RAIL.
2 FRONT EXTERIOR STAIRS-GRADE TO BASEMENT-MISSING HAND RAIL.
2 REAR EXTERIOR STAIRS,GRADE TO BASEMENT-MISSING HAND RAIL.
2 CODE COMPLIANT PIT LADDER REQUIRED.
1 A CODE COMPLIANT PIT LADDER IS REQUIRED.
2 WEST ELEVATION / DOWNSPOUT - MISSING
2 SOUTH ELEVATION / DOWNSPOUT - MISSING.
1 CHIMNEY - FRACTURES WITH LOOSE MORTAR
1 CHIMNEY -  FRACTURES , LOOSE MORTAR
1 CHIMNEY - FRACTURES WITH LOOSE MORTAR
1 CHIMNEY - FRACTURES ,LOOSE MORTAR
1 BUILDING INFESTED WITH MICE, INCLUDING APT. 2E.
1 BUILDING INFESTED WITH ROACHES, INCLUDING APT. 2E.
1 BUILDING INFESTED WITH ROACHES, INCLUDING APT. 2E.
1 BUILDING INFESTED WITH MICE, INCLUDING APT. 2E.
1 EAST PASSENGER: REPAIR EMERGENCY PHONE
1 EAST PASSENGER: REPAIR EMERGENCY ALARM
1 SOUTH ELEVATION / COPING - OPEN MORTAR JOINTS.
1 SOUTH ELEVATION / CORNICE - OPEN MORTAR JOINT.
1 EAST PASSENGER: REPAIR EMERGENCY ALARM
1 EAST PASSENGER: REPAIR EMERGENCY PHONE
1 APT 2R / CARBON MONOXIDE - MISSING . PER SECOND CODE 14X-5-504.9, 14B-9-915.1.1, 14B-9-915.2 , AND 14R-11-1105)
1 APT 1 / CARBON MONOXIDE - MISSING . PER SECTION CODE 14X-5-504.9, 14B-9-915.1.1, 14B-9-915.2 , AND 14R-11-1105)
1 REPAIR BOTH EMERGENCY PHONES
1 REPAIR EMERGENCY PHONES
1 SOUTH ELEVATION / CORNICE - OPEN MORTAR JOINT.
1 SOUTH ELEVATION / COPING - OPEN MORTAR JOINTS.
1 REPAIR EMERGENCY PHONES
1 REPAIR BOTH EMERGENCY PHONES
1 (PREVIOUSLY CITED ON 8/14/18) CODE DATA PLATE REQUIRED ON CONTROLLER. BOTH ELEVATORS
1 (PREVIOUSLY CITED ON 9/7/16) CODE DATA PLATE REQUIRED ON CONTROLLER. BOTH ELEVATORS.
1 CHIMNEY -  FRACTURES , LOOSE MORTAR
1 CHIMNEY - FRACTURES WITH LOOSE MORTAR
1 WEST ELEVATION / EXTERIOR WALL -  MASONRY FRACTURES.
1 NORTH ELEVATION / EXTERIOR WALL - MASONRY FRACTURES.
1 (PREVIOUSLY CITED ON 9/7/16) CODE DATA PLATE REQUIRED ON CONTROLLER. BOTH ELEVATORS.
1 (PREVIOUSLY CITED ON 8/14/18) CODE DATA PLATE REQUIRED ON CONTROLLER. BOTH ELEVATORS
1 A CODE COMPLIANT PIT LADDER IS REQUIRED.
2 CODE COMPLIANT PIT LADDER REQUIRED.
1 NORTH ELEVATION / EXTERIOR WALL - MASONRY FRACTURES.
1 WEST ELEVATION / EXTERIOR WALL -  MASONRY FRACTURES.
1 CHIMNEY - FRACTURES ,LOOSE MORTAR
1 CHIMNEY - FRACTURES WITH LOOSE MORTAR
1 APT 1 / CARBON MONOXIDE - MISSING . PER SECTION CODE 14X-5-504.9, 14B-9-915.1.1, 14B-9-915.2 , AND 14R-11-1105)
1 APT 2R / CARBON MONOXIDE - MISSING . PER SECOND CODE 14X-5-504.9, 14B-9-915.1.1, 14B-9-915.2 , AND 14R-11-1105)

These all look similar enough to merge. Let us merge comments with d less than or equal 5.

Apply the changes#

renamedict = {}

for i in range(len(distinctcomments)):
    for j in range(len(distinctcomments)):
        d = distance(distinctcomments[i], distinctcomments[j])
        # Find all the pairs that have Levenshtein distance  <=5 
        if d <= 5 and d !=0  :  
            # If comment i is used fewer times than j   
            if distinctcommentnumber[i] < distinctcommentnumber[j] :  
            # Flag comment i to be changed to comment j 
                 renamedict[distinctcomments[i]] = distinctcomments[j]
renamedict, len(renamedict)
({'MISSING SMOKE DETECTOR.': 'MISSING SMOKE DETECTORS.',
  'EMPLOY LICENSED AND BONDED PLUMBING CONTRACTOR FOR PLUMBING WORK TO BE PERFORMED TO ADDRESS PLUMBING VIOLATIONS.': 'EMPLOY LICENSED AND BONDED PLUMBING CONTRACTOR FOR PLUMBING WORK TO BE PERFORMED TO ADDRESS PLUMBING VIOLATIONS',
  'STOP WORK UNTIL A PERMIT IS OBTAINED SECTION 14A-4-411': 'STOP WORK UNTIL PERMIT IS OBTAINED SECTION 14A-4-411',
  'WORKING DOOR RESTRICTOR REQIRED.': 'WORKING DOOR RESTRICTOR REQUIRED.',
  'STOP ALL WORK ON PREMISES UNTIL APPROVED PLANS AND BUILDING PERMIT AE OBTAINED 13-12-080 13-32-035': 'STOP ALL WORK ON PREMISES UNTIL APPROVED PLANS AND BUILDING PERMIT ARE OBTAINED 13-12-080 13-32-035',
  'STOP WORK UNTIL NEW PERMIT IS OBTAINED SECTION 14A-4-411': 'STOP WORK UNTIL PERMIT IS OBTAINED SECTION 14A-4-411',
  'STOVE HEAT/SPACE HEATER.': 'STOVE HEAT/SPACE HEATERS.',
  'CHIMNEY -  MISSING CAP': 'CHIMNEY - MISSING CAP',
  'AT BASEMENT.': 'AT BASEMENT',
  'USING COOKING STOVE AS HEATING DEVICE': 'USING COOKING STOVE AS HEATING DEVICE.',
  'A CODE COMPLIANT PIT LADDER IS REQUIRED.': 'CODE COMPLIANT PIT LADDER REQUIRED.',
  'CHIMNEYS - MISSING CAPS': 'CHIMNEY - MISSING CAP',
  'MISSING SMOKE DETECTOR': 'MISSING SMOKE DETECTORS.',
  'AMAINTENANCE CONTROL PROGRAM IS REQUIRED TO BE ON SITE.': 'A MAINTENANCE CONTROL PROGRAM IS REQUIRED TO BE ON SITE.'},
 14)

This looks reasonable. Write the function:

def renamecomment(comment):
    if comment in renamedict.keys():
        return renamedict[comment]
    else:
        return comment

I don’t want to write a function without some kind of testing to see that it does what I think it does. I’ll write two tests, simple invocations of renamecomment with known correct answers.

renamecomment("NOT IN DICTIONARY") == "NOT IN DICTIONARY"
True
renamecomment('MISSING SMOKE DETECTOR') == 'MISSING SMOKE DETECTORS.'
True

Ok, seems to work. Let us create a new column with apply:

violations_df["COMMENTS FIXED"] = violations_df["VIOLATION INSPECTOR COMMENTS"].apply(renamecomment)

And how do we know that this did anything? Let’s check which rows were changed:

violations_df.query("`COMMENTS FIXED` != `VIOLATION INSPECTOR COMMENTS`")
ID VIOLATION LAST MODIFIED DATE VIOLATION DATE VIOLATION CODE VIOLATION STATUS VIOLATION STATUS DATE VIOLATION DESCRIPTION VIOLATION LOCATION VIOLATION INSPECTOR COMMENTS VIOLATION ORDINANCE ... STREET TYPE PROPERTY GROUP SSA LATITUDE LONGITUDE LOCATION YEAR MONTH DAY COMMENTS FIXED
5 6726415 03/08/2022 01/03/2022 VT1010 COMPLIED 03/08/2022 ARRANGE COMPLETED PERMIT INSP. NaN NaN Arrange mechanical ventilation or warm air hea... ... AVE 198348 22.0 41.980342 -87.669580 (41.98034244999325, -87.66958046014629) 2022 1 3 NaN
22 6732325 09/21/2022 01/11/2022 VT1030 COMPLIED 09/20/2022 ARRANGE PERMIT INSP. 1115 N CHRISTIANA AV NaN Arrange in writing for mechanical ventilating ... ... AVE 51247 NaN 41.901508 -87.710433 (41.901507612598834, -87.71043339750256) 2022 1 11 NaN
23 6735146 03/11/2022 01/09/2022 CN046013 COMPLIED 03/11/2022 UNAPPROVED HEATING DEVICE NaN STOVE HEAT/SPACE HEATER. Stop using cooking or water heating device as ... ... AVE 21765 NaN 41.691177 -87.612259 (41.691177471896076, -87.61225906980805) 2022 1 9 STOVE HEAT/SPACE HEATERS.
24 6727334 01/04/2022 01/04/2022 PL151137 OPEN NaN OPEN NaN NaN NaN ... AVE 400614 14.0 41.770689 -87.685700 (41.77068909096362, -87.68569998583654) 2022 1 4 NaN
25 6730399 01/12/2022 01/11/2022 PLD2290 OPEN NaN PROVIDE AUTOMATIC MIXING DEVIC NaN NaN Provide automatic mixing device for public lav... ... ST 1596 NaN 41.887222 -87.757826 (41.887221537144235, -87.75782634562349) 2022 1 11 NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1598 6729116 08/31/2022 01/10/2022 VT1010 COMPLIED 08/31/2022 ARRANGE COMPLETED PERMIT INSP. NaN NaN Arrange mechanical ventilation or warm air hea... ... AVE 156823 NaN 41.947644 -87.781798 (41.947644039902265, -87.78179844629967) 2022 1 10 NaN
1618 6730385 11/21/2022 01/11/2022 VT1010 COMPLIED 11/18/2022 ARRANGE COMPLETED PERMIT INSP. NaN NaN Arrange mechanical ventilation or warm air hea... ... ST 286211 NaN 41.841823 -87.701135 (41.84182295978573, -87.70113477596593) 2022 1 11 NaN
1621 6726984 12/07/2022 01/03/2022 PL151137 COMPLIED 12/07/2022 OPEN NaN NaN NaN ... AVE 60561 NaN 41.904908 -87.676385 (41.904907565709664, -87.67638534234142) 2022 1 3 NaN
1627 6728833 11/15/2022 01/07/2022 VT1010 OPEN NaN ARRANGE COMPLETED PERMIT INSP. NaN NaN Arrange mechanical ventilation or warm air hea... ... AVE 117688 NaN 41.933936 -87.715618 (41.933936312229164, -87.71561838752238) 2022 1 7 NaN
1774 6728033 05/25/2023 01/03/2022 EV1111 OPEN NaN MAINTAIN OR REPAIR HYDRO ELEVA NaN WORKING DOOR RESTRICTOR REQIRED. Failed to maintain hydraulic elevator equipmen... ... BLVD 14097 NaN 42.018888 -87.684722 (42.01888785624424, -87.68472174840898) 2022 1 3 WORKING DOOR RESTRICTOR REQUIRED.

112 rows × 30 columns

112 rows were changed. Hmm. So what did this do to the number of unique comments? We expect something like 100 fewer unique comments.. right?

print("Records:               ", len(violations_df))
print("Blank records:         ", violations_df["VIOLATION INSPECTOR COMMENTS"].isnull().sum())
print("Distinct records:      ", len(violations_df["VIOLATION INSPECTOR COMMENTS"].value_counts()) ) 
print("Distinct records fixed:", len(violations_df["COMMENTS FIXED"].value_counts()))
Records:                1832
Blank records:          98
Distinct records:       1580
Distinct records fixed: 1566

This is a little puzzling. The renaming changed 112 rows but only reduced the number of unique entries by 14.
This is not what I was expecting; something is not working quite as it should. Let’s look at some of the 112 rows that were changed more closely:

changed = violations_df.query("`COMMENTS FIXED` != `VIOLATION INSPECTOR COMMENTS`") 
changed[["VIOLATION INSPECTOR COMMENTS", "COMMENTS FIXED"]] 
VIOLATION INSPECTOR COMMENTS COMMENTS FIXED
5 NaN NaN
22 NaN NaN
23 STOVE HEAT/SPACE HEATER. STOVE HEAT/SPACE HEATERS.
24 NaN NaN
25 NaN NaN
... ... ...
1598 NaN NaN
1618 NaN NaN
1621 NaN NaN
1627 NaN NaN
1774 WORKING DOOR RESTRICTOR REQIRED. WORKING DOOR RESTRICTOR REQUIRED.

112 rows × 2 columns

This is disappointing. The 112 includes NaNs, since NaN does not match with NaN. Show only the non-NaN rows:

changed[~changed["COMMENTS FIXED"].isnull()][["VIOLATION INSPECTOR COMMENTS", "COMMENTS FIXED"]] 
VIOLATION INSPECTOR COMMENTS COMMENTS FIXED
23 STOVE HEAT/SPACE HEATER. STOVE HEAT/SPACE HEATERS.
34 STOP WORK UNTIL NEW PERMIT IS OBTAINED SECTION... STOP WORK UNTIL PERMIT IS OBTAINED SECTION 14A...
168 AT BASEMENT. AT BASEMENT
204 CHIMNEY - MISSING CAP CHIMNEY - MISSING CAP
529 AMAINTENANCE CONTROL PROGRAM IS REQUIRED TO BE... A MAINTENANCE CONTROL PROGRAM IS REQUIRED TO B...
777 USING COOKING STOVE AS HEATING DEVICE USING COOKING STOVE AS HEATING DEVICE.
802 MISSING SMOKE DETECTOR MISSING SMOKE DETECTORS.
876 A CODE COMPLIANT PIT LADDER IS REQUIRED. CODE COMPLIANT PIT LADDER REQUIRED.
898 CHIMNEYS - MISSING CAPS CHIMNEY - MISSING CAP
1104 STOP WORK UNTIL A PERMIT IS OBTAINED SECTION 1... STOP WORK UNTIL PERMIT IS OBTAINED SECTION 14A...
1188 EMPLOY LICENSED AND BONDED PLUMBING CONTRACTOR... EMPLOY LICENSED AND BONDED PLUMBING CONTRACTOR...
1201 MISSING SMOKE DETECTOR. MISSING SMOKE DETECTORS.
1485 STOP ALL WORK ON PREMISES UNTIL APPROVED PLANS... STOP ALL WORK ON PREMISES UNTIL APPROVED PLANS...
1774 WORKING DOOR RESTRICTOR REQIRED. WORKING DOOR RESTRICTOR REQUIRED.

These are the 14 that were changed.

Fix doesn’t fix enough comments#

Something doesn’t quite add up here. Let me count the number of pairs that were found that were close enough to be merged.

renamedict = {}
n=0
for i in range(len(distinctcomments)):
    for j in range(len(distinctcomments)):
        d = distance(distinctcomments[i], distinctcomments[j])
        # Find all the pairs that have Levenshtein distance  <=5 
        if d <= 5 and d !=0  :
            n = n+1
            # If comment i is used fewer times than j   
            if distinctcommentnumber[i] < distinctcommentnumber[j] :  
            # Flag comment i to be changed to comment j 
                 renamedict[distinctcomments[i]] = distinctcomments[j]
renamedict, len(renamedict), n
({'MISSING SMOKE DETECTOR.': 'MISSING SMOKE DETECTORS.',
  'EMPLOY LICENSED AND BONDED PLUMBING CONTRACTOR FOR PLUMBING WORK TO BE PERFORMED TO ADDRESS PLUMBING VIOLATIONS.': 'EMPLOY LICENSED AND BONDED PLUMBING CONTRACTOR FOR PLUMBING WORK TO BE PERFORMED TO ADDRESS PLUMBING VIOLATIONS',
  'STOP WORK UNTIL A PERMIT IS OBTAINED SECTION 14A-4-411': 'STOP WORK UNTIL PERMIT IS OBTAINED SECTION 14A-4-411',
  'WORKING DOOR RESTRICTOR REQIRED.': 'WORKING DOOR RESTRICTOR REQUIRED.',
  'STOP ALL WORK ON PREMISES UNTIL APPROVED PLANS AND BUILDING PERMIT AE OBTAINED 13-12-080 13-32-035': 'STOP ALL WORK ON PREMISES UNTIL APPROVED PLANS AND BUILDING PERMIT ARE OBTAINED 13-12-080 13-32-035',
  'STOP WORK UNTIL NEW PERMIT IS OBTAINED SECTION 14A-4-411': 'STOP WORK UNTIL PERMIT IS OBTAINED SECTION 14A-4-411',
  'STOVE HEAT/SPACE HEATER.': 'STOVE HEAT/SPACE HEATERS.',
  'CHIMNEY -  MISSING CAP': 'CHIMNEY - MISSING CAP',
  'AT BASEMENT.': 'AT BASEMENT',
  'USING COOKING STOVE AS HEATING DEVICE': 'USING COOKING STOVE AS HEATING DEVICE.',
  'A CODE COMPLIANT PIT LADDER IS REQUIRED.': 'CODE COMPLIANT PIT LADDER REQUIRED.',
  'CHIMNEYS - MISSING CAPS': 'CHIMNEY - MISSING CAP',
  'MISSING SMOKE DETECTOR': 'MISSING SMOKE DETECTORS.',
  'AMAINTENANCE CONTROL PROGRAM IS REQUIRED TO BE ON SITE.': 'A MAINTENANCE CONTROL PROGRAM IS REQUIRED TO BE ON SITE.'},
 14,
 116)

There were 116 pairs of comments (out of 3.3 million) that were closer than d = 5 but the loop only left instructions to rename 14 comments? I need to look closely at the logic here:

Let’s talk through the logic here.

Which comments get renamed? Comments that are more than 0 but 5 or fewer edits away from another comment, and for which the other comment has strictly greater abundance.

Aha. For this dataset, most of the comments are unique, and most of the pairs of similar comments have both comments unique. Comments that could be merged, but that are present only once are ignored.

Let us add logic that merges comments even if they have the same number of occurrences. We need an arbitrary rule to pick one comment when the counts are tied: let us choose the one that is first in alphabetical order. This is a canonicalization choice: we introduce an arbitrary choice between two similar comments to allow consolidation without having to make a choice that requires human input, as choosing the correct spelling would.

renamedict = {}
n=0
for i in range(len(distinctcomments)):
    for j in range(len(distinctcomments)):
        d = distance(distinctcomments[i], distinctcomments[j])
        # Find all the pairs that have Levenshtein distance  <=5 
        if d <= 5 and d !=0  :
            n = n+1
            # If comment i is used fewer times than j   
            if distinctcommentnumber[i] < distinctcommentnumber[j] :  
            # Flag comment i to be changed to comment j 
                 renamedict[distinctcomments[i]] = distinctcomments[j]
            # If two comments have the same number of counts:
            if distinctcommentnumber[i] == distinctcommentnumber[j]:
                # Rename if j comes first in lexicographic (alphabetical) order:
                if distinctcomments[j] <  distinctcomments[i]:
                    renamedict[distinctcomments[i]] = distinctcomments[j]
renamedict, len(renamedict), n
({'REAR EXTERIOR STAIRS,GRADE TO BASEMENT-MISSING HAND RAIL.': 'FRONT EXTERIOR STAIRS-GRADE TO BASEMENT-MISSING HAND RAIL.',
  'INTERIOR OF BUILDING - ROACH INFESTATION': 'INTERIOR OF BUILDING - MICE INFESTATION',
  'WEST ELEVATION / DOWNSPOUT - MISSING': 'SOUTH ELEVATION / DOWNSPOUT - MISSING.',
  'MISSING CARBON MONOXIDE DETECTOR.': 'MISSING CARBOB MONOXIDE DETECTOR',
  'MISSING SMOKE DETECTOR.': 'MISSING SMOKE DETECTOR',
  'EMPLOY LICENSED AND BONDED PLUMBING CONTRACTOR FOR PLUMBING WORK TO BE PERFORMED TO ADDRESS PLUMBING VIOLATIONS.': 'EMPLOY LICENSED AND BONDED PLUMBING CONTRACTOR FOR PLUMBING WORK TO BE PERFORMED TO ADDRESS PLUMBING VIOLATIONS',
  'CHIMNEY - FRACTURES WITH LOOSE MORTAR': 'CHIMNEY - FRACTURES ,LOOSE MORTAR',
  'CHIMNEY - FRACTURES AND LOOSE MORTAR': 'CHIMNEY - FRACTURES ,LOOSE MORTAR',
  'INTERIOR OF BUILDING /NO RESPONSE. UNVERIFIED DETECTORS AND CONDITIONS. UNVERIFIED OCCUPANCY.': 'INTERIOR OF BUILDING /NO RESPONSE. UNVERIFIED DETECTORS AND CONDITIONS. UNVERFIED OCCUPANCY.',
  'BUILDING INFESTED WITH ROACHES, INCLUDING APT. 2E.': 'BUILDING INFESTED WITH MICE, INCLUDING APT. 2E.',
  'STOP WORK UNTIL A PERMIT IS OBTAINED SECTION 14A-4-411': 'STOP WORK UNTIL PERMIT IS OBTAINED SECTION 14A-4-411',
  'EAST PASSENGER: REPAIR EMERGENCY PHONE': 'EAST PASSENGER: REPAIR EMERGENCY ALARM',
  'ERECTING NEW INTERIOR PARTITIONS': 'ERECTED NEW INTERIOR PARTITIONS',
  'APT 2R / CARBON MONOXIDE - MISSING . PER SECOND CODE 14X-5-504.9, 14B-9-915.1.1, 14B-9-915.2 , AND 14R-11-1105)': 'APT 1 / CARBON MONOXIDE - MISSING . PER SECTION CODE 14X-5-504.9, 14B-9-915.1.1, 14B-9-915.2 , AND 14R-11-1105)',
  '#15 & #16 ELEVATORS MUST BE REPRESENTED ON THE ELEVATOR CONTROL PANEL (LOBBY PANEL) INCLUDING AUXILARY PHASE I KEY SWITCH(TO RECALL #15, #16, AND #17 ALL SHARING THE SAME MACHINE ROOM), POSITION INDICATOR, INDICATOR OF OPERATING MODES, AND REPRESENTATION ON THE ETCHED DEPICTION (BUILDING LAYOUT MAP) OF THE BUILDING WITH ALL OF THE ELEVATORS WITHIN': '#15 & #16 ELEVATORS MUST BE REPRESENTED ON THE ELEVATOR CONTROL PANEL (LOBBY PANEL) INCLUDING AUXILARY PHASE I KEY SWITCH(TO RECALL #15, #16, AND #17 ALL SHARING THE SAME MACHINE ROOM), POSITION INDICATOR, INDICATOR OF OPERATING MODES, AND REEPRESENTATION ON THE ETCHED DEPICTION (BUILDING LAYOUT MAP) OF THE BUILDING WITH ALL OF THE ELEVATORS WITHIN.',
  'CAT. 5 TEST OVERDUE.': 'CAT. 1 TEST OVERDUE.',
  'NEXT CATEGORY 5 TEST DUE BY 06/02/2022.': 'NEXT CATEGORY 1 TEST DUE BY 02/04/2022.',
  'WEST CORNICE - WASHED OUT MORTAR.': 'NORTH CORNICE - WASHED OUT MORTAR.',
  'SOUTH ELEVATION / CORNICE - OPEN MORTAR JOINT.': 'SOUTH ELEVATION / COPING - OPEN MORTAR JOINTS.',
  'WORKING DOOR RESTRICTOR REQIRED.': 'WORKING DOOR RESTRICTOR REQUIRED.',
  'WORKING PIT LIGHTING AND AN ACCESSIBLE LIGHT SWITCH IS REQUIRED.': 'WORKING PIT LIGHTING AND ACCESSIBLE LIGHT SWITCH IS REQUIRED.',
  'REPAIR EMERGENCY PHONES': 'REPAIR EMERGENCY PHONE.',
  'SOUTH ELEVATION / WINDOW SILLS - OPEN MORTAR JOINTS.': 'SOUTH ELEVATION / WINDOW SILLS - OPEN MORTAR JOINT.',
  'STENCIL PIT FLOOR INFRONT OF COUNTERWEIGHT/PROVIDE SIGNAGE ON PIT WALL RUNBY/DANGER': 'STENCIL PIT FLOOR IN FROT OF COUNTERWEIGHT/PROVIDE SIGNAGE ON PIT WALL RUNBY/DANGER',
  'STOP ALL WORK ON PREMISES UNTIL APPROVED PLANS AND BUILDING PERMIT AE OBTAINED 13-12-080 13-32-035': 'STOP ALL WORK ON PREMISES UNTIL APPROVED PLANS AND BUILDING PERMIT ARE OBTAINED 13-12-080 13-32-035',
  'CODE DATA TAG ON CONTROLLER': 'CODE DATA TAG ON CONTROLER',
  'ELEVATORS #4 & #5: PERFORM CAT1 & CAT5 TEST. PROVIDE TEST SHEET AND TAGS ONCE COMPLETED.': 'ELEVATORS #4 & #5: PERFORM CAT! & CAT5 TEST. PROVIDE TEST SHEET AND TAGS ONCE COMPLETED.',
  'WORKING PIT LIGHTING AND ACCESSIBLE LIGHT SWITCH REQUIRED.': 'WORKING PIT LIGHTING AND ACCESSIBLE LIGHT SWITCH IS REQUIRED.',
  'STOP ALL WORK ON PREMISES UNTIL A BUILDING PERMIT IS OBTAINED 14A-3-306-4 14A-4-411-1': 'STOP ALL WORK ON PREMISES UNTIL A BUILDING PERMIT IS OBTAINED 14A-3-306-1 14A-4-411-1',
  'PERFORM CAT5 SAFETY TEST AND LEAVE CITY DOCUMENTATION.': 'PERFORM CAT1 SAFETY TEST AND LEAVE CITY DOCUMENTATION',
  'PROVIDE MAINTENANCE CONTROL PROGRAM.': 'PROVIDE MAINTENANCE CONTROL PROGRAM',
  'STOP WORK UNTIL NEW PERMIT IS OBTAINED SECTION 14A-4-411': 'STOP WORK UNTIL A PERMIT IS OBTAINED SECTION 14A-4-411',
  'STOVE HEAT/SPACE HEATER.': 'STOVE HEAT/SPACE HEATERS.',
  'WEST ELEVATION / EXTERIOR WALL -  MASONRY FRACTURES.': 'NORTH ELEVATION / EXTERIOR WALL - MASONRY FRACTURES.',
  'PERFORM CAT5 SAFETY TEST LEAVE CITY OF CHICAGO DOCUMENTATION.TRACTION ELEVATOR.': 'PERFORM CAT1 SAFETY TEST LEAVE CITY OF CHICAGO DOCUMENTATION.TRACTION ELEVATOR.',
  'CHIMNEY -  MISSING CAP': 'CHIMNEY - MISSING CAP',
  'WEST WINDOW SILLS - WASHED OUT MORTAR.': 'EAST WINDOW SILLS - WASHED OUT MORTAR.',
  'REAR YARD - ABANDONED CARS': 'REAR YARD - ABANADONED CAR.',
  'AT BASEMENT.': 'AT BASEMENT',
  'USING COOKING STOVE AS HEATING DEVICE': 'USING COOKING STOVE AS HEATING DEVICE.',
  '(PREVIOUSLY CITED ON 9/7/16) CODE DATA PLATE REQUIRED ON CONTROLLER. BOTH ELEVATORS.': '(PREVIOUSLY CITED ON 8/14/18) CODE DATA PLATE REQUIRED ON CONTROLLER. BOTH ELEVATORS',
  'A CODE COMPLIANT PIT LADDER IS REQUIRED.': 'CODE COMPLIANT PIT LADDER REQUIRED.',
  'CHIMNEY - FRACTURES ,LOOSE MORTAR': 'CHIMNEY -  FRACTURES , LOOSE MORTAR',
  'CHIMNEYS - MISSING CAPS': 'CHIMNEY -  MISSING CAP',
  'MISSING SMOKE DETECTOR': 'MISSING SMOKE DETECTORS.',
  'PROVIDE ELECTRICAL PERMIT IN ORDER TO CORRECT ALL ELECTRICAL VIOLATIONS (14A-4-404.6)': 'PROVIDE ELECTRICAL PERMIT IN ORDER TO CORRECT ALL ELECTRICAL VIOLATIONS  (14A-4-404.6)',
  'AMAINTENANCE CONTROL PROGRAM IS REQUIRED TO BE ON SITE.': 'A MAINTENANCE CONTROL PROGRAM IS REQUIRED TO BE ON SITE.',
  'APT 3 / SAFE WORK CONDITIONS - BOILER NOT WORKING . PER SECTION CODE (14X-3-305.1, 14X-8-804.1, 14X-8-804.4 AND 14X-8-804.5) BHA CODES': 'APT 103 / SAFE WORK CONDITIONS - BOILER NOT WORKING . PER SECTION CODE 14X-3-305.1, 14X-8-804.1, 14X-8-804.4 AND 14X-8-804.5) BHA CODES',
  'APT 1W / UNAPPROVE HEATING DEVICE = ELECT HEATER USED FOR HEAT . PER SECTION CODE . (14X-7-705.4 BHA CODES': 'APT 103 / UNAPPROVE HEATING DEVICE - ELECT HEATER USED FOR HEAT . PER SECTION CODE . (14X-7-705.4 BHA CODES',
  'REAR PORCHES -  STORING BBQ GRILLS.': 'REAR PORCH -  STORING BBQ GRILLS.',
  '914 / APT 1 / SMOKE DETECTOR - MISSING. CB197019 - SMOKE ALARMS / DETECTORS - INSTALL (14X-5-504.8, 14X-5-504.8.4,14B-9-907.2.10.1 THROUGH 14B-9-907.2.10.7, AND 14R-11-1104)': '2343 / APT 1 / SMOKE DETECTOR - MISSING. CB197019 - SMOKE ALARMS / DETECTORS - INSTALL (14X-5-504.8, 14X-5-504.8.4,14B-9-907.2.10.1 THROUGH 14B-9-907.2.10.7, AND 14R-11-1104)'},
 51,
 116)

This is more believable. The 116 pairs should represent 56 real pairs of comments, and some comments might have more than one partner, so there are only 51 renaming rules.

Let us apply this (larger!) comment-renaming fix, and see if it puts a larger dent in the number of unique comments:

violations_df["COMMENTS FIXED2"] = violations_df["VIOLATION INSPECTOR COMMENTS"].apply(renamecomment)
print("Records:               ", len(violations_df))
print("Distinct records:      ", len(violations_df["VIOLATION INSPECTOR COMMENTS"].value_counts()) ) 
print("Distinct records fixed:", len(violations_df["COMMENTS FIXED2"].value_counts()))
Records:                1832
Distinct records:       1580
Distinct records fixed: 1533

Now, although there were 51 comments slated to be renamed, it only reduced the number of unique comments from 1580 to 1544 (by 47). The comments are still 5.3% not-a-number, but now 11.0% have non-unique comments.

So we were able to “clean up” 54 out of 1832 (2.9%) of the comment records. The effect on the unique comments is a bit more dramatic than the effect on distinct comments. Unique comments dropped from

print ("Unique comments before", violations_df["VIOLATION INSPECTOR COMMENTS"].value_counts().value_counts()[1])
print ("Unique comments after ", violations_df["COMMENTS FIXED2"].value_counts().value_counts()[1])
print ("Out of                 ", len(violations_df))
Unique comments before 1451
Unique comments after  1375
Out of                  1832

The renaming-cleaning procedure reduced the fraction of the data in unique-comment categories by 4%. These are records that can presumably be better matched with other rows with similar content.

Clean data#

This exercise tried to clean up (meaningless) typographic differences in comment fields; in real-world problems, you are likely to encounter two databases with sometimes-matching and sometimes-mismatching data on more than one field. You might have a database of people from the HR department and another from the directory, or might have a database of addresses from a municipal database and a commercial database.

Some fields match more reliably than others; if your job is to somehow retrieve the most likely N matches to a test query, your solution will likely try to match on high-quality identifiers first (national ID numbers for people, FIPS codes for places, transaction ID numbers where appropriate) and attempt fuzzy matching only after exhausting the pool of more-likely-correct hits on identifiers, using a score that is a mixture of fuzzy-matching goodness-of-match scores from different fields.