Edit distance example: Building inspection reports#
Next we will be looking at a dataset with free-form human-entered text that has a field with differences that are not meaningful. Meaningless differences between records that should be identical make it more challenging to find parts of the dataset that are similar (and should be treated the same).
This dataset is from the City of Chicago data portal, and documents building inspections in the City of Chicago Department of Buildings. City of Chicago Data Portal. https://data.cityofchicago.org/Buildings/Building-Violations/22u3-xenr Accessed September 20, 2023, filtered for violations from the first half of January 2022.
violations_df = pd.read_csv("../../data/ChicagoBuildingViolationsEarlyJanuary2023.csv")
If your environment does not have the levenshtein
package installed, see if you can install it.
%pip install levenshtein
#!conda install -c conda-forge python-levenshtein
Requirement already satisfied: levenshtein in /opt/conda/lib/python3.12/site-packages (0.27.1)
Requirement already satisfied: rapidfuzz<4.0.0,>=3.9.0 in /opt/conda/lib/python3.12/site-packages (from levenshtein) (3.13.0)
Note: you may need to restart the kernel to use updated packages.
violations_df
ID | VIOLATION LAST MODIFIED DATE | VIOLATION DATE | VIOLATION CODE | VIOLATION STATUS | VIOLATION STATUS DATE | VIOLATION DESCRIPTION | VIOLATION LOCATION | VIOLATION INSPECTOR COMMENTS | VIOLATION ORDINANCE | ... | STREET NAME | STREET TYPE | PROPERTY GROUP | SSA | LATITUDE | LONGITUDE | LOCATION | YEAR | MONTH | DAY | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 6729404 | 06/27/2023 | 01/10/2022 | EV1111 | COMPLIED | 06/27/2023 | MAINTAIN OR REPAIR HYDRO ELEVA | NaN | CITY FIRE RECALL TEST | Failed to maintain hydraulic elevator equipmen... | ... | OAKDALE | AVE | 8066 | 8.0 | 41.935438 | -87.642694 | (41.93543799801343, -87.64269400121434) | 2022 | 1 | 10 |
1 | 6726715 | 09/06/2022 | 01/03/2022 | CN070024 | COMPLIED | 09/06/2022 | REPAIR PORCH SYSTEM | OTHER : :OTHER371335 | FRONT PORCH - ROTTED WOOD CEILING, MISSING ALU... | Failed to repair or replace defective or missi... | ... | STATE | ST | 535078 | NaN | 41.675337 | -87.622431 | (41.67533708701544, -87.6224312766971) | 2022 | 1 | 3 |
2 | 6726705 | 06/29/2023 | 01/03/2022 | CN071024 | COMPLIED | 06/29/2023 | REPAIR STOOP | OTHER : :OTHER | FRONT STOOP, SIDE WALLS - WASHED OUT MORTAR ( ... | Failed to maintain stoop in safe condition and... | ... | ESCANABA | AVE | 435511 | NaN | 41.749876 | -87.553927 | (41.749876029131826, -87.55392704663606) | 2022 | 1 | 3 |
3 | 6731812 | 07/14/2023 | 01/13/2022 | CN062024 | COMPLIED | 07/14/2023 | PARAPET | OTHER : :OTHER | SOUTHWEST FRONT GARAGE PERAPET - WASHED OUT MO... | Failed to maintain parapet wall in good repair... | ... | 103RD | ST | 24594 | NaN | 41.706930 | -87.633353 | (41.70693036655445, -87.6333526539513) | 2022 | 1 | 13 |
4 | 6730529 | 06/13/2022 | 01/12/2022 | CN136026 | COMPLIED | 06/13/2022 | INSECTS | OTHER : :OTHER | BED BUG INFESTATION THROUGHOUT. - PROPERTY MUS... | Exterminate insects and keep dwelling insect-f... | ... | LINCOLN | AVE | 12595 | NaN | 41.983469 | -87.695449 | (41.98346922607376, -87.69544875484024) | 2022 | 1 | 12 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1827 | 6730801 | 06/22/2023 | 01/12/2022 | NC2020 | COMPLIED | 06/22/2023 | PERMIT REQUIRED | OTHER : CANOPY REMOVED PER PERMIT #101003272 | REAR YARD-OVERHEAD CANOPY ROOF INSTALLED BETWE... | Failed to obtain building permit for erecting,... | ... | MAJOR | AVE | 123856 | NaN | 41.932068 | -87.768851 | (41.93206815540178, -87.76885130741938) | 2022 | 1 | 12 |
1828 | 6730802 | 06/22/2023 | 01/12/2022 | NC2071 | COMPLIED | 06/22/2023 | WORK NOT ALLOWED | OTHER : :OTHER | REAR YARD-6-'4" HIGH VINYL FENCE INSTALLED AT ... | Remove work performed without permit and resto... | ... | MAJOR | AVE | 123856 | NaN | 41.932068 | -87.768851 | (41.93206815540178, -87.76885130741938) | 2022 | 1 | 12 |
1829 | 6730803 | 06/22/2023 | 01/12/2022 | NC2020 | COMPLIED | 06/22/2023 | PERMIT REQUIRED | OTHER : :OTHER | REAR YARD-OVERHEAD CANOPY ROOF INSTALLED BETWE... | Failed to obtain building permit for erecting,... | ... | MAJOR | AVE | 123856 | NaN | 41.932068 | -87.768851 | (41.93206815540178, -87.76885130741938) | 2022 | 1 | 12 |
1830 | 6735084 | 06/22/2023 | 01/04/2022 | CN132016 | OPEN | NaN | HEAT UNIT ADEQUATELY | NaN | LIVING ROOM 59 DEGREES, BEDROOM 60 DEGREES. GA... | Heat dwelling unit adequately from September 1... | ... | WENTWORTH | AVE | 20105 | NaN | 41.766092 | -87.629642 | (41.76609247175032, -87.62964151748226) | 2022 | 1 | 4 |
1831 | 6735085 | 06/22/2023 | 01/04/2022 | CN132046 | OPEN | NaN | SAFE WORKING CONDITIONS | NaN | FURNANCE OUT OF SERVICE AT TIME AT TIME OF INS... | Provide and maintain every facility, piece of ... | ... | WENTWORTH | AVE | 20105 | NaN | 41.766092 | -87.629642 | (41.76609247175032, -87.62964151748226) | 2022 | 1 | 4 |
1832 rows × 29 columns
Let’s poke around the data just a little:
violations_df["VIOLATION DATE"].value_counts().sort_index()
VIOLATION DATE
01/03/2022 205
01/04/2022 147
01/05/2022 160
01/06/2022 107
01/07/2022 229
01/08/2022 121
01/09/2022 27
01/10/2022 153
01/11/2022 127
01/12/2022 180
01/13/2022 119
01/14/2022 208
01/15/2022 49
Name: count, dtype: int64
These dates are from the first two weeks of 2022.
Building inspectors don’t record many violations on Sundays.
Let us look at the VIOLATION DESCRIPTION
field:
violations_df["VIOLATION DESCRIPTION"].value_counts()
VIOLATION DESCRIPTION
MAINTAIN OR REPAIR ELECT ELEVA 222
MAINTAIN OR REPAIR HYDRO ELEVA 123
HEAT UNIT ADEQUATELY 89
SAFE WORKING CONDITIONS 64
ARRANGE PREMISE INSPECTION 57
...
MAINT PLATFORM LIFT 1
SYSTEM II 1
DOOR W/PANIC HRDWR-200+ ASMBLY 1
INTERIOR FINISH FIRERESISTANCE 1
NO CLOSET/STORAGE UNDER STAIR 1
Name: count, Length: 233, dtype: int64
violations_df["VIOLATION INSPECTOR COMMENTS"].value_counts()
VIOLATION INSPECTOR COMMENTS
INSPECTION 7
STOP ALL WORK ON PREMISES UNTIL APPROVED PLANS AND BUILDING PERMIT ARE OBTAINED 13-12-080 13-32-035 5
USING COOKING STOVE AS HEATING DEVICE. 4
WORKING TWO WAY COMMUNICATION (PHONE) REQUIRED IN ELEVATOR CAB. 4
INTERIOR INSPECTION REQUIRED TO DETERMINE COMPLETE SCOPE OF WORK FOR NEW PERMIT APPLICATION. 4
..
BASEMENT/LINTLES - LINTELS STILL WARPED AND RUSTING; (3) NORTH; (1) WEST 1
SUBMIT CRITICAL EXAM REPORT FOR ENTIRE EXTERIOR OF BUILDING FOR THE EXISTING HIGH-RISE BY A REGISTERED DESIGN PROFESSIONAL. THE EXAM IS A CLOSE-UP VISUAL EXAMINATION OF THE ENTIRE EXTERIOR ENVELOPE IS TO IDENTIFYING DEFICIENCIES AND DETERMINE IF REPAIR IS REQUIRED. EXTERIOR ENVELOPE TO INCLUDE BUT NOT LIMITED TO ROOF, EXTERIOR WALLS, WINDOWS, DOORS, BALCONIES, FIRE ESCAPES, CHIMNEYS, MECHANICAL EQUIPMENT, MARQUEES, CANOPIES, SIGNS, FLAG POLES AND EXTERIOR MAINTENANCE SYSTEMS. A SIGNED AND SEALED REPORT, DETAILING THE SCOPE AND FINDINGS OF THE EXAMINATION, TOGETHER WITH RECOMMENDATIONS FOR REPAIR WHERE DEFICIENCIES ARE FOUND, SHALL BE PROVIDED TO THE OWNER, AND FILED WITH THE BUILDING DEPARTMENT. IT IS THE OWNER'S RESPONSIBILITY TO OBTAIN REQUIRED PERMITS, ACCOMPANIED BY SUPPORTING DOCUMENTS AND CONSTRUCTION DOCUMENT PLANS. THE OWNER MUST KEEP A COPY OF THE MOST RECENT REPORT, PERMIT AND CONSTRUCTION DOCUMENTS ON FILE AT THE PREMISES AND MAKE AVAILABLE FOR INSPECTION REVIEW AND APPROVAL. SEE MAINTENANCE OF HIGH-RISE EXTERIOR WALLS AND ENCLOSURES RULES AND REGULATIONS. SUBMIT TO DEPARTMENT OF BUILDINGS, ATTENTION: EXTERIOR WALL PROGRAM 3RD FLOOR, 2045 W. WASHINGTON BLVD. 60612 (14A-6-603.2 THROUGH 14A-6-603.2.6, 14A-4-401.1 AND 14A-4-411.1) 1
WEST ELEVATION, MAIN BUILDING ENTRY AND EMPLOYEE ENTRY. OWNER TO IMMEDIATELY TAKE TEMPORARY MEASURES TO PROTECT THE PUBLIC, PREVENT IMMINENT HARM TO PEOPLE AND TO BEGIN PERMANENT REPAIRS WHERE THE ENVELOPE AND EXTERIOR WALLS OF A HIGH-RISE BUILDING IS IN UNSAFE CONDITION OR IN NEED OF REPAIR OR REINFORCEMENT. A BUILDING PERMIT APPLICATION COVERING THE SCOPE OF THE EMERGENCY REPAIR MUST BE SUBMITTED TO THE DEPARTMENT OF BUILDINGS ON THE NEXT BUSINESS DAY. THE DEPARTMENT OF TRANSPORTATION MAY NOT ISSUE A PERMIT FOR USE OF THE PUBLIC WAY TO INSTALL SCAFFOLDING/CANOPY UNLESS A PERMIT FOR THE SCAFFOLDING/CANOPY HAS BEEN OBTAINED FROM THE BUILDING DEPARTMENT. (14A-6-603.2.6, 14A-6-603.2, 14A-4-401.1.1, 14A-4-401.1.2, 14A-4-404.16 THROUGH 14A-4-404.16.5, 14A-4-404.22.3 AND 14B-33-3306) 1
REMOVED EXISTING HOUSE DOWN TO FIRST FLOOR JOISTS BUILDING 3 NEW STORIES OVER EXISTING FOUNDATION AND FIRST FLOOR JOISTS 1
SUBMIT PLANS AND OBTAIN PERMITS FOR ALL RELATED WORK TO AUTHORIZE ALTERATIONS TO BASEMENT OR RESTORE THE BUILDING TO THE ORIGINAL STATE. 1
Name: count, Length: 1580, dtype: int64
Ok, so 1580 different comments for 1832 rows. Are any values missing?
len(violations_df), violations_df["VIOLATION INSPECTOR COMMENTS"].isnull().sum()
(1832, np.int64(98))
Just counting, we have 1832 records, 98 of which (5.3%) are missing inspector comments. Of the remaining 1734 records with comments, there are 1579 different comments. This means there are 253 more records than distinct comments, so 13.8% of the comments are exact duplicates somewhere within the dataset.
Some pairs of records, however, have non-identical comments that probably should be identical. We can use the Levenshtein distance to identify similar but not exactly identical comments; we will make some decisions about how to merge records, and produce a dataset that has had the text of similar comments merged.
All-against all comment comparison#
First, we will compare all the distinct comments to all the other comments.
(This requires \(n^2\) comparisons, where n is the 1832 distinct comments
in the VIOLLATION INSPECTOR COMMENTS
field.) We can afford 3.3 million comparisons between strings.
import Levenshtein
from Levenshtein import distance
distinctcomments = violations_df["VIOLATION INSPECTOR COMMENTS"].value_counts().index
distinctcommentnumber = violations_df["VIOLATION INSPECTOR COMMENTS"].value_counts().values
for i in range(len(distinctcomments)):
for j in range(len(distinctcomments)):
d = distance(distinctcomments[i], distinctcomments[j])
# Find all the pairs that have Levenshtein distance 1
if d == 1:
# And print out the number of times and each of the pair.
print(distinctcommentnumber[i], distinctcomments[i])
print(distinctcommentnumber[j], distinctcomments[j])
5 STOP ALL WORK ON PREMISES UNTIL APPROVED PLANS AND BUILDING PERMIT ARE OBTAINED 13-12-080 13-32-035
1 STOP ALL WORK ON PREMISES UNTIL APPROVED PLANS AND BUILDING PERMIT AE OBTAINED 13-12-080 13-32-035
4 USING COOKING STOVE AS HEATING DEVICE.
1 USING COOKING STOVE AS HEATING DEVICE
4 CHIMNEY - MISSING CAP
1 CHIMNEY - MISSING CAP
3 WORKING DOOR RESTRICTOR REQUIRED.
1 WORKING DOOR RESTRICTOR REQIRED.
2 MISSING SMOKE DETECTORS.
1 MISSING SMOKE DETECTOR.
2 STOVE HEAT/SPACE HEATERS.
1 STOVE HEAT/SPACE HEATER.
2 EMPLOY LICENSED AND BONDED PLUMBING CONTRACTOR FOR PLUMBING WORK TO BE PERFORMED TO ADDRESS PLUMBING VIOLATIONS
1 EMPLOY LICENSED AND BONDED PLUMBING CONTRACTOR FOR PLUMBING WORK TO BE PERFORMED TO ADDRESS PLUMBING VIOLATIONS.
2 A MAINTENANCE CONTROL PROGRAM IS REQUIRED TO BE ON SITE.
1 AMAINTENANCE CONTROL PROGRAM IS REQUIRED TO BE ON SITE.
2 AT BASEMENT
1 AT BASEMENT.
1 CHIMNEY - MISSING CAP
4 CHIMNEY - MISSING CAP
1 PERFORM CAT1 SAFETY TEST LEAVE CITY OF CHICAGO DOCUMENTATION.TRACTION ELEVATOR.
1 PERFORM CAT5 SAFETY TEST LEAVE CITY OF CHICAGO DOCUMENTATION.TRACTION ELEVATOR.
1 PROVIDE MAINTENANCE CONTROL PROGRAM
1 PROVIDE MAINTENANCE CONTROL PROGRAM.
1 INTERIOR OF BUILDING /NO RESPONSE. UNVERIFIED DETECTORS AND CONDITIONS. UNVERFIED OCCUPANCY.
1 INTERIOR OF BUILDING /NO RESPONSE. UNVERIFIED DETECTORS AND CONDITIONS. UNVERIFIED OCCUPANCY.
1 AMAINTENANCE CONTROL PROGRAM IS REQUIRED TO BE ON SITE.
2 A MAINTENANCE CONTROL PROGRAM IS REQUIRED TO BE ON SITE.
1 SOUTH ELEVATION / WINDOW SILLS - OPEN MORTAR JOINT.
1 SOUTH ELEVATION / WINDOW SILLS - OPEN MORTAR JOINTS.
1 ELEVATORS #4 & #5: PERFORM CAT! & CAT5 TEST. PROVIDE TEST SHEET AND TAGS ONCE COMPLETED.
1 ELEVATORS #4 & #5: PERFORM CAT1 & CAT5 TEST. PROVIDE TEST SHEET AND TAGS ONCE COMPLETED.
1 REPAIR EMERGENCY PHONE.
1 REPAIR EMERGENCY PHONES
1 MISSING SMOKE DETECTOR
1 MISSING SMOKE DETECTOR.
1 USING COOKING STOVE AS HEATING DEVICE
4 USING COOKING STOVE AS HEATING DEVICE.
1 PROVIDE ELECTRICAL PERMIT IN ORDER TO CORRECT ALL ELECTRICAL VIOLATIONS (14A-4-404.6)
1 PROVIDE ELECTRICAL PERMIT IN ORDER TO CORRECT ALL ELECTRICAL VIOLATIONS (14A-4-404.6)
1 CODE DATA TAG ON CONTROLER
1 CODE DATA TAG ON CONTROLLER
1 INTERIOR OF BUILDING /NO RESPONSE. UNVERIFIED DETECTORS AND CONDITIONS. UNVERIFIED OCCUPANCY.
1 INTERIOR OF BUILDING /NO RESPONSE. UNVERIFIED DETECTORS AND CONDITIONS. UNVERFIED OCCUPANCY.
1 EMPLOY LICENSED AND BONDED PLUMBING CONTRACTOR FOR PLUMBING WORK TO BE PERFORMED TO ADDRESS PLUMBING VIOLATIONS.
2 EMPLOY LICENSED AND BONDED PLUMBING CONTRACTOR FOR PLUMBING WORK TO BE PERFORMED TO ADDRESS PLUMBING VIOLATIONS
1 MISSING SMOKE DETECTOR.
2 MISSING SMOKE DETECTORS.
1 MISSING SMOKE DETECTOR.
1 MISSING SMOKE DETECTOR
1 PROVIDE ELECTRICAL PERMIT IN ORDER TO CORRECT ALL ELECTRICAL VIOLATIONS (14A-4-404.6)
1 PROVIDE ELECTRICAL PERMIT IN ORDER TO CORRECT ALL ELECTRICAL VIOLATIONS (14A-4-404.6)
1 STOP ALL WORK ON PREMISES UNTIL A BUILDING PERMIT IS OBTAINED 14A-3-306-1 14A-4-411-1
1 STOP ALL WORK ON PREMISES UNTIL A BUILDING PERMIT IS OBTAINED 14A-3-306-4 14A-4-411-1
1 STOP ALL WORK ON PREMISES UNTIL A BUILDING PERMIT IS OBTAINED 14A-3-306-4 14A-4-411-1
1 STOP ALL WORK ON PREMISES UNTIL A BUILDING PERMIT IS OBTAINED 14A-3-306-1 14A-4-411-1
1 STOP ALL WORK ON PREMISES UNTIL APPROVED PLANS AND BUILDING PERMIT AE OBTAINED 13-12-080 13-32-035
5 STOP ALL WORK ON PREMISES UNTIL APPROVED PLANS AND BUILDING PERMIT ARE OBTAINED 13-12-080 13-32-035
1 ELEVATORS #4 & #5: PERFORM CAT1 & CAT5 TEST. PROVIDE TEST SHEET AND TAGS ONCE COMPLETED.
1 ELEVATORS #4 & #5: PERFORM CAT! & CAT5 TEST. PROVIDE TEST SHEET AND TAGS ONCE COMPLETED.
1 CODE DATA TAG ON CONTROLLER
1 CODE DATA TAG ON CONTROLER
1 CAT. 5 TEST OVERDUE.
1 CAT. 1 TEST OVERDUE.
1 CAT. 1 TEST OVERDUE.
1 CAT. 5 TEST OVERDUE.
1 PERFORM CAT5 SAFETY TEST LEAVE CITY OF CHICAGO DOCUMENTATION.TRACTION ELEVATOR.
1 PERFORM CAT1 SAFETY TEST LEAVE CITY OF CHICAGO DOCUMENTATION.TRACTION ELEVATOR.
1 PROVIDE MAINTENANCE CONTROL PROGRAM.
1 PROVIDE MAINTENANCE CONTROL PROGRAM
1 SOUTH ELEVATION / WINDOW SILLS - OPEN MORTAR JOINTS.
1 SOUTH ELEVATION / WINDOW SILLS - OPEN MORTAR JOINT.
1 REPAIR EMERGENCY PHONES
1 REPAIR EMERGENCY PHONE.
1 WORKING DOOR RESTRICTOR REQIRED.
3 WORKING DOOR RESTRICTOR REQUIRED.
1 AT BASEMENT.
2 AT BASEMENT
1 STOVE HEAT/SPACE HEATER.
2 STOVE HEAT/SPACE HEATERS.
Note: This all-against all comparison computes both
and
which are always equal. So this double-counts all the pairs. Note also we aren’t storing the distances, just calculating all of them and taking action for some values of the distance.
Glancing at these pairs of comments, somme are misspellings (UNVERFIED, CAT!, AE) and some are differences in
punctuation. Surely, CHIMNEY - MISSING CAP
is not meaningfully different from CHIMNEY - MISSING CAP
with two spaces after the hyphen. These are meaningless differences.
A handful of the differences, however, may be meaningful. CAT1 and CAT5 are different inspection schedules for different types of elevators, and the two comments have a distance of only 1. Differences in addresses, directions, permit numbers and cardinal directions can have small edit distances but can change the meaning.
Creating a new field that merges comments that are more similar than a certain edit distance risks corrupting some of the data (SOUTH and NORTH have edit distance 2). We should always preserve the original data when “cleaning” in this way. But merging comments that mean the same thing will make it easier to find groups and understand the data by removing differences (in punctuation and spelling) that are in no way relevant to the inspections.
To merge the similar-but-not-identical comments, we will choose a threshold for merging comments, create a dictionary that maps old comments to new cleaned up comments, write a function that uses this dictionary to replace the comments that are on the list, apply this function, and finally check that we get what we expect.
First, let us decide on a threshold. What do comments look like at d=4?
for i in range(len(distinctcomments)):
for j in range(len(distinctcomments)):
d = distance(distinctcomments[i], distinctcomments[j])
if d == 4: # Find all the pairs that have Levenshtein distance 1
print(distinctcommentnumber[i], distinctcomments[i])
print(distinctcommentnumber[j], distinctcomments[j])
2 INTERIOR OF BUILDING - MICE INFESTATION
2 INTERIOR OF BUILDING - ROACH INFESTATION
2 STOP WORK UNTIL PERMIT IS OBTAINED SECTION 14A-4-411
1 STOP WORK UNTIL NEW PERMIT IS OBTAINED SECTION 14A-4-411
2 INTERIOR OF BUILDING - ROACH INFESTATION
2 INTERIOR OF BUILDING - MICE INFESTATION
1 WEST CORNICE - WASHED OUT MORTAR.
1 NORTH CORNICE - WASHED OUT MORTAR.
1 CHIMNEY - FRACTURES , LOOSE MORTAR
1 CHIMNEY - FRACTURES AND LOOSE MORTAR
1 CHIMNEY - FRACTURES ,LOOSE MORTAR
1 CHIMNEY - FRACTURES AND LOOSE MORTAR
1 NORTH CORNICE - WASHED OUT MORTAR.
1 WEST CORNICE - WASHED OUT MORTAR.
1 CHIMNEY - FRACTURES WITH LOOSE MORTAR
1 CHIMNEY - FRACTURES AND LOOSE MORTAR
1 CHIMNEY - FRACTURES AND LOOSE MORTAR
1 CHIMNEY - FRACTURES , LOOSE MORTAR
1 CHIMNEY - FRACTURES AND LOOSE MORTAR
1 CHIMNEY - FRACTURES ,LOOSE MORTAR
1 CHIMNEY - FRACTURES AND LOOSE MORTAR
1 CHIMNEY - FRACTURES WITH LOOSE MORTAR
1 STOP WORK UNTIL NEW PERMIT IS OBTAINED SECTION 14A-4-411
2 STOP WORK UNTIL PERMIT IS OBTAINED SECTION 14A-4-411
These all look pretty solidly identical. The insertion of the word “NEW”, the replacement of ” ,” with ” AND “. We can merge these.
What about d=6?
for i in range(len(distinctcomments)):
for j in range(len(distinctcomments)):
d = distance(distinctcomments[i], distinctcomments[j])
if d == 6: # Find all the pairs that have Levenshtein distance 1
print(distinctcommentnumber[i], distinctcomments[i])
print(distinctcommentnumber[j], distinctcomments[j])
3 WEST, EAST, NORTH, ELEVATIONS - ABANDONED VEHICLE IN REAR YARD, BROKEN AWNING ON FRONT PORCH; LOTS OF JUNK, GARBAGE, APPLIANCES, BAGS OF CONCRETE, SHOPPING CARTS, AUTO PARTS AND TIRES, TELEVISION, BROKEN DOORS, TRASH, DEBRIS, ETC. THROUGHOUT; EXCESSIVE C
2 WEST, EAST, NORTH, ELEVATIONS - ABANDONED VEHICLE IN REAR YARD, BROKEN AWNING ON FRONT PORCH; LOTS OF JUNK, GARBAGE, APPLIANCES, BAGS OF CONCRETE, SHOPPING CARTS, AUTO PARTS AND TIRES, TELEVISION, BROKEN DOORS, TRASH, DEBRIS, ETC. THROUGHOUT; EXCESSIVE CLUTTER
2 WEST, EAST, NORTH, ELEVATIONS - ABANDONED VEHICLE IN REAR YARD, BROKEN AWNING ON FRONT PORCH; LOTS OF JUNK, GARBAGE, APPLIANCES, BAGS OF CONCRETE, SHOPPING CARTS, AUTO PARTS AND TIRES, TELEVISION, BROKEN DOORS, TRASH, DEBRIS, ETC. THROUGHOUT; EXCESSIVE CLUTTER
3 WEST, EAST, NORTH, ELEVATIONS - ABANDONED VEHICLE IN REAR YARD, BROKEN AWNING ON FRONT PORCH; LOTS OF JUNK, GARBAGE, APPLIANCES, BAGS OF CONCRETE, SHOPPING CARTS, AUTO PARTS AND TIRES, TELEVISION, BROKEN DOORS, TRASH, DEBRIS, ETC. THROUGHOUT; EXCESSIVE C
2 MISSING SMOKE DETECTORS.
2 NO SMOKE DETECTORS.
2 NO SMOKE DETECTORS.
2 MISSING SMOKE DETECTORS.
1 WEST ELEVATION / GARAGE - STEP CRACKS.
1 EAST ELEVATION / PARAPET - STEP CRACKS .
1 INTERIOR OF BUILDING /NO RESPONSE. UNVERIFIED DETECTORS AND CONDITIONS. UNVERFIED COMPLAINT OF REAR PORCH UNSAFE.
1 INTERIOR OF BUILDING /NO RESPONSE. UNVERIFIED DETECTORS AND CONDITIONS. UNVERFIED COMPLAINT OF REAR PORCH SLANTED.
1 INTERIOR OF BUILDING /NO RESPONSE. UNVERIFIED DETECTORS AND CONDITIONS. UNVERFIED COMPLAINT OF REAR PORCH SLANTED.
1 INTERIOR OF BUILDING /NO RESPONSE. UNVERIFIED DETECTORS AND CONDITIONS. UNVERFIED COMPLAINT OF REAR PORCH UNSAFE.
1 PROVIDE FIRE EXTINGUSHER IN MACHINE ROOM
1 PROVIDE FIRE EXTINGUISHER INSIDE MACHINE ROOM.
1 REPAIR EMERGENCY PHONE.
1 REPAIR BOTH EMERGENCY PHONES
1 X
1 PANELS
1 EMERGENCY PHONE ISN'T WORKING. ALL CARS
1 EMERGENCY PHONE ISN'T WORKING. CAR 1
1 SOUTH AND WEST ELEVATIONS - WASHED OUT MORTAR AND SPALLING BRICKS.
1 NORTH AND SOUTH ELEVATIONS - WASHED OUT MORTAR AND SPALLING BRICKS.
1 EMERGENCY PHONE ISN'T WORKING. CAR 1
1 EMERGENCY PHONE ISN'T WORKING. ALL CARS
1 NORTH AND SOUTH ELEVATIONS - WASHED OUT MORTAR AND SPALLING BRICKS.
1 SOUTH AND WEST ELEVATIONS - WASHED OUT MORTAR AND SPALLING BRICKS.
1 MISSING CARBON MONOXIDE DETECTOR.
1 NO CARBON MONOXIDE DETECTOR.
1 NO CARBON MONOXIDE DETECTOR.
1 MISSING CARBON MONOXIDE DETECTOR.
1 PANELS
1 X
1 PANELS
1 NO HEAT
1 NO HEAT
1 PANELS
1 NO HEAT
1 FIRE TEST
1 SECURE UNSUPPORTED BOXES IN BASEMENT AND WHERE REQUIRED IN TYPICAL SPACES AND AREAS THROUGHOUT PREMISES
1 SECURE UNSUPPORTED CONDUITS IN BASEMENT AND WHERE REQUIRED IN TYPICAL SPACES AND AREAS THROUGHOUT PREMISES
1 SECURE UNSUPPORTED CONDUITS IN BASEMENT AND WHERE REQUIRED IN TYPICAL SPACES AND AREAS THROUGHOUT PREMISES
1 SECURE UNSUPPORTED BOXES IN BASEMENT AND WHERE REQUIRED IN TYPICAL SPACES AND AREAS THROUGHOUT PREMISES
1 WORKING PIT LIGHTING AND ACCESSIBLE LIGHT SWITCH REQUIRED.
1 WORKING PIT LIGHTING AND AN ACCESSIBLE LIGHT SWITCH IS REQUIRED.
1 PROVIDE FIRE EXTINGUISHER INSIDE MACHINE ROOM.
1 PROVIDE FIRE EXTINGUSHER IN MACHINE ROOM
1 FIRE TEST
1 NO HEAT
1 REPAIR BOTH EMERGENCY PHONES
1 REPAIR EMERGENCY PHONE.
1 EAST ELEVATION / PARAPET - STEP CRACKS .
1 WEST ELEVATION / GARAGE - STEP CRACKS.
1 WORKING PIT LIGHTING AND AN ACCESSIBLE LIGHT SWITCH IS REQUIRED.
1 WORKING PIT LIGHTING AND ACCESSIBLE LIGHT SWITCH REQUIRED.
This is clearly too much. NO HEAT
should not be merged with PANELS
or FIRE TEST
.
What about d=5?
for i in range(len(distinctcomments)):
for j in range(len(distinctcomments)):
d = distance(distinctcomments[i], distinctcomments[j])
if d == 5: # Find all the pairs that have Levenshtein distance 1
print(distinctcommentnumber[i], distinctcomments[i])
print(distinctcommentnumber[j], distinctcomments[j])
2 WEST ELEVATION / DOWNSPOUT - MISSING
2 SOUTH ELEVATION / DOWNSPOUT - MISSING.
2 FRONT EXTERIOR STAIRS-GRADE TO BASEMENT-MISSING HAND RAIL.
2 REAR EXTERIOR STAIRS,GRADE TO BASEMENT-MISSING HAND RAIL.
2 REAR EXTERIOR STAIRS,GRADE TO BASEMENT-MISSING HAND RAIL.
2 FRONT EXTERIOR STAIRS-GRADE TO BASEMENT-MISSING HAND RAIL.
2 SOUTH ELEVATION / DOWNSPOUT - MISSING.
2 WEST ELEVATION / DOWNSPOUT - MISSING
2 CODE COMPLIANT PIT LADDER REQUIRED.
1 A CODE COMPLIANT PIT LADDER IS REQUIRED.
1 CHIMNEY - FRACTURES , LOOSE MORTAR
1 CHIMNEY - FRACTURES WITH LOOSE MORTAR
1 APT 1 / CARBON MONOXIDE - MISSING . PER SECTION CODE 14X-5-504.9, 14B-9-915.1.1, 14B-9-915.2 , AND 14R-11-1105)
1 APT 2R / CARBON MONOXIDE - MISSING . PER SECOND CODE 14X-5-504.9, 14B-9-915.1.1, 14B-9-915.2 , AND 14R-11-1105)
1 (PREVIOUSLY CITED ON 9/7/16) CODE DATA PLATE REQUIRED ON CONTROLLER. BOTH ELEVATORS.
1 (PREVIOUSLY CITED ON 8/14/18) CODE DATA PLATE REQUIRED ON CONTROLLER. BOTH ELEVATORS
1 CHIMNEY - FRACTURES ,LOOSE MORTAR
1 CHIMNEY - FRACTURES WITH LOOSE MORTAR
1 NORTH ELEVATION / EXTERIOR WALL - MASONRY FRACTURES.
1 WEST ELEVATION / EXTERIOR WALL - MASONRY FRACTURES.
1 A CODE COMPLIANT PIT LADDER IS REQUIRED.
2 CODE COMPLIANT PIT LADDER REQUIRED.
1 BUILDING INFESTED WITH ROACHES, INCLUDING APT. 2E.
1 BUILDING INFESTED WITH MICE, INCLUDING APT. 2E.
1 BUILDING INFESTED WITH MICE, INCLUDING APT. 2E.
1 BUILDING INFESTED WITH ROACHES, INCLUDING APT. 2E.
1 CHIMNEY - FRACTURES WITH LOOSE MORTAR
1 CHIMNEY - FRACTURES , LOOSE MORTAR
1 CHIMNEY - FRACTURES WITH LOOSE MORTAR
1 CHIMNEY - FRACTURES ,LOOSE MORTAR
1 EAST PASSENGER: REPAIR EMERGENCY ALARM
1 EAST PASSENGER: REPAIR EMERGENCY PHONE
1 SOUTH ELEVATION / COPING - OPEN MORTAR JOINTS.
1 SOUTH ELEVATION / CORNICE - OPEN MORTAR JOINT.
1 EAST PASSENGER: REPAIR EMERGENCY PHONE
1 EAST PASSENGER: REPAIR EMERGENCY ALARM
1 (PREVIOUSLY CITED ON 8/14/18) CODE DATA PLATE REQUIRED ON CONTROLLER. BOTH ELEVATORS
1 (PREVIOUSLY CITED ON 9/7/16) CODE DATA PLATE REQUIRED ON CONTROLLER. BOTH ELEVATORS.
1 REPAIR BOTH EMERGENCY PHONES
1 REPAIR EMERGENCY PHONES
1 APT 2R / CARBON MONOXIDE - MISSING . PER SECOND CODE 14X-5-504.9, 14B-9-915.1.1, 14B-9-915.2 , AND 14R-11-1105)
1 APT 1 / CARBON MONOXIDE - MISSING . PER SECTION CODE 14X-5-504.9, 14B-9-915.1.1, 14B-9-915.2 , AND 14R-11-1105)
1 SOUTH ELEVATION / CORNICE - OPEN MORTAR JOINT.
1 SOUTH ELEVATION / COPING - OPEN MORTAR JOINTS.
1 REPAIR EMERGENCY PHONES
1 REPAIR BOTH EMERGENCY PHONES
1 WEST ELEVATION / EXTERIOR WALL - MASONRY FRACTURES.
1 NORTH ELEVATION / EXTERIOR WALL - MASONRY FRACTURES.
These all look similar enough to merge. Let us merge comments with d less than or equal 5.
Apply the changes#
renamedict = {}
for i in range(len(distinctcomments)):
for j in range(len(distinctcomments)):
d = distance(distinctcomments[i], distinctcomments[j])
# Find all the pairs that have Levenshtein distance <=5
if d <= 5 and d !=0 :
# If comment i is used fewer times than j
if distinctcommentnumber[i] < distinctcommentnumber[j] :
# Flag comment i to be changed to comment j
renamedict[distinctcomments[i]] = distinctcomments[j]
renamedict, len(renamedict)
({'CHIMNEY - MISSING CAP': 'CHIMNEY - MISSING CAP',
'AMAINTENANCE CONTROL PROGRAM IS REQUIRED TO BE ON SITE.': 'A MAINTENANCE CONTROL PROGRAM IS REQUIRED TO BE ON SITE.',
'MISSING SMOKE DETECTOR': 'MISSING SMOKE DETECTORS.',
'USING COOKING STOVE AS HEATING DEVICE': 'USING COOKING STOVE AS HEATING DEVICE.',
'CHIMNEYS - MISSING CAPS': 'CHIMNEY - MISSING CAP',
'A CODE COMPLIANT PIT LADDER IS REQUIRED.': 'CODE COMPLIANT PIT LADDER REQUIRED.',
'STOP WORK UNTIL A PERMIT IS OBTAINED SECTION 14A-4-411': 'STOP WORK UNTIL PERMIT IS OBTAINED SECTION 14A-4-411',
'EMPLOY LICENSED AND BONDED PLUMBING CONTRACTOR FOR PLUMBING WORK TO BE PERFORMED TO ADDRESS PLUMBING VIOLATIONS.': 'EMPLOY LICENSED AND BONDED PLUMBING CONTRACTOR FOR PLUMBING WORK TO BE PERFORMED TO ADDRESS PLUMBING VIOLATIONS',
'MISSING SMOKE DETECTOR.': 'MISSING SMOKE DETECTORS.',
'STOP ALL WORK ON PREMISES UNTIL APPROVED PLANS AND BUILDING PERMIT AE OBTAINED 13-12-080 13-32-035': 'STOP ALL WORK ON PREMISES UNTIL APPROVED PLANS AND BUILDING PERMIT ARE OBTAINED 13-12-080 13-32-035',
'WORKING DOOR RESTRICTOR REQIRED.': 'WORKING DOOR RESTRICTOR REQUIRED.',
'STOP WORK UNTIL NEW PERMIT IS OBTAINED SECTION 14A-4-411': 'STOP WORK UNTIL PERMIT IS OBTAINED SECTION 14A-4-411',
'AT BASEMENT.': 'AT BASEMENT',
'STOVE HEAT/SPACE HEATER.': 'STOVE HEAT/SPACE HEATERS.'},
14)
This looks reasonable. Write the function:
def renamecomment(comment):
if comment in renamedict.keys():
return renamedict[comment]
else:
return comment
I don’t want to write a function without some kind of testing to see that it does what I think it does.
I’ll write two tests, simple invocations of renamecomment
with known correct answers.
renamecomment("NOT IN DICTIONARY") == "NOT IN DICTIONARY"
True
renamecomment('MISSING SMOKE DETECTOR') == 'MISSING SMOKE DETECTORS.'
True
Ok, seems to work. Let us create a new column with apply
:
violations_df["COMMENTS FIXED"] = violations_df["VIOLATION INSPECTOR COMMENTS"].apply(renamecomment)
And how do we know that this did anything? Let’s check which rows were changed:
violations_df.query("`COMMENTS FIXED` != `VIOLATION INSPECTOR COMMENTS`")
ID | VIOLATION LAST MODIFIED DATE | VIOLATION DATE | VIOLATION CODE | VIOLATION STATUS | VIOLATION STATUS DATE | VIOLATION DESCRIPTION | VIOLATION LOCATION | VIOLATION INSPECTOR COMMENTS | VIOLATION ORDINANCE | ... | STREET TYPE | PROPERTY GROUP | SSA | LATITUDE | LONGITUDE | LOCATION | YEAR | MONTH | DAY | COMMENTS FIXED | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5 | 6726415 | 03/08/2022 | 01/03/2022 | VT1010 | COMPLIED | 03/08/2022 | ARRANGE COMPLETED PERMIT INSP. | NaN | NaN | Arrange mechanical ventilation or warm air hea... | ... | AVE | 198348 | 22.0 | 41.980342 | -87.669580 | (41.98034244999325, -87.66958046014629) | 2022 | 1 | 3 | NaN |
22 | 6732325 | 09/21/2022 | 01/11/2022 | VT1030 | COMPLIED | 09/20/2022 | ARRANGE PERMIT INSP. | 1115 N CHRISTIANA AV | NaN | Arrange in writing for mechanical ventilating ... | ... | AVE | 51247 | NaN | 41.901508 | -87.710433 | (41.901507612598834, -87.71043339750256) | 2022 | 1 | 11 | NaN |
23 | 6735146 | 03/11/2022 | 01/09/2022 | CN046013 | COMPLIED | 03/11/2022 | UNAPPROVED HEATING DEVICE | NaN | STOVE HEAT/SPACE HEATER. | Stop using cooking or water heating device as ... | ... | AVE | 21765 | NaN | 41.691177 | -87.612259 | (41.691177471896076, -87.61225906980805) | 2022 | 1 | 9 | STOVE HEAT/SPACE HEATERS. |
24 | 6727334 | 01/04/2022 | 01/04/2022 | PL151137 | OPEN | NaN | OPEN | NaN | NaN | NaN | ... | AVE | 400614 | 14.0 | 41.770689 | -87.685700 | (41.77068909096362, -87.68569998583654) | 2022 | 1 | 4 | NaN |
25 | 6730399 | 01/12/2022 | 01/11/2022 | PLD2290 | OPEN | NaN | PROVIDE AUTOMATIC MIXING DEVIC | NaN | NaN | Provide automatic mixing device for public lav... | ... | ST | 1596 | NaN | 41.887222 | -87.757826 | (41.887221537144235, -87.75782634562349) | 2022 | 1 | 11 | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1598 | 6729116 | 08/31/2022 | 01/10/2022 | VT1010 | COMPLIED | 08/31/2022 | ARRANGE COMPLETED PERMIT INSP. | NaN | NaN | Arrange mechanical ventilation or warm air hea... | ... | AVE | 156823 | NaN | 41.947644 | -87.781798 | (41.947644039902265, -87.78179844629967) | 2022 | 1 | 10 | NaN |
1618 | 6730385 | 11/21/2022 | 01/11/2022 | VT1010 | COMPLIED | 11/18/2022 | ARRANGE COMPLETED PERMIT INSP. | NaN | NaN | Arrange mechanical ventilation or warm air hea... | ... | ST | 286211 | NaN | 41.841823 | -87.701135 | (41.84182295978573, -87.70113477596593) | 2022 | 1 | 11 | NaN |
1621 | 6726984 | 12/07/2022 | 01/03/2022 | PL151137 | COMPLIED | 12/07/2022 | OPEN | NaN | NaN | NaN | ... | AVE | 60561 | NaN | 41.904908 | -87.676385 | (41.904907565709664, -87.67638534234142) | 2022 | 1 | 3 | NaN |
1627 | 6728833 | 11/15/2022 | 01/07/2022 | VT1010 | OPEN | NaN | ARRANGE COMPLETED PERMIT INSP. | NaN | NaN | Arrange mechanical ventilation or warm air hea... | ... | AVE | 117688 | NaN | 41.933936 | -87.715618 | (41.933936312229164, -87.71561838752238) | 2022 | 1 | 7 | NaN |
1774 | 6728033 | 05/25/2023 | 01/03/2022 | EV1111 | OPEN | NaN | MAINTAIN OR REPAIR HYDRO ELEVA | NaN | WORKING DOOR RESTRICTOR REQIRED. | Failed to maintain hydraulic elevator equipmen... | ... | BLVD | 14097 | NaN | 42.018888 | -87.684722 | (42.01888785624424, -87.68472174840898) | 2022 | 1 | 3 | WORKING DOOR RESTRICTOR REQUIRED. |
112 rows × 30 columns
112 rows were changed. Hmm. So what did this do to the number of unique comments? We expect something like 100 fewer unique comments.. right?
print("Records: ", len(violations_df))
print("Blank records: ", violations_df["VIOLATION INSPECTOR COMMENTS"].isnull().sum())
print("Distinct records: ", len(violations_df["VIOLATION INSPECTOR COMMENTS"].value_counts()) )
print("Distinct records fixed:", len(violations_df["COMMENTS FIXED"].value_counts()))
Records: 1832
Blank records: 98
Distinct records: 1580
Distinct records fixed: 1566
This is a little puzzling. The renaming changed 112 rows but only reduced the number of unique entries by 14.
This is not what I was expecting; something is not working quite as it should. Let’s look at some of the 112
rows that were changed more closely:
changed = violations_df.query("`COMMENTS FIXED` != `VIOLATION INSPECTOR COMMENTS`")
changed[["VIOLATION INSPECTOR COMMENTS", "COMMENTS FIXED"]]
VIOLATION INSPECTOR COMMENTS | COMMENTS FIXED | |
---|---|---|
5 | NaN | NaN |
22 | NaN | NaN |
23 | STOVE HEAT/SPACE HEATER. | STOVE HEAT/SPACE HEATERS. |
24 | NaN | NaN |
25 | NaN | NaN |
... | ... | ... |
1598 | NaN | NaN |
1618 | NaN | NaN |
1621 | NaN | NaN |
1627 | NaN | NaN |
1774 | WORKING DOOR RESTRICTOR REQIRED. | WORKING DOOR RESTRICTOR REQUIRED. |
112 rows × 2 columns
This is disappointing. The 112 includes NaNs, since NaN does not match with NaN. Show only the non-NaN rows:
changed[~changed["COMMENTS FIXED"].isnull()][["VIOLATION INSPECTOR COMMENTS", "COMMENTS FIXED"]]
VIOLATION INSPECTOR COMMENTS | COMMENTS FIXED | |
---|---|---|
23 | STOVE HEAT/SPACE HEATER. | STOVE HEAT/SPACE HEATERS. |
34 | STOP WORK UNTIL NEW PERMIT IS OBTAINED SECTION... | STOP WORK UNTIL PERMIT IS OBTAINED SECTION 14A... |
168 | AT BASEMENT. | AT BASEMENT |
204 | CHIMNEY - MISSING CAP | CHIMNEY - MISSING CAP |
529 | AMAINTENANCE CONTROL PROGRAM IS REQUIRED TO BE... | A MAINTENANCE CONTROL PROGRAM IS REQUIRED TO B... |
777 | USING COOKING STOVE AS HEATING DEVICE | USING COOKING STOVE AS HEATING DEVICE. |
802 | MISSING SMOKE DETECTOR | MISSING SMOKE DETECTORS. |
876 | A CODE COMPLIANT PIT LADDER IS REQUIRED. | CODE COMPLIANT PIT LADDER REQUIRED. |
898 | CHIMNEYS - MISSING CAPS | CHIMNEY - MISSING CAP |
1104 | STOP WORK UNTIL A PERMIT IS OBTAINED SECTION 1... | STOP WORK UNTIL PERMIT IS OBTAINED SECTION 14A... |
1188 | EMPLOY LICENSED AND BONDED PLUMBING CONTRACTOR... | EMPLOY LICENSED AND BONDED PLUMBING CONTRACTOR... |
1201 | MISSING SMOKE DETECTOR. | MISSING SMOKE DETECTORS. |
1485 | STOP ALL WORK ON PREMISES UNTIL APPROVED PLANS... | STOP ALL WORK ON PREMISES UNTIL APPROVED PLANS... |
1774 | WORKING DOOR RESTRICTOR REQIRED. | WORKING DOOR RESTRICTOR REQUIRED. |
These are the 14 that were changed.
Fix doesn’t fix enough comments#
Something doesn’t quite add up here. Let me count the number of pairs that were found that were close enough to be merged.
renamedict = {}
n=0
for i in range(len(distinctcomments)):
for j in range(len(distinctcomments)):
d = distance(distinctcomments[i], distinctcomments[j])
# Find all the pairs that have Levenshtein distance <=5
if d <= 5 and d !=0 :
n = n+1
# If comment i is used fewer times than j
if distinctcommentnumber[i] < distinctcommentnumber[j] :
# Flag comment i to be changed to comment j
renamedict[distinctcomments[i]] = distinctcomments[j]
renamedict, len(renamedict), n
({'CHIMNEY - MISSING CAP': 'CHIMNEY - MISSING CAP',
'AMAINTENANCE CONTROL PROGRAM IS REQUIRED TO BE ON SITE.': 'A MAINTENANCE CONTROL PROGRAM IS REQUIRED TO BE ON SITE.',
'MISSING SMOKE DETECTOR': 'MISSING SMOKE DETECTORS.',
'USING COOKING STOVE AS HEATING DEVICE': 'USING COOKING STOVE AS HEATING DEVICE.',
'CHIMNEYS - MISSING CAPS': 'CHIMNEY - MISSING CAP',
'A CODE COMPLIANT PIT LADDER IS REQUIRED.': 'CODE COMPLIANT PIT LADDER REQUIRED.',
'STOP WORK UNTIL A PERMIT IS OBTAINED SECTION 14A-4-411': 'STOP WORK UNTIL PERMIT IS OBTAINED SECTION 14A-4-411',
'EMPLOY LICENSED AND BONDED PLUMBING CONTRACTOR FOR PLUMBING WORK TO BE PERFORMED TO ADDRESS PLUMBING VIOLATIONS.': 'EMPLOY LICENSED AND BONDED PLUMBING CONTRACTOR FOR PLUMBING WORK TO BE PERFORMED TO ADDRESS PLUMBING VIOLATIONS',
'MISSING SMOKE DETECTOR.': 'MISSING SMOKE DETECTORS.',
'STOP ALL WORK ON PREMISES UNTIL APPROVED PLANS AND BUILDING PERMIT AE OBTAINED 13-12-080 13-32-035': 'STOP ALL WORK ON PREMISES UNTIL APPROVED PLANS AND BUILDING PERMIT ARE OBTAINED 13-12-080 13-32-035',
'WORKING DOOR RESTRICTOR REQIRED.': 'WORKING DOOR RESTRICTOR REQUIRED.',
'STOP WORK UNTIL NEW PERMIT IS OBTAINED SECTION 14A-4-411': 'STOP WORK UNTIL PERMIT IS OBTAINED SECTION 14A-4-411',
'AT BASEMENT.': 'AT BASEMENT',
'STOVE HEAT/SPACE HEATER.': 'STOVE HEAT/SPACE HEATERS.'},
14,
116)
There were 116 pairs of comments (out of 3.3 million) that were closer than d = 5 but the loop only left instructions to rename 14 comments? I need to look closely at the logic here:
Let’s talk through the logic here.
Which comments get renamed? Comments that are more than 0 but 5 or fewer edits away from another comment, and for which the other comment has strictly greater abundance.
Aha. For this dataset, most of the comments are unique, and most of the pairs of similar comments have both comments unique. Comments that could be merged, but that are present only once are ignored.
Let us add logic that merges comments even if they have the same number of occurrences. We need an arbitrary rule to pick one comment when the counts are tied: let us choose the one that is first in alphabetical order. This is a canonicalization choice: we introduce an arbitrary choice between two similar comments to allow consolidation without having to make a choice that requires human input, as choosing the correct spelling would.
renamedict = {}
n=0
for i in range(len(distinctcomments)):
for j in range(len(distinctcomments)):
d = distance(distinctcomments[i], distinctcomments[j])
# Find all the pairs that have Levenshtein distance <=5
if d <= 5 and d !=0 :
n = n+1
# If comment i is used fewer times than j
if distinctcommentnumber[i] < distinctcommentnumber[j] :
# Flag comment i to be changed to comment j
renamedict[distinctcomments[i]] = distinctcomments[j]
# If two comments have the same number of counts:
if distinctcommentnumber[i] == distinctcommentnumber[j]:
# Rename if j comes first in lexicographic (alphabetical) order:
if distinctcomments[j] < distinctcomments[i]:
renamedict[distinctcomments[i]] = distinctcomments[j]
renamedict, len(renamedict), n
({'WEST ELEVATION / DOWNSPOUT - MISSING': 'SOUTH ELEVATION / DOWNSPOUT - MISSING.',
'REAR EXTERIOR STAIRS,GRADE TO BASEMENT-MISSING HAND RAIL.': 'FRONT EXTERIOR STAIRS-GRADE TO BASEMENT-MISSING HAND RAIL.',
'INTERIOR OF BUILDING - ROACH INFESTATION': 'INTERIOR OF BUILDING - MICE INFESTATION',
'WEST CORNICE - WASHED OUT MORTAR.': 'NORTH CORNICE - WASHED OUT MORTAR.',
'CHIMNEY - MISSING CAP': 'CHIMNEY - MISSING CAP',
'PERFORM CAT5 SAFETY TEST AND LEAVE CITY DOCUMENTATION.': 'PERFORM CAT1 SAFETY TEST AND LEAVE CITY DOCUMENTATION',
'APT 1W / UNAPPROVE HEATING DEVICE = ELECT HEATER USED FOR HEAT . PER SECTION CODE . (14X-7-705.4 BHA CODES': 'APT 103 / UNAPPROVE HEATING DEVICE - ELECT HEATER USED FOR HEAT . PER SECTION CODE . (14X-7-705.4 BHA CODES',
'APT 3 / SAFE WORK CONDITIONS - BOILER NOT WORKING . PER SECTION CODE (14X-3-305.1, 14X-8-804.1, 14X-8-804.4 AND 14X-8-804.5) BHA CODES': 'APT 103 / SAFE WORK CONDITIONS - BOILER NOT WORKING . PER SECTION CODE 14X-3-305.1, 14X-8-804.1, 14X-8-804.4 AND 14X-8-804.5) BHA CODES',
'REAR PORCHES - STORING BBQ GRILLS.': 'REAR PORCH - STORING BBQ GRILLS.',
'AMAINTENANCE CONTROL PROGRAM IS REQUIRED TO BE ON SITE.': 'A MAINTENANCE CONTROL PROGRAM IS REQUIRED TO BE ON SITE.',
'WEST WINDOW SILLS - WASHED OUT MORTAR.': 'EAST WINDOW SILLS - WASHED OUT MORTAR.',
'914 / APT 1 / SMOKE DETECTOR - MISSING. CB197019 - SMOKE ALARMS / DETECTORS - INSTALL (14X-5-504.8, 14X-5-504.8.4,14B-9-907.2.10.1 THROUGH 14B-9-907.2.10.7, AND 14R-11-1104)': '2343 / APT 1 / SMOKE DETECTOR - MISSING. CB197019 - SMOKE ALARMS / DETECTORS - INSTALL (14X-5-504.8, 14X-5-504.8.4,14B-9-907.2.10.1 THROUGH 14B-9-907.2.10.7, AND 14R-11-1104)',
'(PREVIOUSLY CITED ON 9/7/16) CODE DATA PLATE REQUIRED ON CONTROLLER. BOTH ELEVATORS.': '(PREVIOUSLY CITED ON 8/14/18) CODE DATA PLATE REQUIRED ON CONTROLLER. BOTH ELEVATORS',
'MISSING SMOKE DETECTOR': 'MISSING SMOKE DETECTORS.',
'USING COOKING STOVE AS HEATING DEVICE': 'USING COOKING STOVE AS HEATING DEVICE.',
'PROVIDE ELECTRICAL PERMIT IN ORDER TO CORRECT ALL ELECTRICAL VIOLATIONS (14A-4-404.6)': 'PROVIDE ELECTRICAL PERMIT IN ORDER TO CORRECT ALL ELECTRICAL VIOLATIONS (14A-4-404.6)',
'CHIMNEYS - MISSING CAPS': 'CHIMNEY - MISSING CAP',
'CHIMNEY - FRACTURES ,LOOSE MORTAR': 'CHIMNEY - FRACTURES , LOOSE MORTAR',
'A CODE COMPLIANT PIT LADDER IS REQUIRED.': 'CODE COMPLIANT PIT LADDER REQUIRED.',
'BUILDING INFESTED WITH ROACHES, INCLUDING APT. 2E.': 'BUILDING INFESTED WITH MICE, INCLUDING APT. 2E.',
'CHIMNEY - FRACTURES WITH LOOSE MORTAR': 'CHIMNEY - FRACTURES AND LOOSE MORTAR',
'CHIMNEY - FRACTURES AND LOOSE MORTAR': 'CHIMNEY - FRACTURES ,LOOSE MORTAR',
'INTERIOR OF BUILDING /NO RESPONSE. UNVERIFIED DETECTORS AND CONDITIONS. UNVERIFIED OCCUPANCY.': 'INTERIOR OF BUILDING /NO RESPONSE. UNVERIFIED DETECTORS AND CONDITIONS. UNVERFIED OCCUPANCY.',
'EAST PASSENGER: REPAIR EMERGENCY PHONE': 'EAST PASSENGER: REPAIR EMERGENCY ALARM',
'STOP WORK UNTIL A PERMIT IS OBTAINED SECTION 14A-4-411': 'STOP WORK UNTIL PERMIT IS OBTAINED SECTION 14A-4-411',
'MISSING CARBON MONOXIDE DETECTOR.': 'MISSING CARBOB MONOXIDE DETECTOR',
'EMPLOY LICENSED AND BONDED PLUMBING CONTRACTOR FOR PLUMBING WORK TO BE PERFORMED TO ADDRESS PLUMBING VIOLATIONS.': 'EMPLOY LICENSED AND BONDED PLUMBING CONTRACTOR FOR PLUMBING WORK TO BE PERFORMED TO ADDRESS PLUMBING VIOLATIONS',
'MISSING SMOKE DETECTOR.': 'MISSING SMOKE DETECTOR',
'ERECTING NEW INTERIOR PARTITIONS': 'ERECTED NEW INTERIOR PARTITIONS',
'STENCIL PIT FLOOR INFRONT OF COUNTERWEIGHT/PROVIDE SIGNAGE ON PIT WALL RUNBY/DANGER': 'STENCIL PIT FLOOR IN FROT OF COUNTERWEIGHT/PROVIDE SIGNAGE ON PIT WALL RUNBY/DANGER',
'STOP ALL WORK ON PREMISES UNTIL A BUILDING PERMIT IS OBTAINED 14A-3-306-4 14A-4-411-1': 'STOP ALL WORK ON PREMISES UNTIL A BUILDING PERMIT IS OBTAINED 14A-3-306-1 14A-4-411-1',
'WORKING PIT LIGHTING AND ACCESSIBLE LIGHT SWITCH REQUIRED.': 'WORKING PIT LIGHTING AND ACCESSIBLE LIGHT SWITCH IS REQUIRED.',
'STOP ALL WORK ON PREMISES UNTIL APPROVED PLANS AND BUILDING PERMIT AE OBTAINED 13-12-080 13-32-035': 'STOP ALL WORK ON PREMISES UNTIL APPROVED PLANS AND BUILDING PERMIT ARE OBTAINED 13-12-080 13-32-035',
'ELEVATORS #4 & #5: PERFORM CAT1 & CAT5 TEST. PROVIDE TEST SHEET AND TAGS ONCE COMPLETED.': 'ELEVATORS #4 & #5: PERFORM CAT! & CAT5 TEST. PROVIDE TEST SHEET AND TAGS ONCE COMPLETED.',
'NEXT CATEGORY 5 TEST DUE BY 06/02/2022.': 'NEXT CATEGORY 1 TEST DUE BY 02/04/2022.',
'CODE DATA TAG ON CONTROLLER': 'CODE DATA TAG ON CONTROLER',
'CAT. 5 TEST OVERDUE.': 'CAT. 1 TEST OVERDUE.',
'#15 & #16 ELEVATORS MUST BE REPRESENTED ON THE ELEVATOR CONTROL PANEL (LOBBY PANEL) INCLUDING AUXILARY PHASE I KEY SWITCH(TO RECALL #15, #16, AND #17 ALL SHARING THE SAME MACHINE ROOM), POSITION INDICATOR, INDICATOR OF OPERATING MODES, AND REPRESENTATION ON THE ETCHED DEPICTION (BUILDING LAYOUT MAP) OF THE BUILDING WITH ALL OF THE ELEVATORS WITHIN': '#15 & #16 ELEVATORS MUST BE REPRESENTED ON THE ELEVATOR CONTROL PANEL (LOBBY PANEL) INCLUDING AUXILARY PHASE I KEY SWITCH(TO RECALL #15, #16, AND #17 ALL SHARING THE SAME MACHINE ROOM), POSITION INDICATOR, INDICATOR OF OPERATING MODES, AND REEPRESENTATION ON THE ETCHED DEPICTION (BUILDING LAYOUT MAP) OF THE BUILDING WITH ALL OF THE ELEVATORS WITHIN.',
'APT 2R / CARBON MONOXIDE - MISSING . PER SECOND CODE 14X-5-504.9, 14B-9-915.1.1, 14B-9-915.2 , AND 14R-11-1105)': 'APT 1 / CARBON MONOXIDE - MISSING . PER SECTION CODE 14X-5-504.9, 14B-9-915.1.1, 14B-9-915.2 , AND 14R-11-1105)',
'PERFORM CAT5 SAFETY TEST LEAVE CITY OF CHICAGO DOCUMENTATION.TRACTION ELEVATOR.': 'PERFORM CAT1 SAFETY TEST LEAVE CITY OF CHICAGO DOCUMENTATION.TRACTION ELEVATOR.',
'PROVIDE MAINTENANCE CONTROL PROGRAM.': 'PROVIDE MAINTENANCE CONTROL PROGRAM',
'SOUTH ELEVATION / CORNICE - OPEN MORTAR JOINT.': 'SOUTH ELEVATION / COPING - OPEN MORTAR JOINTS.',
'SOUTH ELEVATION / WINDOW SILLS - OPEN MORTAR JOINTS.': 'SOUTH ELEVATION / WINDOW SILLS - OPEN MORTAR JOINT.',
'REPAIR EMERGENCY PHONES': 'REPAIR BOTH EMERGENCY PHONES',
'WORKING PIT LIGHTING AND AN ACCESSIBLE LIGHT SWITCH IS REQUIRED.': 'WORKING PIT LIGHTING AND ACCESSIBLE LIGHT SWITCH IS REQUIRED.',
'WORKING DOOR RESTRICTOR REQIRED.': 'WORKING DOOR RESTRICTOR REQUIRED.',
'WEST ELEVATION / EXTERIOR WALL - MASONRY FRACTURES.': 'NORTH ELEVATION / EXTERIOR WALL - MASONRY FRACTURES.',
'STOP WORK UNTIL NEW PERMIT IS OBTAINED SECTION 14A-4-411': 'STOP WORK UNTIL A PERMIT IS OBTAINED SECTION 14A-4-411',
'AT BASEMENT.': 'AT BASEMENT',
'REAR YARD - ABANDONED CARS': 'REAR YARD - ABANADONED CAR.',
'STOVE HEAT/SPACE HEATER.': 'STOVE HEAT/SPACE HEATERS.'},
51,
116)
This is more believable. The 116 pairs should represent 56 real pairs of comments, and some comments might have more than one partner, so there are only 51 renaming rules.
Let us apply this (larger!) comment-renaming fix, and see if it puts a larger dent in the number of unique comments:
violations_df["COMMENTS FIXED2"] = violations_df["VIOLATION INSPECTOR COMMENTS"].apply(renamecomment)
print("Records: ", len(violations_df))
print("Distinct records: ", len(violations_df["VIOLATION INSPECTOR COMMENTS"].value_counts()) )
print("Distinct records fixed:", len(violations_df["COMMENTS FIXED2"].value_counts()))
Records: 1832
Distinct records: 1580
Distinct records fixed: 1534
Now, although there were 51 comments slated to be renamed, it only reduced the number of unique comments from 1580 to 1544 (by 47). The comments are still 5.3% not-a-number, but now 11.0% have non-unique comments.
So we were able to “clean up” 54 out of 1832 (2.9%) of the comment records. The effect on the unique comments is a bit more dramatic than the effect on distinct comments. Unique comments dropped from
print ("Unique comments before", violations_df["VIOLATION INSPECTOR COMMENTS"].value_counts().value_counts()[1])
print ("Unique comments after ", violations_df["COMMENTS FIXED2"].value_counts().value_counts()[1])
print ("Out of ", len(violations_df))
Unique comments before 1451
Unique comments after 1377
Out of 1832
The renaming-cleaning procedure reduced the fraction of the data in unique-comment categories by 4%. These are records that can presumably be better matched with other rows with similar content.
Clean data#
This exercise tried to clean up (meaningless) typographic differences in comment fields; in real-world problems, you are likely to encounter two databases with sometimes-matching and sometimes-mismatching data on more than one field. You might have a database of people from the HR department and another from the directory, or might have a database of addresses from a municipal database and a commercial database.
Some fields match more reliably than others; if your job is to somehow retrieve the most likely N matches to a test query, your solution will likely try to match on high-quality identifiers first (national ID numbers for people, FIPS codes for places, transaction ID numbers where appropriate) and attempt fuzzy matching only after exhausting the pool of more-likely-correct hits on identifiers, using a score that is a mixture of fuzzy-matching goodness-of-match scores from different fields.