
How can I match entries in pandas dataframes using multiple criteria and fuzzy logic?

Thank you for your help. I believe that this is a common problem, but I'm unable to find a solution on SO that addresses this particular form of it. I'm a newer programmer and deeply appreciative of any assistance.

I have two sets of data on healthcare companies. Data in df1 is messy and contains null values, while data in df2 is a lot more complete.

I need to match the companies in df1 and df2, determine whether there is a match and, if it is not a direct match, how close a match it is. Both sets contain tens of thousands of companies and are updated daily, so I'm trying to build something that scales.

Here is a reproducible program of what I've tried so far:

import pandas as pd
from fuzzywuzzy import process

data1 = [['1001', 'Lutheran Family Hospital', 'Omaha', 'NE'],
         ['1020', 'Lutheran Family Hospital', 'Dallas', 'TX'],
         ['1021', 'Lutheran Regional Family Hospital', 'Des Plaines', 'IL'],
         ['1002', 'Independent Health', 'Fairbanks', 'AK'],
         ['1003', 'Lucky You Community Clinic', '', ''],
         ['1004', 'Belmont General Hospital', 'Belmont', 'CA'],
         ['1005', 'Louisiana Chiro', 'Lafayette', 'LA'],
         ['1006', 'Steven, Even', 'Chicago', 'IL'],
         ['1007', 'Kind Kare 4 Kids', 'New Mexico', 'New Mexico'],
         ['1008', 'Independence Mem', '', ''],
         ['1009', 'Gerald Griffin Health', 'Missoula', 'Montana'],
         ['1010', 'INTERNAL MED', 'CHARLESTON', 'SC'],
         ['1011', 'Belmont Hospital', '', ''],
         ['1012', 'Belmont Gnrl', 'Belmont', 'CA'],
         ['1013', 'St Mary Rehab', '', ''],
         ['1014', 'Saint Mary Med Center', 'Los Angeles', 'California'],
         ['1025', "St. Mary's Of Lourdes Regional Medical Center", 'Lincoln', 'NE'],
         ['1015', 'Bryan Bennington, MD', 'Huntsville', 'AL']]

data2 = [['1', 'Lutheran General Hospital', 'Fort Wayne', 'IN'],
         ['2', 'Lutheran Family Hospital', 'Omaha', 'NE'],
         ['3', 'Independence Memorial Health', 'Fairbanks', 'AK'],
         ['4', 'Lucky-You Community Clinic', 'New York', 'NY'],
         ['5', 'Belmont General Hospital', 'Belmont', 'CA'],
         ['6', 'Lafayette Joints R Us (DBA Louisiana Best Chiropractic)', 'Lafayette', 'LA'],
         ['7', 'Even Steven, MD', 'Chicago', 'IL'],
         ['8', 'Kind Kare 4 Kids, LLC Inc (FKA The Kindest Care)', 'Albequerque', 'NM'],
         ['9', 'The Best Doctor Group', 'Philadelphia', 'PA'],
         ['10', 'Internal Medical Group, PLLC', 'Charleston', 'SC'],
         ['11', "Saint Mary's Holy Name Rehabilitation", 'Lexington', 'KY'],
         ['12', 'St. Mary Regional Medical Center', 'Los Angeles', 'CA'],
         ['13', 'Advanced Outpatient Surgical Center', 'Seattle', 'WA']]

df1 = pd.DataFrame(data1, columns=['ID', 'Org_Name', 'City', 'State'])
df2 = pd.DataFrame(data2, columns=['ID', 'Org_Name', 'City', 'State'])

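scorethreshold = 80
# Start with an empty object column; unmatched rows keep None.
df1["fuzzy"] = None
for i, x in enumerate(df1.Org_Name):
    # With a Series as choices, extractOne returns (match, score, index label).
    noun, score, record = process.extractOne(x, df2.Org_Name)
    if score > scorethreshold:
        df1.loc[i, 'fuzzy'] = noun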

The above produces the following result:

+----+------+-----------------------------------------------+-------------+------------+---------------------------------------------------------+
|    |  ID  |                   Org_Name                    |    City     |   State    |                          fuzzy                          |
+----+------+-----------------------------------------------+-------------+------------+---------------------------------------------------------+
|  0 | 1001 | Lutheran Family Hospital                      | Omaha       | NE         | Lutheran Family Hospital                                |
|  1 | 1020 | Lutheran Family Hospital                      | Dallas      | TX         | Lutheran Family Hospital                                |
|  2 | 1021 | Lutheran Regional Family Hospital             | Des Plaines | IL         | Lutheran Family Hospital                                |
|  3 | 1002 | Independent Health                            | Fairbanks   | AK         | Independence Memorial Health                            |
|  4 | 1003 | Lucky You Community Clinic                    |             |            | Lucky-You Community Clinic                              |
|  5 | 1004 | Belmont General Hospital                      | Belmont     | CA         | Belmont General Hospital                                |
|  6 | 1005 | Louisiana Chiro                               | Lafayette   | LA         | Lafayette Joints R Us (DBA Louisiana Best Chiropractic) |
|  7 | 1006 | Steven, Even                                  | Chicago     | IL         | Even Steven, MD                                         |
|  8 | 1007 | Kind Kare 4 Kids                              | New Mexico  | New Mexico | Kind Kare 4 Kids, LLC Inc (FKA The Kindest Care)        |
|  9 | 1008 | Independence Mem                              |             |            | Independence Memorial Health                            |
| 10 | 1009 | Gerald Griffin Health                         | Missoula    | Montana    |                                                         |
| 11 | 1010 | INTERNAL MED                                  | CHARLESTON  | SC         | Internal Medical Group, PLLC                            |
| 12 | 1011 | Belmont Hospital                              |             |            | Lutheran General Hospital                               |
| 13 | 1012 | Belmont Gnrl                                  | Belmont     | CA         | Belmont General Hospital                                |
| 14 | 1013 | St Mary Rehab                                 |             |            | Saint Mary's Holy Name Rehabilitation                   |
| 15 | 1014 | Saint Mary Med Center                         | Los Angeles | California | Saint Mary's Holy Name Rehabilitation                   |
| 16 | 1025 | St. Mary's Of Lourdes Regional Medical Center | Lincoln     | NE         | St. Mary Regional Medical Center                        |
| 17 | 1015 | Bryan Bennington, MD                          | Huntsville  | AL         |                                                         |
+----+------+-----------------------------------------------+-------------+------------+---------------------------------------------------------+

However, I'm trying to create something whereby I can determine whether not only company names match, but cities and states match too, and how closely all of this matches. I'm trying to create an output more like this, where Fuzzy_ID refers to the index location of the matching entry, and Matched? refers to a Boolean judgment:

+---+------+-----------------------------------+-------------+-------+----------+------------+----------+
|   |  ID  |             Org_Name              |    City     | State | Fuzzy_ID |   Score    | Matched? |
+---+------+-----------------------------------+-------------+-------+----------+------------+----------+
| 0 | 1001 | Lutheran Family Hospital          | Omaha       | NE    |        2 | 100        | YES      |
| 1 | 1020 | Lutheran Family Hospital          | Dallas      | TX    |        2 | some_score | NO       |
| 2 | 1021 | Lutheran Regional Family Hospital | Des Plaines | IL    |        2 | some_score | NO       |
| 3 | 1002 | Independent Health                | Fairbanks   | AK    |        3 | some_score | YES      |
| 4 | 1003 | Lucky You Community Clinic        |             |       |        4 | some_score | YES      |
+---+------+-----------------------------------+-------------+-------+----------+------------+----------+

How can this be accomplished? What methods exist that are better suited to what needs to be accomplished? Very grateful for any help provided.

This task is quite difficult and involves a number of steps, but I can at least lay out some general principles.

Start by tidying up the State column. Wherever it contains the full name of a state, replace it with the two-letter state code.
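A minimal sketch of that cleanup step, in plain Python so it can be tried on its own. The names STATE_CODES and normalize_state are made up for illustration, and the lookup table is deliberately partial: it covers only the full state names that appear in the sample data.

```python
# Hypothetical, partial lookup for illustration; a real table would
# list every state (and any spelling variants found in the data).
STATE_CODES = {
    "california": "CA",
    "montana": "MT",
    "new mexico": "NM",
}


def normalize_state(value):
    """Return a two-letter code where one can be derived.

    Values that are not recognised are returned stripped but otherwise
    unchanged, so nothing is silently discarded.
    """
    cleaned = value.strip()
    # Already a two-letter code: just normalise the case.
    if len(cleaned) == 2 and cleaned.isalpha():
        return cleaned.upper()
    # Otherwise try the full-name lookup, ignoring case.
    return STATE_CODES.get(cleaned.lower(), cleaned)


examples = [normalize_state(s) for s in ["Montana", "ne", "New Mexico", ""]]
```

On a dataframe this could be applied with df1["State"] = df1["State"].map(normalize_state), and likewise for df2.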

As a further cleaning step, it may also be worth spending some time resolving the rows in df1 that have no state at all.

Then, for each row in df1, attempt to find the best matching row in df2. To do it, use the following procedure:

  1. Using process.extract (or process.extractBests, which also accepts score_cutoff), find in df2 a pool of the best matches by name for the current row, assuming some values for limit and score_cutoff. If the row contains a state, check only the rows in df2 from that state. Save the match ratio of each match found as name_ratio.

  2. For each item in that pool, compute WRatio on the City column and save it as city_ratio.

  3. Use some aggregation formula to compute a total_ratio for each match from its name_ratio and city_ratio. I'm not sure what the best form for this formula is.

  4. Take the match with the highest total_ratio, but if this best ratio is below some total_ratio_cutoff, assume that the current row has no match.
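The four steps can be sketched as follows. This is only an illustration under several assumptions: rows are plain dicts (what df.to_dict("records") produces), the standard library's difflib.SequenceMatcher stands in for fuzzywuzzy's scorer so that the sketch runs without third-party packages, and the aggregation is a simple weighted mean, which is one possible choice rather than a recommended formula. The names simple_ratio and best_match, and all default parameter values, are hypothetical.

```python
from difflib import SequenceMatcher


def simple_ratio(a, b):
    """Stand-in scorer on a 0-100 scale (fuzz.WRatio could be passed instead)."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_match(row, candidates, scorer=simple_ratio, limit=5,
               name_cutoff=50, name_weight=0.7, total_cutoff=70):
    """Return (candidate, total_ratio); candidate is None when no match."""
    # Step 1: restrict to the same state when the row has one, then keep
    # the `limit` best name matches scoring at least `name_cutoff`.
    if row["State"]:
        candidates = [c for c in candidates if c["State"] == row["State"]]
    pool = [(scorer(row["Org_Name"], c["Org_Name"]), c) for c in candidates]
    pool = sorted((p for p in pool if p[0] >= name_cutoff),
                  key=lambda p: p[0], reverse=True)[:limit]

    best, best_total = None, 0.0
    for name_ratio, cand in pool:
        if row["City"] and cand["City"]:
            # Steps 2-3: score the city, then combine with a weighted mean.
            city_ratio = scorer(row["City"], cand["City"])
            total = name_weight * name_ratio + (1 - name_weight) * city_ratio
        else:
            # No city to compare: fall back to the name score alone.
            total = name_ratio
        if total > best_total:
            best, best_total = cand, total

    # Step 4: reject the best candidate if it is still below the cutoff.
    if best_total < total_cutoff:
        return None, best_total
    return best, best_total


df2_rows = [
    {"ID": "1", "Org_Name": "Lutheran General Hospital", "City": "Fort Wayne", "State": "IN"},
    {"ID": "2", "Org_Name": "Lutheran Family Hospital", "City": "Omaha", "State": "NE"},
    {"ID": "5", "Org_Name": "Belmont General Hospital", "City": "Belmont", "State": "CA"},
]
row = {"ID": "1001", "Org_Name": "Lutheran Family Hospital", "City": "Omaha", "State": "NE"}
match, score = best_match(row, df2_rows)  # match is the df2 row with ID "2"
```

Because the score is returned even when the candidate is rejected, the Score and Matched? columns from the question can both be filled from the result. Passing scorer=fuzz.WRatio (from fuzzywuzzy import fuzz) would swap in the scorer named in step 2; the cutoffs would then need retuning, since the two scorers are not on comparable scales.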

Of course, it is up to you to experiment with the values of these parameters and see how changing them affects the final result.
