
How to do a fuzzy-match merge based on a few columns

I have web scraped some store info from two websites and thus have two dataframes, which I'd like to merge into one complete dataframe. I need to match rows on at least two columns, such as store code and name. The example datasets look like this:

df1:

store code  name      phone         email         website
A           KFC       111-111-1111  asdsa@as.com  aaaaa.com
A3          Mc
B1          KFC       222-222-2222

df2:

store code2  name2     phone2        email2          website2
A            Kfc       +1111111111   asdsa@as.com    aaaaa.com
A            Pizzahut
B1           KFC       +2222222222   qwerty@kfc.com

What I want may look like this:

store code  name      phone         email           website
A           KFC       111-111-1111  asdsa@as.com    aaaaa.com
A           Pizzahut
A3          Mc
B1          KFC       +2222222222   qwerty@kfc.com
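To make the snippets in the answers reproducible, the two example tables can be rebuilt as dataframes. This is a sketch with None standing in for the blank cells; it also renames df2's suffixed columns to match df1, since the solutions below assume both frames share the same column names:

```python
import pandas as pd

# Sample data reconstructed from the question (blank cells as None)
df1 = pd.DataFrame({
    "store code": ["A", "A3", "B1"],
    "name": ["KFC", "Mc", "KFC"],
    "phone": ["111-111-1111", None, "222-222-2222"],
    "email": ["asdsa@as.com", None, None],
    "website": ["aaaaa.com", None, None],
})

df2 = pd.DataFrame({
    "store code2": ["A", "A", "B1"],
    "name2": ["Kfc", "Pizzahut", "KFC"],
    "phone2": ["+1111111111", None, "+2222222222"],
    "email2": ["asdsa@as.com", None, "qwerty@kfc.com"],
    "website2": ["aaaaa.com", None, None],
})

# Strip the "2" suffix so both frames use identical column names
df2.columns = df1.columns
```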

Solution one:

If your data is as clean as you claim (there are no typos in the names in the example), then you can do this:

# Clean up the capitalization differences
df1["name"] = df1["name"].str.lower()
df2["name"] = df2["name"].str.lower()

# DataFrame.append was removed in pandas 2.0; use pd.concat instead
df_total = pd.concat([df1, df2], ignore_index=True)

df_total = df_total.groupby(["store code", "name"]).first()
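The reason this works is that groupby(...).first() returns the first non-null value per column within each group, so two partial rows for the same store fill in each other's gaps. A minimal toy example:

```python
import pandas as pd

# Two partial rows for the same key, each missing a different field
df = pd.DataFrame({
    "key": ["a", "a"],
    "phone": ["111", None],
    "email": [None, "x@y.com"],
})

# first() takes the first non-null value in each group, per column
merged = df.groupby("key").first()
print(merged.loc["a", "phone"], merged.loc["a", "email"])  # 111 x@y.com
```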

Solution two (if your string values contain typos):

If there are typos in the names and you want to merge them using fuzzy matching, then follow these steps:

  1. We need these libraries to help us:

import pandas as pd
import networkx as nx
from fuzzywuzzy import fuzz
import itertools

Let's match the case of the names so we are on the safe side:

df1["name"] = df1["name"].str.lower()
df2["name"] = df2["name"].str.lower()

Then let's start matching!

We need to build all combinations of the names from the two dataframes (the source) and turn them into a dataframe, so we can use apply, which is much faster than a for loop:

combs = list(itertools.product(df1["name"], df2["name"]))
combs = pd.DataFrame(combs)

Then we score each combination. WRatio will do just fine, but you can plug in your own custom matching function:

combs['score'] = combs.apply(lambda x: fuzz.WRatio(x[0],x[1]), axis=1)
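WRatio returns a similarity score from 0 to 100. If you cannot install fuzzywuzzy, the standard library's difflib gives a comparable ratio; this is a sketch of a drop-in scorer, not the answer's original one:

```python
from difflib import SequenceMatcher

def simple_score(a, b):
    """Similarity on a 0-100 scale, loosely comparable to fuzz.ratio."""
    return round(SequenceMatcher(None, a, b).ratio() * 100)

print(simple_score("kfc", "kfc"))        # 100
print(simple_score("pizzahut", "pizza hut"))
```

You could then pass simple_score in place of fuzz.WRatio in the apply above.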

Now, let's build a graph out of it. I used a minimum score of 90 as the criterion; you can use whichever threshold suits you best:

threshold = 90
G_name = nx.from_pandas_edgelist(combs[combs['score']>=threshold],0,1, create_using=nx.Graph)

If two names meet the matching criterion, they become connected in our graph, so each interconnected cluster represents the same name. With this information we can create a dictionary that replaces every variant of a single name in our data with one unique form.
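For intuition, the connected-components step does not require networkx; a minimal pure-Python sketch using BFS over the matched pairs (hypothetical example names) shows the same grouping:

```python
from collections import defaultdict, deque

def connected_components(edges):
    """Group names into clusters: names linked by any chain of
    matches end up in the same set. Names with no edges are not
    included, mirroring the answer's len(cluster) != 1 filter."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, clusters = set(), []
    for node in adj:
        if node in seen:
            continue
        comp, queue = set(), deque([node])
        while queue:
            cur = queue.popleft()
            if cur in seen:
                continue
            seen.add(cur)
            comp.add(cur)
            queue.extend(adj[cur] - seen)
        clusters.append(comp)
    return clusters

# "kfc" matched "kfc.", and "kfc." matched "k.f.c" -> one cluster
clusters = connected_components(
    [("kfc", "kfc."), ("kfc.", "k.f.c"), ("mc", "mcdonalds")]
)
print(clusters)
```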

This code is a bit involved. In short, it builds a dataframe in which each row is one name and the columns are its variants. It then melts that dataframe and creates a dictionary whose keys are the name variants and whose values are the unique representation of the name. This dictionary lets us replace all deviating names in your dataframes with a single form so that groupby can function correctly:

connected_names = pd.DataFrame()
for cluster in nx.connected_components(G_name):
    if len(cluster) != 1:
        # DataFrame.append was removed in pandas 2.0; use pd.concat
        connected_names = pd.concat(
            [connected_names, pd.DataFrame([list(cluster)])]
        )

connected_names = (
    connected_names
    .reset_index(drop=True)
    .melt(id_vars=0)
    .drop('variable', axis=1)
    .dropna()
    .reset_index(drop=True)
    .set_index('value')
)

names_dict = connected_names.to_dict()[0]
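For the cluster example above, names_dict ends up mapping each variant to the cluster's representative (the name that landed in column 0). The shape is roughly this, with illustrative values:

```python
# Illustrative shape of names_dict for a cluster {"kfc", "kfc.", "k.f.c"}
# where "kfc" happened to be the representative:
names_dict = {"kfc.": "kfc", "k.f.c": "kfc"}

# replace() then collapses every variant to the representative
import pandas as pd
names = pd.Series(["kfc", "kfc.", "k.f.c", "mc"])
print(names.replace(names_dict).tolist())  # ['kfc', 'kfc', 'kfc', 'mc']
```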

Now we have the dictionary. All that remains is to replace the names and use the groupby method:

df1["name"] = df1["name"].replace(names_dict)
df2["name"] = df2["name"].replace(names_dict)

df_total = pd.concat([df1, df2], ignore_index=True)

df_total = df_total.groupby(["store code", "name"]).first()
