I have web-scraped some store info from two websites, so I have two dataframes, and I'd like to merge them into a single one. I have to match them on at least two columns, such as store code and name. The example datasets look like this:
store code | name | phone | email | website
---|---|---|---|---
A | KFC | 111-111-1111 | asdsa@as.com | aaaaa.com
A3 | Mc | | |
B1 | KFC | 222-222-2222 | |
store code2 | name2 | phone2 | email2 | website2
---|---|---|---|---
A | Kfc | +1111111111 | asdsa@as.com | aaaaa.com
A | Pizzahut | | |
B1 | KFC | +2222222222 | qwerty@kfc.com |
What I want may look like this:
store code | name | phone | email | website
---|---|---|---|---
A | KFC | 111-111-1111 | asdsa@as.com | aaaaa.com
A | Pizzahut | | |
A3 | Mc | | |
B1 | KFC | +2222222222 | qwerty@kfc.com |
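To make the snippets below reproducible, here is a minimal sketch of the two example frames. The rename of df2's "2"-suffixed columns is my assumption (the code below expects both frames to share the same column names), not something from the original scrape:

import pandas as pd
import numpy as np

# First scraped site
df1 = pd.DataFrame({
    "store code": ["A", "A3", "B1"],
    "name": ["KFC", "Mc", "KFC"],
    "phone": ["111-111-1111", np.nan, "222-222-2222"],
    "email": ["asdsa@as.com", np.nan, np.nan],
    "website": ["aaaaa.com", np.nan, np.nan],
})

# Second scraped site (columns carry a "2" suffix in the question)
df2 = pd.DataFrame({
    "store code2": ["A", "A", "B1"],
    "name2": ["Kfc", "Pizzahut", "KFC"],
    "phone2": ["+1111111111", np.nan, "+2222222222"],
    "email2": ["asdsa@as.com", np.nan, "qwerty@kfc.com"],
    "website2": ["aaaaa.com", np.nan, np.nan],
})

# Assumption: strip the "2" suffix so both frames share column names,
# otherwise the concat/groupby steps below cannot line the columns up
df2.columns = df1.columns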
Solution one:
If your data is as clean as you claim (there are no typos in the names in the example), then you can do this:
# Clean up the capitalization differences
df1["name"] = df1["name"].str.lower()
df2["name"] = df2["name"].str.lower()

# Stack the two frames and keep the first non-null value per column for each (store code, name) pair
df_total = pd.concat([df1, df2], ignore_index=True)
df_total = df_total.groupby(["store code", "name"]).first()
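Note that groupby(...).first() returns the first non-null value per column within each group, which is what fills the partially empty rows from the two sites. If you want store code and name back as regular columns, reset the index (a small follow-up sketch, assuming df_total from above):

# Bring the group keys back as regular columns and inspect the merged result
df_total = df_total.reset_index()
print(df_total)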
Solution two (if you have typos in the string values):
If there are typos in the names and you want to merge them using fuzzy matching, then you need to follow this:
import pandas as pd
import networkx as nx
from fuzzywuzzy import fuzz
import itertools
Let's normalize the letter case first so we are on the safe side:
df1["name"] = df1["name"].str.lower()
df2["name"] = df2["name"].str.lower()
Then let's start matching!
We need to make all combinations of the names from the two dataframes and put them in a dataframe, so we can use apply, which is faster than a for loop:
combs = list(itertools.product(df1["name"], df2["name"]))
combs = pd.DataFrame(combs)
Then we score each combination. fuzz.WRatio will do just fine, but you can use your own custom matching function:
combs['score'] = combs.apply(lambda x: fuzz.WRatio(x[0],x[1]), axis=1)
Now, let's make a graph out of it. I used a minimum score of 90 as the criterion; use whichever threshold suits you best:
threshold = 90
G_name = nx.from_pandas_edgelist(combs[combs['score']>=threshold],0,1, create_using=nx.Graph)
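To sanity-check the threshold, you can peek at the pairs that passed and at the resulting clusters (a quick inspection sketch, assuming the combs, threshold, and G_name variables from above):

# Which name pairs scored at or above the threshold
print(combs[combs['score'] >= threshold])

# Each connected component is a cluster of name variants treated as the same store name
print(list(nx.connected_components(G_name)))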
If two names fit the matching criterion, they become connected in our graph, so each connected cluster represents the same name. With this information we can build a dictionary that maps every variation of a name in our data to a single unique spelling.
The code below is a bit complex. In short, it creates a dataframe in which each row is one name and the columns hold its variations. It then melts that dataframe and builds a dictionary whose keys are the name variations and whose values are the unique representation of each name. This dictionary lets us replace all deviating names in your dataframes with a unique one so that the groupby
can function correctly:
# Collect every cluster of matched names; clusters of size 1 need no replacement
clusters = [list(cluster) for cluster in nx.connected_components(G_name) if len(cluster) != 1]
connected_names = pd.DataFrame(clusters)

# One row per cluster: column 0 is the canonical name, the other columns are its variations
connected_names = connected_names\
    .melt(id_vars=0)\
    .drop('variable', axis=1)\
    .dropna()\
    .reset_index(drop=True)\
    .set_index('value')

names_dict = connected_names.to_dict()[0]
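To see what ended up in the mapping, you can simply print it. For the toy data above the lowercased names already match exactly, so there is little to replace; with a real typo the dictionary might look like the purely hypothetical example in the comment:

# Purely hypothetical illustration of the mapping:
# names_dict == {'kentucky fried chicken': 'kfc', 'kfcc': 'kfc'}
print(names_dict)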
Now that we have the dictionary, all that remains is to replace the names and use the groupby
method:
df1["name"] = df1["name"].replace(names_dict)
df2["name"] = df2["name"].replace(names_dict)
df_total = df1.append(df2,ignore_index=True)
df_total = df_total.groupby(["store code","name"]).first()