Pandas Series and NaN values for mismatched values
I have these two dictionaries:
import pandas as pd

dico = {'Name': ['Arthur', 'Henri', 'Lisiane', 'Patrice', 'Zadig', 'Sacha'],
        'Age': ['20', '18', '62', '73', '21', '20'],
        'Studies': ['Economics', 'Maths', 'Psychology', 'Medical', 'Cinema', 'CS']}
dico2 = {'Surname': ['Arthur1', 'Henri2', 'Lisiane3', 'Patrice4']}
dico = pd.DataFrame.from_dict(dico)
dico2 = pd.DataFrame.from_dict(dico2)
I would like to match the Surname column against the Name column and add the matched surnames to dico, to get the following output:
   Name     Surname   Age  Studies
0  Arthur   Arthur1   20   Economics
1  Henri    Henri2    18   Maths
2  Lisiane  Lisiane3  62   Psychology
3  Patrice  NaN       73   Medical
4  Zadig    NaN       21   Cinema
5  Sacha    NaN       20   CS
and ultimately delete the rows for which Surname is NaN:
   Name     Surname   Age  Studies
0  Arthur   Arthur1   20   Economics
1  Henri    Henri2    18   Maths
2  Lisiane  Lisiane3  62   Psychology
map_list = []
for name in dico['Name']:
    best_ratio = None
    for idx, surname in enumerate(dico2['Surname']):
        if best_ratio is None:
            best_ratio = fuzz.ratio(name, surname)
            best_idx = 0
        else:
            ratio = fuzz.ratio(name, surname)
            if ratio > best_ratio:
                best_ratio = ratio
                best_idx = idx
    map_list.append(dico2['Surname'][best_idx])  # obtain surname
dico['Surname'] = pd.Series(map_list)  # add column
dico = dico[["Name", "Surname", "Age", "Studies"]]  # reorder columns
# if the surname is not a great match, print "Nan"
dico = dico.drop(dico[dico.Surname == "NaN"].index)
but when I print(dico), the output is as follows:
   Name     Surname   Age  Studies
0  Arthur   Arthur1   20   Economics
1  Henri    Henri2    18   Maths
2  Lisiane  Lisiane3  62   Psychology
3  Patrice  Patrice4  73   Medical
4  Zadig    Patrice4  21   Cinema
5  Sacha    Patrice4  20   CS
I don't see why the rows after the Patrice row get a mismatched surname, when I want them to be "NaN".
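For what it's worth, the loop in the question always keeps the highest-scoring surname, however low that score is, so every name receives some surname and none can end up missing. The final drop also compares against the literal string "NaN", which never appears in the column. A minimal sketch of the same loop with a cutoff follows. It assumes a threshold of 90 (a hypothetical value to tune) and uses the standard library's difflib.SequenceMatcher as a stand-in for fuzz.ratio so that it runs without fuzzywuzzy; the helper names similarity and best_match are illustrative only, and the scores may differ slightly from fuzzywuzzy's.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a 0-100 similarity score (stand-in for fuzz.ratio, an assumption)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def best_match(name, candidates, threshold=90):
    """Return the best-scoring candidate, or None if its score is below threshold."""
    best, best_score = None, -1
    for cand in candidates:
        score = similarity(name, cand)
        if score > best_score:
            best, best_score = cand, score
    # The missing step in the original loop: reject weak best matches.
    return best if best_score >= threshold else None


names = ['Arthur', 'Henri', 'Lisiane', 'Patrice', 'Zadig', 'Sacha']
surnames = ['Arthur1', 'Henri2', 'Lisiane3', 'Patrice4']

matches = [best_match(n, surnames) for n in names]
print(matches)
```

The resulting list can be assigned with dico['Surname'] = matches, after which dico.dropna(subset=['Surname']) removes the unmatched rows, since pandas treats None in an object column as missing.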
You could do the following. Define the function:
def fuzzy_merge(df_1, df_2, key1, key2, threshold=90, limit=2):
    s = df_2[key2].tolist()
    m = df_1[key1].apply(lambda x: process.extract(x, s, limit=limit))
    df_1['Surname'] = m
    m2 = df_1['Surname'].apply(lambda x: ', '.join([i[0] for i in x if i[1] >= threshold]))
    df_1['Surname'] = m2
    return df_1
and run:
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
df = fuzzy_merge(dico, dico2, 'Name', 'Surname',threshold=90, limit=2)
This returns:
   Name     Age  Studies     Surname
0  Arthur   20   Economics   Arthur1
1  Henri    18   Maths       Henri2
2  Lisiane  62   Psychology  Lisiane3
3  Patrice  73   Medical     Patrice4
4  Zadig    21   Cinema
5  Sacha    20   CS
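Note that the unmatched rows come back with an empty string rather than NaN, so dropna alone would not remove them. A minimal sketch of the remaining cleanup, assuming the frame looks like the output above (hard-coded here so the snippet runs without fuzzywuzzy): mask the empty strings to NaN, drop those rows, and reorder the columns. The variable names are illustrative.

```python
import pandas as pd

# Assumed result of the fuzzy_merge call, copied from the output shown above.
df = pd.DataFrame({
    'Name': ['Arthur', 'Henri', 'Lisiane', 'Patrice', 'Zadig', 'Sacha'],
    'Age': ['20', '18', '62', '73', '21', '20'],
    'Studies': ['Economics', 'Maths', 'Psychology', 'Medical', 'Cinema', 'CS'],
    'Surname': ['Arthur1', 'Henri2', 'Lisiane3', 'Patrice4', '', ''],
})

# Replace empty strings with NaN so that dropna can see them as missing.
surname = df['Surname'].mask(df['Surname'] == '')

cleaned = (
    df.assign(Surname=surname)
      .dropna(subset=['Surname'])                 # drop rows without a match
      [['Name', 'Surname', 'Age', 'Studies']]     # column order from the question
)
print(cleaned)
```

Filtering with df[df['Surname'] != ''] would drop the same rows; going through NaN just keeps the missing-value handling in line with what the question asks for.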
Let's try pd.MultiIndex.from_product to create all (Name, Surname) combinations, then assign each a score with zip and fuzz.ratio and filter on a threshold to build a dict. We can then use Series.map and DataFrame.dropna:
from fuzzywuzzy import fuzz
comb = pd.MultiIndex.from_product((dico['Name'],dico2['Surname']))
scores = comb.map(lambda x: fuzz.ratio(*x)) #or fuzz.partial_ratio(*x)
d = dict(a for a,b in zip(comb,scores) if b>90) #change threshold
out = dico.assign(SurName=dico['Name'].map(d)).dropna(subset=['SurName'])
print(out)
   Name     Age  Studies     SurName
0  Arthur   20   Economics   Arthur1
1  Henri    18   Maths       Henri2
2  Lisiane  62   Psychology  Lisiane3
3  Patrice  73   Medical     Patrice4