[英]Fuzzy matching inside a column
假设我有一个像这样的运动清单:
sports=["futball","fitbal","football","tennis","tenis","tenisse","footbal","zennis","ping-pong"]
我想创建一个数据框,如果模糊匹配值大于0.5,并且不只是与自身匹配,则将最接近运动的每个元素匹配。 (我想为此使用函数fuzzywuzzy.fuzz.ratio(x,y))
结果应如下所示:
pd.DataFrame({"sport":sports,"closest_match":["futball","futball","football","tennis","tennis","tennis","futball","tennis","ping-pong"]})
sport closest_match
0 futball futball
1 fitbal futball
2 football football
3 tennis tennis
4 tenis tennis
5 tenisse tennis
6 footbal futball
7 zennis tennis
8 ping-pong ping-pong
谢谢
这是使用itertools.combinations的解决方案:
from fuzzywuzzy import fuzz
import pandas as pd
sports = ["futball", "fitbal", "football", "tennis", "tenis", "tenisse", "footbal", "zennis", "ping-pong"]
dist = ([x for x in itertools.combinations(sports, 2) if fuzz.ratio(*x) > 50])
df = pd.DataFrame(dist, columns=["sport","closest"])
df['ratio'] = dist = ([fuzz.ratio(*x) for x in itertools.combinations(sports, 2) if fuzz.ratio(*x) > 50])
print(df)
df = df.groupby(['sport'])[['closest','ratio']].agg('max').reset_index()
输出:
sport closest ratio
0 fitbal football 77
1 football footbal 93
2 futball football 80
3 tenis zennis 83
4 tenisse zennis 62
5 tennis zennis 91
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.