Fuzzy match in a column in a dataframe in Python
I have a column that contains strings. I want to do a fuzzy match within that column and mark, in a column next to it, the rows that have an 80% match with some other row. I can do it with the following code on a smaller dataset, but my original dataset is too big for this to work efficiently. Is there a better way to do this? This is what I have done.
import pandas as pd
from fuzzywuzzy import fuzz

l = [[1,'a','b','c','help pls'],[2,'a','c','c','yooo'],[3,'a','c','c','you will not pass'],[4,'a','b','b','You shall not pass'],[5,'a','c','c','You shall not pass!']]
df = pd.DataFrame(l,columns = ['Serial No','one','two','three','four'])
df['yes/no 2'] = ""
# Compare every row of column 'four' against every other row
for i in range(0, df.shape[0]):
    for j in range(0, df.shape[0]):
        if (i != j):
            if (fuzz.token_sort_ratio(df.iloc[i,df.shape[1]-2],df.iloc[j,df.shape[1]-2]) > 80):
                df.iloc[i,df.shape[1]-1] = "yes"
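For readers unfamiliar with the scorer used above: a token-sort ratio lowercases both strings, drops punctuation, sorts the words, and then compares the results, so word order and a trailing "!" do not affect the score. The sketch below approximates that idea with the standard library only. `token_sort_score` is a hypothetical helper, and `difflib` is swapped in for fuzzywuzzy's scorer, so the numbers may differ from what `fuzz.token_sort_ratio` returns.

```python
import re
from difflib import SequenceMatcher


def token_sort_score(a, b):
    """Rough stand-in for a token-sort ratio; returns an int from 0 to 100.

    Assumption: this only approximates fuzz.token_sort_ratio and is not
    fuzzywuzzy's actual implementation.
    """
    def normalize(s):
        # Lowercase, turn anything that is not a letter or digit into a
        # space, then sort the remaining words alphabetically.
        tokens = re.sub(r"[^0-9a-z]+", " ", s.lower()).split()
        return " ".join(sorted(tokens))

    ratio = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
    return round(100 * ratio)


print(token_sort_score("You shall not pass", "You shall not pass!"))  # 100
print(token_sort_score("help pls", "yooo"))                           # 0
```

Because case and punctuation are normalized away before comparing, the last two rows of the sample data count as identical under this kind of scorer.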
import pandas as pd
from fuzzywuzzy import fuzz

l = [[1,'a','b','c','help pls'],[2,'a','c','c','yooo'],[3,'a','c','c','you will not pass'],[4,'a','b','b','You shall not pass'],[5,'a','c','c','You shall not pass!']]
df = pd.DataFrame(l,columns = ['Serial No','one','two','three','four'])

def match(row):
    thresh = 80
    return fuzz.token_sort_ratio(row["two"],row["three"])>thresh

df["Yes/No"] = df.apply(match,axis=1)
print(df)
   Serial No one two three                 four  Yes/No
0          1   a   b     c             help pls   False
1          2   a   c     c                 yooo    True
2          3   a   c     c    you will not pass    True
3          4   a   b     b   You shall not pass    True
4          5   a   c     c  You shall not pass!    True
import pandas as pd
from fuzzywuzzy import fuzz,process

l = [[1,'a','b','c','help pls'],[2,'a','c','c','yooo'],[3,'a','c','c','you will not pass'],[4,'a','b','b','You shall not pass'],[5,'a','c','c','You shall not pass!']]
df = pd.DataFrame(l,columns = ['Serial No','one','two','three','four']).reset_index()

def match(df,col):
    thresh = 80
    # x is an (index, text) tuple; compare its text against every other row's text
    return df[col].apply(lambda x:"Yes" if len(process.extractBests(x[1],[xx[1] for i,xx in enumerate(df[col]) if i!=x[0]],
        scorer=fuzz.token_sort_ratio,score_cutoff=thresh+1,limit=1))>0 else "No")

df["five"] = df.apply(lambda x:(x["index"],x["four"]),axis=1)
df["Yes/No"] = df.pipe(match,"five")
print(df)
   index  Serial No one two three                 four                      five Yes/No
0      0          1   a   b     c             help pls             (0, help pls)     No
1      1          2   a   c     c                 yooo                 (1, yooo)     No
2      2          3   a   c     c    you will not pass    (2, you will not pass)    Yes
3      3          4   a   b     b   You shall not pass   (3, You shall not pass)    Yes
4      4          5   a   c     c  You shall not pass!  (4, You shall not pass!)    Yes
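Since the score is symmetric, one way to cut the work on a larger column is to normalize each string once and compare every unordered pair only once, flagging both rows on a hit. The sketch below illustrates this with the standard library only. `flag_near_duplicates` and `normalize` are hypothetical names, and `difflib.SequenceMatcher` stands in for `fuzz.token_sort_ratio`, so scores may differ slightly from fuzzywuzzy's. It is still quadratic in the worst case; it only roughly halves the comparisons and skips pairs that are already flagged.

```python
import re
from difflib import SequenceMatcher
from itertools import combinations


def normalize(s):
    # Lowercase, replace non-alphanumerics with spaces, sort the words.
    tokens = re.sub(r"[^0-9a-z]+", " ", s.lower()).split()
    return " ".join(sorted(tokens))


def flag_near_duplicates(values, thresh=80):
    """Return one boolean per value: True if some other value scores > thresh.

    Assumption: difflib is a stand-in scorer, not fuzzywuzzy's own.
    """
    normalized = [normalize(v) for v in values]  # done once per string
    flags = [False] * len(values)
    # combinations() yields each unordered pair (i, j) with i < j exactly once
    for i, j in combinations(range(len(values)), 2):
        if flags[i] and flags[j]:
            continue  # both rows already matched; nothing new to learn
        ratio = SequenceMatcher(None, normalized[i], normalized[j]).ratio()
        if 100 * ratio > thresh:
            flags[i] = flags[j] = True
    return flags


texts = ["help pls", "yooo", "you will not pass",
         "You shall not pass", "You shall not pass!"]
print(flag_near_duplicates(texts))
```

With a dataframe, the result could be assigned back with something like `df["Yes/No"] = flag_near_duplicates(df["four"].tolist())`, which also avoids the repeated `df.iloc` lookups inside the loop.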