
Fuzzy match in a column in a dataframe in python

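I have a column containing strings. I want to do a fuzzy match within that column and, in an adjacent column, flag the rows that match another row with a score above 80%. I can do this on a smaller dataset with the code below, but my original dataset is too large for it to run efficiently. Is there a better way? This is what I have done.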

import pandas as pd
from fuzzywuzzy import fuzz

l = [[1,'a','b','c','help pls'],[2,'a','c','c','yooo'],[3,'a','c','c','you will not pass'],[4,'a','b','b','You shall not pass'],[5,'a','c','c','You shall not pass!']]
df = pd.DataFrame(l,columns = ['Serial No','one','two','three','four'])

df['yes/no 2'] = ""

# Compare every row's text with every other row's text (n * (n - 1) scorer calls).
for i in range(0, df.shape[0]):
    for j in range(0, df.shape[0]):
        if (i != j):
            # Column shape[1]-2 is 'four' (the text); column shape[1]-1 is the 'yes/no 2' flag.
            if (fuzz.token_sort_ratio(df.iloc[i,df.shape[1]-2],df.iloc[j,df.shape[1]-2]) > 80):
                df.iloc[i,df.shape[1]-1] = "yes"
# Row-wise variant: compares column 'two' with column 'three' inside each row.
import pandas as pd
from fuzzywuzzy import fuzz

l = [[1,'a','b','c','help pls'],[2,'a','c','c','yooo'],[3,'a','c','c','you will not pass'],[4,'a','b','b','You shall not pass'],[5,'a','c','c','You shall not pass!']]
df = pd.DataFrame(l,columns = ['Serial No','one','two','three','four'])

def match(row):
    thresh = 80
    # True when the two cells of this row score above the threshold.
    return fuzz.token_sort_ratio(row["two"],row["three"])>thresh


df["Yes/No"] = df.apply(match,axis=1)
print(df)

   Serial No one two three                 four  Yes/No
0          1   a   b     c             help pls   False
1          2   a   c     c                 yooo    True
2          3   a   c     c    you will not pass    True
3          4   a   b     b   You shall not pass    True
4          5   a   c     c  You shall not pass!    True
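# Column-wise variant: compares each value of 'four' with every other row's value.
import pandas as pd
from fuzzywuzzy import fuzz,process

l = [[1,'a','b','c','help pls'],[2,'a','c','c','yooo'],[3,'a','c','c','you will not pass'],[4,'a','b','b','You shall not pass'],[5,'a','c','c','You shall not pass!']]
df = pd.DataFrame(l,columns = ['Serial No','one','two','three','four']).reset_index()

def match(df,col):
    # Each value in df[col] is an (index, text) tuple. A row gets "Yes" when at least
    # one other row's text scores thresh+1 or more, and "No" otherwise.
    thresh = 80
    return df[col].apply(lambda x:"Yes" if len(process.extractBests(x[1],[xx[1] for i,xx in enumerate(df[col]) if i!=x[0]],
            scorer=fuzz.token_sort_ratio,score_cutoff=thresh+1,limit=1))>0 else "No")


# Pair each text with its row index so match() can leave the row itself out of the candidates.
df["five"] = df.apply(lambda x:(x["index"],x["four"]),axis=1)
df["Yes/No"] = df.pipe(match,"five")
print(df)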


   index  Serial No one two three                 four                      five Yes/No
0      0          1   a   b     c             help pls             (0, help pls)     No
1      1          2   a   c     c                 yooo                 (1, yooo)     No
2      2          3   a   c     c    you will not pass    (2, you will not pass)    Yes
3      3          4   a   b     b   You shall not pass   (3, You shall not pass)    Yes
4      4          5   a   c     c  You shall not pass!  (4, You shall not pass!)    Yes
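
Both versions above still score every row against every other row, which is what hurts on a large dataset. A rough sketch of one way to cut the work down, using only the standard library: normalize each string once, collapse exact duplicates, compare each unordered pair of distinct strings a single time, and skip pairs whose cheap upper bound cannot beat the threshold. This is an illustration under stated assumptions, not a drop-in replacement: `normalize` and `flag_fuzzy_duplicates` are hypothetical helper names, and `difflib.SequenceMatcher` stands in for `fuzz.token_sort_ratio`, so scores may differ slightly from fuzzywuzzy's.

```python
import re
from collections import Counter
from difflib import SequenceMatcher
from itertools import combinations


def normalize(text):
    """Lowercase, replace non-alphanumerics with spaces, and sort the tokens.

    Assumption: this only approximates the preprocessing behind token_sort_ratio.
    """
    tokens = re.sub(r"[^0-9a-z]+", " ", str(text).lower()).split()
    return " ".join(sorted(tokens))


def flag_fuzzy_duplicates(values, threshold=80):
    """Return "Yes"/"No" per value: "Yes" if some other value scores above threshold."""
    keys = [normalize(v) for v in values]
    counts = Counter(keys)
    # A non-empty key seen more than once already has an identical partner.
    matched = {k for k, n in counts.items() if n > 1 and k}
    cutoff = threshold / 100
    # Each unordered pair of distinct keys is scored at most once.
    # Assumption: the score is treated as symmetric in its two arguments.
    for a, b in combinations(counts, 2):
        if a in matched and b in matched:
            continue  # nothing new to learn from this pair
        sm = SequenceMatcher(None, a, b, autojunk=False)
        # real_quick_ratio() and quick_ratio() are cheap upper bounds on ratio().
        if sm.real_quick_ratio() <= cutoff or sm.quick_ratio() <= cutoff:
            continue
        if sm.ratio() > cutoff:
            matched.update((a, b))
    return ["Yes" if k in matched else "No" for k in keys]


sample = ["help pls", "yooo", "you will not pass",
          "You shall not pass", "You shall not pass!"]
flags = flag_fuzzy_duplicates(sample)
print(flags)
```

With a dataframe this could be called as `df["Yes/No"] = flag_fuzzy_duplicates(df["four"])`. On the five sample strings it should flag the last three rows, in line with the output above. The worst case is still quadratic in the number of distinct strings, so for very large columns a C-backed scorer or some form of blocking (for example, only comparing strings of similar length) would probably still be needed.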


Notice: technical posts on this site follow the CC BY-SA 4.0 license. If you need to repost, please cite this site's URL or the original address. For any questions, contact: yoyou2525@163.com.