
I have been trying to apply the fuzzywuzzy package to solve a problem of finding fraud entries. How do I apply it to the following problem?

I want to use the fuzzywuzzy package on the following table:

x   Reference   amount
121 TOR1234        500
121 T0R1234        500
121 W7QWER         500
121 W1QWER         500
141 TRYCATC        700
141 TRYCATC        700
151 I678MKV        300
151 1678MKV        300
  1. Group the table where the columns 'x' and 'amount' match.
  2. For each reference in a group, compare it (with fuzzywuzzy) against the other references in that group:
     a. where the match is 100%, delete them;
     b. where the match is 90-99.99%, keep them;
     c. delete anything below a 90% match for that particular row.

The expected output:
 x  Reference  amount
151 I678MKV 300
151 1678MKV 300
121 TOR1234 500
121 T0R1234 500
121 W7QWER  500
121 W1QWER  500

This is to detect fraud entries: as in the tables, '1' is replaced by 'I' and '0' is replaced by 'O'. If you have any alternative solution, please suggest it.
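For reference, the grouping-and-comparison steps described above can be sketched with fuzzywuzzy-style scoring directly. This is a minimal sketch: it uses the standard library's difflib.SequenceMatcher as a stand-in for fuzz.ratio (fuzzywuzzy's pure-Python fallback is built on it), and note that a single-character substitution in a 6-7 character reference scores only about 83-86, so the question's 90% cutoff is lowered to 80 here to reproduce the expected output.

```python
import pandas as pd
from itertools import combinations
from difflib import SequenceMatcher


def ratio(a, b):
    # Stand-in for fuzzywuzzy's fuzz.ratio, scaled to 0-100.
    return round(100 * SequenceMatcher(None, a, b).ratio())


df = pd.DataFrame({"x": [121, 121, 121, 121, 141, 141, 151, 151],
                   "Reference": ["TOR1234", "T0R1234", "W7QWER", "W1QWER",
                                 "TRYCATC", "TRYCATC", "I678MKV", "1678MKV"],
                   "amount": [500, 500, 500, 500, 700, 700, 300, 300]})

keep, drop = set(), set()
# 1. group where 'x' and 'amount' match
for _, grp in df.groupby(["x", "amount"]):
    # 2. compare each reference against the others in the group
    for i, j in combinations(grp.index, 2):
        score = ratio(grp.at[i, "Reference"], grp.at[j, "Reference"])
        if score == 100:       # a. exact duplicates: delete both rows
            drop.update([i, j])
        elif score >= 80:      # b. near matches (O/0, I/1 swaps): keep both
            keep.update([i, j])

# c. anything never involved in a near match is dropped implicitly
res = df.loc[sorted(keep - drop)]
print(res)
```

With this data, the TRYCATC pair scores 100 and is dropped, while the O/0, I/1 and 7/1 variants score 83-86 and are kept.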

From what I understand, you don't need the fuzzywuzzy approach; use a simple drop_duplicates with keep=False:

import pandas as pd

df = pd.DataFrame(data={"x":[121,121,121,121,141,141,151,151],
                   "Refrence":["TOR1234","T0R1234","W7QWER","W1QWER","TRYCATC","TRYCATC"
                               ,"I678MKV","1678MKV"],
                   "amount":[500,500,500,500,700,700,300,300]})

# keep=False drops every copy of a fully duplicated row, not just the later ones
res = df.drop_duplicates(['x','Refrence','amount'],keep=False).sort_values(['x'],ascending=[False])

print(res)
     x Refrence  amount
6  151  I678MKV     300
7  151  1678MKV     300
0  121  TOR1234     500
1  121  T0R1234     500
2  121   W7QWER     500
3  121   W1QWER     500

Apply the Levenshtein distance to the Refrence values within the same x:

from itertools import combinations
from similarity.damerau import Damerau  # from the "strsim" package

# Damerau-Levenshtein distance: substitutions, insertions, deletions, transpositions
damerau = Damerau()

# all pairs of references that survived the duplicate drop
data = list(combinations(res['Refrence'], 2))

refrence_df = pd.DataFrame(data, columns=['Refrence','Refrence2'])

# attach the x value belonging to each side of the pair
refrence_df = pd.merge(refrence_df, df[['x','Refrence']], on=['Refrence'], how='left')
refrence_df = pd.merge(refrence_df, df[['x','Refrence']], left_on=['Refrence2'], right_on=['Refrence'], how='left')

refrence_df.rename(columns={'x_x':'x_1','x_y':'x_2','Refrence_x':'Refrence'}, inplace=True)

refrence_df.drop(['Refrence_y'], axis=1, inplace=True)

# only compare references that sit in the same x group
refrence_df = refrence_df[refrence_df['x_1'] == refrence_df['x_2']]

refrence_df['edit_required'] = refrence_df.apply(lambda x: damerau.distance(x['Refrence'], x['Refrence2']),
                                                 axis=1)

refrence_df['characters_not_common'] = refrence_df.apply(lambda x: list(set(x['Refrence']) - set(x['Refrence2'])), axis=1)
print(refrence_df)
    Refrence Refrence2  x_1  x_2  edit_required characters_not_common
0   I678MKV   1678MKV  151  151              1                   [I]
9   TOR1234   T0R1234  121  121              1                   [O]
10  TOR1234    W7QWER  121  121              7    [O, T, 1, 3, 2, 4]
11  TOR1234    W1QWER  121  121              7       [O, T, 3, 2, 4]
12  T0R1234    W7QWER  121  121              7    [T, 1, 0, 3, 2, 4]
13  T0R1234    W1QWER  121  121              7       [T, 0, 3, 2, 4]
14   W7QWER    W1QWER  121  121              1                   [7]
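To finish producing the expected output, the pair table above can be reduced to the suspect references by thresholding edit_required. This is a sketch under an assumption: the cutoff of one edit is inferred from the question's examples (O/0, I/1, 7/1 swaps), not stated in it, and the pair table is reconstructed inline so the snippet runs on its own.

```python
import pandas as pd

# pairwise results as printed above, reconstructed inline for a runnable sketch
pairs = pd.DataFrame({
    "Refrence":      ["I678MKV", "TOR1234", "TOR1234", "TOR1234",
                      "T0R1234", "T0R1234", "W7QWER"],
    "Refrence2":     ["1678MKV", "T0R1234", "W7QWER", "W1QWER",
                      "W7QWER", "W1QWER", "W1QWER"],
    "edit_required": [1, 1, 7, 7, 7, 7, 1],
})

# keep only pairs one edit apart: the O/0 and I/1 style swaps
close = pairs[pairs["edit_required"] <= 1]

# every reference that occurs in such a pair is a suspect entry
suspect = sorted(set(close["Refrence"]) | set(close["Refrence2"]))
print(suspect)
```

Filtering the original frame with df[df['Refrence'].isin(suspect)] then yields exactly the rows in the expected output.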
