I have been trying to apply the fuzzywuzzy package to find fraudulent entries. How do I apply it to the following problem?
I want to use the fuzzywuzzy package on the following table:
x Reference amount
121 TOR1234 500
121 T0R1234 500
121 W7QWER 500
121 W1QWER 500
141 TRYCATC 700
141 TRYCATC 700
151 I678MKV 300
151 1678MKV 300
Expected output:
x Reference amount
151 I678MKV 300
151 1678MKV 300
121 TOR1234 500
121 T0R1234 500
121 W7QWER 500
121 W1QWER 500
This is to detect fraudulent entries: as in the tables, '1' is replaced by 'I' and '0' is replaced by 'O'. If you have any alternative solution, please suggest it.
From what I understand, you don't need the fuzzywuzzy package approach. Use a simple drop_duplicates with keep=False:
import pandas as pd

df = pd.DataFrame(data={"x": [121, 121, 121, 121, 141, 141, 151, 151],
                        "Refrence": ["TOR1234", "T0R1234", "W7QWER", "W1QWER",
                                     "TRYCATC", "TRYCATC", "I678MKV", "1678MKV"],
                        "amount": [500, 500, 500, 500, 700, 700, 300, 300]})
# keep=False drops every row that has an exact duplicate, so only the near-duplicates remain
res = df.drop_duplicates(['x', 'Refrence', 'amount'], keep=False).sort_values(['x'], ascending=[False])
print(res)
     x  Refrence  amount
6  151   I678MKV     300
7  151   1678MKV     300
0  121   TOR1234     500
1  121   T0R1234     500
2  121    W7QWER     500
3  121    W1QWER     500
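Since the question mentions that '1' is swapped with 'I' and '0' with 'O', one alternative is to map those look-alike characters to a single canonical form and then check which references collide. The following is only a minimal sketch using the standard library: the `CONFUSABLES` mapping and the helper names `normalize` and `find_lookalikes` are assumptions made for illustration, not part of pandas or fuzzywuzzy.

```python
from collections import defaultdict

# Assumed confusable pairs, taken from the question: 'I' looks like '1', 'O' looks like '0'.
CONFUSABLES = str.maketrans({"I": "1", "O": "0"})


def normalize(reference: str) -> str:
    """Map look-alike characters to one canonical form."""
    return reference.upper().translate(CONFUSABLES)


def find_lookalikes(rows):
    """Group (x, reference, amount) rows whose references collide after normalization."""
    groups = defaultdict(set)
    for x, reference, amount in rows:
        groups[(x, normalize(reference), amount)].add(reference)
    # Only groups with more than one distinct spelling are suspicious;
    # exact duplicates collapse into a single spelling and are skipped.
    return {key: sorted(refs) for key, refs in groups.items() if len(refs) > 1}


rows = [
    (121, "TOR1234", 500), (121, "T0R1234", 500),
    (121, "W7QWER", 500), (121, "W1QWER", 500),
    (141, "TRYCATC", 700), (141, "TRYCATC", 700),
    (151, "I678MKV", 300), (151, "1678MKV", 300),
]
suspicious = find_lookalikes(rows)
print(suspicious)
```

Note that this only catches the substitutions listed in the mapping. A pair such as W7QWER / W1QWER is not flagged, because '7' and '1' are not in the assumed mapping; for those an edit-distance comparison is needed.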
from itertools import combinations
# Damerau-Levenshtein distance; the `similarity` module is assumed to be the python-string-similarity package
from similarity.damerau import Damerau

damerau = Damerau()

# Build every pair of the remaining references
data = list(combinations(res['Refrence'], 2))
refrence_df = pd.DataFrame(data, columns=['Refrence', 'Refrence2'])

# Attach the x value of each reference in the pair
refrence_df = pd.merge(refrence_df, df[['x', 'Refrence']], on=['Refrence'], how='left')
refrence_df = pd.merge(refrence_df, df[['x', 'Refrence']], left_on=['Refrence2'], right_on=['Refrence'], how='left')
refrence_df.rename(columns={'x_x': 'x_1', 'x_y': 'x_2', 'Refrence_x': 'Refrence'}, inplace=True)
refrence_df.drop(['Refrence_y'], axis=1, inplace=True)

# Compare only references that share the same x
refrence_df = refrence_df[refrence_df['x_1'] == refrence_df['x_2']].copy()

# Number of edits needed to turn one reference into the other
refrence_df['edit_required'] = refrence_df.apply(
    lambda row: damerau.distance(row['Refrence'], row['Refrence2']), axis=1)
# Characters of the first reference that do not occur in the second
refrence_df['characters_not_common'] = refrence_df.apply(
    lambda row: list(set(row['Refrence']) - set(row['Refrence2'])), axis=1)
print(refrence_df)
    Refrence  Refrence2  x_1  x_2  edit_required  characters_not_common
0    I678MKV    1678MKV  151  151              1                    [I]
9    TOR1234    T0R1234  121  121              1                    [O]
10   TOR1234     W7QWER  121  121              7     [O, T, 1, 3, 2, 4]
11   TOR1234     W1QWER  121  121              7        [O, T, 3, 2, 4]
12   T0R1234     W7QWER  121  121              7     [T, 1, 0, 3, 2, 4]
13   T0R1234     W1QWER  121  121              7        [T, 0, 3, 2, 4]
14    W7QWER     W1QWER  121  121              1                    [7]
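If a similarity score in the style the question asks about is preferred over an edit count, the same pairwise comparison can be sketched with the standard library's `difflib.SequenceMatcher`, used here as a stand-in so the example runs without extra packages. fuzzywuzzy's `fuzz.ratio` also returns a 0 to 100 score and could be swapped in, though its exact values may differ. The helper names `similarity` and `similar_pairs` and the threshold of 80 are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> int:
    """Similarity score from 0 (nothing shared) to 100 (identical)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def similar_pairs(rows, threshold=80):
    """Return (x, ref_a, ref_b, score) for distinct references that share the same x."""
    by_x = {}
    for x, reference in rows:
        # A set drops exact duplicates, so only different spellings get compared.
        by_x.setdefault(x, set()).add(reference)
    pairs = []
    for x, refs in by_x.items():
        for a, b in combinations(sorted(refs), 2):
            score = similarity(a, b)
            if score >= threshold:  # assumed cut-off, tune it on real data
                pairs.append((x, a, b, score))
    return pairs


rows = [
    (121, "TOR1234"), (121, "T0R1234"), (121, "W7QWER"), (121, "W1QWER"),
    (141, "TRYCATC"), (141, "TRYCATC"), (151, "I678MKV"), (151, "1678MKV"),
]
pairs = similar_pairs(rows)
for pair in pairs:
    print(pair)
```

With the sample data this flags the three pairs that differ by a single character and leaves out the unrelated combinations such as TOR1234 / W7QWER. The threshold is the main design choice: too low and unrelated references are paired, too high and longer substitutions slip through.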