Compute distance between strings of two pandas DataFrames
I have two data frames:
df1:
Date Name Num
2013-11-24 Banana 22.1
2013-11-24 Orange 8.6
2013-11-24 Apple 7.6
2013-11-24 Celery 10.2
df2:
Date Name Num
2013-11-24 Celery 22.1
2013-11-24 0r@nge 8.6
2013-11-24 @ppl3 7.6
2013-11-24 BananaX 10.2
I want to find similar rows, and for that I need to measure the similarity of Name between the two data frames. Right now I iterate over each data frame and compute the similarity of every row against every row of the other data frame (which is very time-consuming), find the maximum value, and, if it is greater than a certain threshold, do something with it.
import pandas as pd
from fuzzywuzzy import fuzz

dfResult = pd.DataFrame()
for indexD, rowD in dfD.iterrows():
    for indexS, rowS in dfS.iterrows():
        # score one (rowD, rowS) pair of names
        data = pd.DataFrame({"ratio": fuzz.token_set_ratio(rowD['Name'], rowS['Name']),
                             "indexD": rowD['Num'], "indexS": rowS['Num']}, index=[indexS])
    # dfTMP holds the scores for the current rowD (the step that fills it is not shown)
    maxMatch = dfTMP.loc[dfTMP['ratio'].idxmax()]
    ......
    ......
    ......
    resultMatch = create_match_row(maxMatch, dfD, dfS)
After each iteration I get:
indexD 1
indexS 4
ratio 100
Name: 3, dtype: int64
1
indexD 2
indexS 1
ratio 35
Name: 0, dtype: int64
2
indexD 3
indexS 3
ratio 45
Name: 2, dtype: int64
3
indexD 4
indexS 4
ratio 33
Name: 3, dtype: int64
from which the max function should return:
indexD 1
indexS 4
ratio 100
This means row 1 of data frame 1 is similar to row 4 of data frame 2.
Is there a better way, so that I can compute the distances in one shot, remove the inner loop, and find the best match in the second data frame for each row (name) of the first data frame?
Expected output: for each row in data frame 1, I would like to get a data frame (just a simple index) that shows which row in data frame 2 is the most similar one.
IIUC, here's one way:
In [3456]: def get_fuzz(df, w):
      ...:     s = df['Name'].apply(lambda y: fuzz.token_set_ratio(y, w))
      ...:     idx = s.idxmax()  # index label of the best-scoring row
      ...:     return {'name': df['Name'].loc[idx], 'index': idx, 'val': s.max()}
      ...:
In [3457]: df1['Name'].apply(lambda x: get_fuzz(df2, x))
Out[3457]:
0 {u'index': 3, u'name': u'BananaX', u'val': 92}
1 {u'index': 1, u'name': u'0r@nge', u'val': 67}
2 {u'index': 2, u'name': u'@ppl3', u'val': 67}
3 {u'index': 0, u'name': u'Celery', u'val': 100}
Name: Name, dtype: object
Use assign to add the result to df1, if you need:
In [3458]: df1.assign(search=df1['Name'].apply(lambda x: get_fuzz(df2, x)))
Out[3458]:
Date Name Num search
0 2013-11-24 Banana 22.1 {u'index': 3, u'name': u'BananaX', u'val': 92}
1 2013-11-24 Orange 8.6 {u'index': 1, u'name': u'0r@nge', u'val': 67}
2 2013-11-24 Apple 7.6 {u'index': 2, u'name': u'@ppl3', u'val': 67}
3 2013-11-24 Celery 10.2 {u'index': 0, u'name': u'Celery', u'val': 100}
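The search column above holds one dict per row, which is awkward to filter on. A minimal sketch of one way to expand it into ordinary columns and apply the threshold mentioned in the question follows. The column names match_name, match_index and score, and the cutoff of 90, are illustrative assumptions and not part of the original answer; the dict values are copied from the output above.

```python
import pandas as pd

# Per-row results in the shape returned by get_fuzz
# (values copied from the output shown above).
search = pd.Series([
    {'name': 'BananaX', 'index': 3, 'val': 92},
    {'name': '0r@nge', 'index': 1, 'val': 67},
    {'name': '@ppl3', 'index': 2, 'val': 67},
    {'name': 'Celery', 'index': 0, 'val': 100},
])

# Build one column per dict key, keeping the original row labels.
expanded = pd.DataFrame(search.tolist(), index=search.index)

# Hypothetical, more descriptive column names.
expanded = expanded.rename(
    columns={'name': 'match_name', 'index': 'match_index', 'val': 'score'}
)

threshold = 90  # assumed cutoff, chosen only for illustration
good = expanded[expanded['score'] >= threshold]
print(good)
```

With real data, search would be df1['Name'].apply(lambda x: get_fuzz(df2, x)), and expanded could be joined back onto df1 with df1.join(expanded).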
Details
In [3459]: df1
Out[3459]:
Date Name Num
0 2013-11-24 Banana 22.1
1 2013-11-24 Orange 8.6
2 2013-11-24 Apple 7.6
3 2013-11-24 Celery 10.2
In [3460]: df2
Out[3460]:
Date Name Num
0 2013-11-24 Celery 22.1
1 2013-11-24 0r@nge 8.6
2 2013-11-24 @ppl3 7.6
3 2013-11-24 BananaX 10.2
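If fuzzywuzzy is not available, the same "best match per name" idea can be sketched with difflib.SequenceMatcher from the Python standard library. This is a stand-in scorer and not what the answer above uses, so the scores will generally differ from fuzz.token_set_ratio; the helper names similarity and best_matches are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Score two strings from 0 to 100 (stand-in for fuzz.token_set_ratio)."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def best_matches(queries, choices):
    """For each query, return (position of best choice, best choice, score)."""
    results = []
    for q in queries:
        scores = [similarity(q, c) for c in choices]
        # position of the highest score; ties resolve to the first one
        idx = max(range(len(choices)), key=scores.__getitem__)
        results.append((idx, choices[idx], scores[idx]))
    return results


names1 = ["Banana", "Orange", "Apple", "Celery"]   # df1['Name']
names2 = ["Celery", "0r@nge", "@ppl3", "BananaX"]  # df2['Name']

matches = best_matches(names1, names2)
for name, (idx, match, score) in zip(names1, matches):
    print(name, "->", idx, match, score)
```

On the sample names this picks rows 3, 1, 2 and 0 of df2, the same rows as the fuzzywuzzy-based result. It is still a pairwise comparison of every name against every other name; it only moves the inner loop out of iterrows.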