Compute distance between strings of two pandas DataFrames

I have two data frames:

df1:
Date       Name   Num  
2013-11-24 Banana 22.1 
2013-11-24 Orange  8.6 
2013-11-24 Apple   7.6 
2013-11-24 Celery 10.2 

df2:
Date       Name    Num
2013-11-24 Celery  22.1
2013-11-24 0r@nge   8.6
2013-11-24 @ppl3    7.6
2013-11-24 BananaX 10.2

I want to find similar rows, so I need the similarity of Name between the two data frames. Right now I iterate over each row of one data frame and compute its similarity with every row of the other data frame (which is very time-consuming), find the maximum value, and if it is greater than a certain threshold I do something with it.

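import pandas as pd
from fuzzywuzzy import fuzz

# dfD and dfS are the two data frames being compared
dfResult = pd.DataFrame()
for indexD, rowD in dfD.iterrows():
    rows = []  # one single-row frame of scores per row of dfS
    for indexS, rowS in dfS.iterrows():
        data = pd.DataFrame({"ratio": fuzz.token_set_ratio(rowD['Name'], rowS['Name']),
                             "indexD": rowD['Num'], "indexS": rowS['Num']}, index=[indexS])
        rows.append(data)
    # collect the scores for this rowD, then pick the best-scoring row
    dfTMP = pd.concat(rows)
    maxMatch = dfTMP.loc[dfTMP['ratio'].idxmax()]
    ......
    ......
    ......
    resultMatch = create_match_row(maxMatch, dfD, dfS)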

After each iteration I get:

indexD      1
indexS      4
ratio     100
Name: 3, dtype: int64
1
indexD     2
indexS     1
ratio     35
Name: 0, dtype: int64
2
indexD     3
indexS     3
ratio     45
Name: 2, dtype: int64
3
indexD     4
indexS     4
ratio     33
Name: 3, dtype: int64

from which the max function should return:

    indexD      1
    indexS      4
    ratio     100

This means row 1 of data frame 1 is similar to row 4 of data frame 2.

Is there a better way, so that I can compute the distance in one shot and remove the inner loop, and find the best match in the second data frame for each row (name) in the first data frame?

Expected output: for each row in data frame 1, I would like to get a data frame (just a simple index) that shows which row in data frame 2 is the most similar one.

IIUIC, here's one way:

In [3456]: def get_fuzz(df, w):
      ...:     s = df['Name'].apply(lambda y: fuzz.token_set_ratio(y, w))
      ...:     idx = s.idxmax()
      ...:     return {'name': df['Name'].loc[idx], 'index': idx, 'val': s.max()}
      ...:

In [3457]: df1['Name'].apply(lambda x: get_fuzz(df2, x))
Out[3457]:
0    {u'index': 3, u'name': u'BananaX', u'val': 92}
1     {u'index': 1, u'name': u'0r@nge', u'val': 67}
2      {u'index': 2, u'name': u'@ppl3', u'val': 67}
3    {u'index': 0, u'name': u'Celery', u'val': 100}
Name: Name, dtype: object
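
A variant of the same idea is to build the full table of pairwise scores once and read the best match per row from it. The sketch below is only an illustration under assumptions: `similarity` is a hypothetical stand-in scorer based on the standard library's `difflib.SequenceMatcher`, not `fuzz.token_set_ratio`, so its scores can differ from the ones shown above, and the frames are rebuilt with only the Name column. It still makes one scorer call per pair of names, so it tidies the bookkeeping rather than reducing the amount of work.

```python
from difflib import SequenceMatcher

import pandas as pd


def similarity(a, b):
    # Stand-in scorer (NOT fuzz.token_set_ratio): difflib ratio scaled to 0-100
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


# Only the Name column from the question's frames is needed here
df1 = pd.DataFrame({'Name': ['Banana', 'Orange', 'Apple', 'Celery']})
df2 = pd.DataFrame({'Name': ['Celery', '0r@nge', '@ppl3', 'BananaX']})

# Rows follow df1's index, columns follow df2's index
scores = pd.DataFrame(
    [[similarity(a, b) for b in df2['Name']] for a in df1['Name']],
    index=df1.index,
    columns=df2.index,
)

best = pd.DataFrame({
    'match_index': scores.idxmax(axis=1),  # label of the best-scoring row in df2
    'match_val': scores.max(axis=1),       # that row's score
})
print(best)
```

Swapping `similarity` for `fuzz.token_set_ratio` should give the same layout with fuzzywuzzy's scores.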

Assign the result to df1, if you need to:

In [3458]: df1.assign(search=df1['Name'].apply(lambda x: get_fuzz(df2, x)))
Out[3458]:
         Date    Name   Num                                          search
0  2013-11-24  Banana  22.1  {u'index': 3, u'name': u'BananaX', u'val': 92}
1  2013-11-24  Orange   8.6   {u'index': 1, u'name': u'0r@nge', u'val': 67}
2  2013-11-24   Apple   7.6    {u'index': 2, u'name': u'@ppl3', u'val': 67}
3  2013-11-24  Celery  10.2  {u'index': 0, u'name': u'Celery', u'val': 100}
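
A column of dicts is awkward to filter, so it may help to expand it into ordinary columns before applying the threshold mentioned in the question. The following is a minimal sketch: the dicts are copied by hand from the output above rather than recomputed, and the `match_` prefix and the `THRESHOLD` value of 80 are assumptions made for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({
    'Name': ['Banana', 'Orange', 'Apple', 'Celery'],
    'Num': [22.1, 8.6, 7.6, 10.2],
})

# Hand-copied stand-in for the 'search' column shown above
search = pd.Series([
    {'index': 3, 'name': 'BananaX', 'val': 92},
    {'index': 1, 'name': '0r@nge', 'val': 67},
    {'index': 2, 'name': '@ppl3', 'val': 67},
    {'index': 0, 'name': 'Celery', 'val': 100},
], index=df1.index)

# One column per dict key, prefixed to avoid clashing with df1's columns
expanded = pd.DataFrame(search.tolist(), index=df1.index).add_prefix('match_')
result = df1.join(expanded)

THRESHOLD = 80  # assumed cut-off, not taken from the question
good = result[result['match_val'] >= THRESHOLD]
print(good)
```

With these values only the Banana and Celery rows pass the cut-off.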

Details

In [3459]: df1
Out[3459]:
         Date    Name   Num
0  2013-11-24  Banana  22.1
1  2013-11-24  Orange   8.6
2  2013-11-24   Apple   7.6
3  2013-11-24  Celery  10.2

In [3460]: df2
Out[3460]:
         Date     Name   Num
0  2013-11-24   Celery  22.1
1  2013-11-24   0r@nge   8.6
2  2013-11-24    @ppl3   7.6
3  2013-11-24  BananaX  10.2
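
To try the snippets above, the two frames can be rebuilt along these lines. This sketch assumes the Date column is kept as plain strings, since the original post does not show its dtype.

```python
import pandas as pd

# Values copied from the printed frames; Date is left as strings (assumption)
df1 = pd.DataFrame({
    'Date': ['2013-11-24'] * 4,
    'Name': ['Banana', 'Orange', 'Apple', 'Celery'],
    'Num': [22.1, 8.6, 7.6, 10.2],
})

df2 = pd.DataFrame({
    'Date': ['2013-11-24'] * 4,
    'Name': ['Celery', '0r@nge', '@ppl3', 'BananaX'],
    'Num': [22.1, 8.6, 7.6, 10.2],
})

print(df1)
print(df2)
```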
