
Pandas merge dataframes based on closest match

I have the following two DataFrames (df_a, df_b):

df_a

    N0_YLDF
0   11.79
1   7.86
2   5.78
3   5.35
4   6.32
5   11.79
6   6.89
7   10.74


df_b
    N0_YLDF N0_DWOC
0   6.29    4
1   2.32    4
2   9.10    4
3   4.89    4
4   10.22   4
5   3.80    3
6   5.55    3
7   6.36    3

I would like to add a column N0_DWOC to df_a, where each value is taken from the row of df_b whose N0_YLDF is closest to that row's df_a['N0_YLDF'].

Right now I am doing a simple merge, but that does not do what I want.

You could find the cutoff values that lie midway between the (sorted) values in df_b['N0_YLDF']. Then call pd.cut to categorize the values in df_a['N0_YLDF'], using the cutoff values as the bin edges:

import numpy as np
import pandas as pd

df_a = pd.DataFrame({ 'N0_YLDF': [11.79, 7.86, 5.78, 5.35, 6.32, 11.79, 6.89, 10.74]})
df_b = pd.DataFrame({ 'N0_YLDF':[6.29, 2.32, 9.10, 4.89, 10.22, 3.80, 5.55, 6.36] })

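# Sorted unique values of df_b['N0_YLDF']; labels[i] is the position in df_b of the i-th sorted value
edges, labels = np.unique(df_b['N0_YLDF'], return_index=True)
# Add half the gap to the next value to get the midpoints; the outer edges are -inf and +inf
edges = np.r_[-np.inf, edges + np.ediff1d(edges, to_end=np.inf)/2]
# Label each bin with the df_b index of the value that the bin surrounds
df_a['N0_DWOC'] = pd.cut(df_a['N0_YLDF'], bins=edges, labels=df_b.index[labels])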
print(df_a)

yields

In [293]: df_a
Out[293]: 
   N0_YLDF N0_DWOC
0    11.79       4
1     7.86       2
2     5.78       6
3     5.35       6
4     6.32       0
5    11.79       4
6     6.89       7
7    10.74       4
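To see why these bin edges give the nearest match, a smaller case may help. The sketch below uses a made-up array `ref` and made-up `queries` (both are assumptions for illustration, not data from the question). Every interior edge is the midpoint between two neighbouring sorted values, so any value that falls into a bin is closer to that bin's reference value than to any other.

```python
import numpy as np
import pandas as pd

# Hypothetical reference values, chosen only for illustration (already sorted)
ref = np.array([1.0, 4.0, 10.0])

# Interior edges are the midpoints between neighbouring values;
# the outer edges are -inf and +inf so that every query lands in some bin
mid = (ref[:-1] + ref[1:]) / 2
edges = np.r_[-np.inf, mid, np.inf]
print(edges)

# Hypothetical query values on either side of each midpoint
queries = pd.Series([0.0, 2.4, 2.6, 6.9, 7.1, 50.0])

# Each bin is labelled with the reference value it surrounds
nearest = pd.cut(queries, bins=edges, labels=ref)
print(nearest.tolist())
```

Because pd.cut closes its bins on the right by default, a query that sits exactly on a midpoint is assigned to the lower of the two neighbours.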

To join the two DataFrames on N0_DWOC, you could use:

print(df_a.join(df_b, on='N0_DWOC', rsuffix='_b'))

which yields

   N0_YLDF N0_DWOC  N0_YLDF_b
0    11.79       4      10.22
1     7.86       2       9.10
2     5.78       6       5.55
3     5.35       6       5.55
4     6.32       0       6.29
5    11.79       4      10.22
6     6.89       7       6.36
7    10.74       4      10.22
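In the code above, df_b has only the N0_YLDF column, so the new column in df_a holds the index label of the closest df_b row rather than an N0_DWOC value. If df_b also carries N0_DWOC, as in the question, one possible variation is to ask pd.cut for plain bin numbers and use them to look up N0_DWOC. This is a sketch under that assumption; the names `values`, `pos` and `bin_no` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

df_a = pd.DataFrame({'N0_YLDF': [11.79, 7.86, 5.78, 5.35, 6.32, 11.79, 6.89, 10.74]})
df_b = pd.DataFrame({'N0_YLDF': [6.29, 2.32, 9.10, 4.89, 10.22, 3.80, 5.55, 6.36],
                     'N0_DWOC': [4, 4, 4, 4, 4, 3, 3, 3]})

# Sorted unique values of df_b['N0_YLDF'] and the row position of each one in df_b
values, pos = np.unique(df_b['N0_YLDF'], return_index=True)

# Midpoints between neighbouring values, padded with -inf and +inf
edges = np.r_[-np.inf, (values[:-1] + values[1:]) / 2, np.inf]

# labels=False makes pd.cut return the integer bin number for each row of df_a
bin_no = pd.cut(df_a['N0_YLDF'], bins=edges, labels=False).to_numpy()

# Bin number -> row position in df_b -> N0_DWOC value of that row
df_a['N0_DWOC'] = df_b['N0_DWOC'].to_numpy()[pos[bin_no]]
print(df_a)
```

Using bin numbers sidesteps the fact that N0_DWOC contains repeated values, which would not work as unique bin labels.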

Another way is to subtract all pairs in the Cartesian product of the two columns and take, for each row of df_a, the index of the minimum absolute difference:

In [47]:ix = abs(np.atleast_2d(df_a['N0_YLDF']).T - df_b['N0_YLDF'].values).argmin(axis=1)
        ix
Out[47]: array([4, 2, 6, 6, 0, 4, 7, 4])

Then do

# ix holds row positions in df_b, so select with .iloc
df_a['N0_DWOC'] = df_b['N0_DWOC'].iloc[ix].values

In [73]: df_a
Out[73]:
   N0_YLDF  N0_DWOC
0    11.79        4
1     7.86        4
2     5.78        3
3     5.35        3
4     6.32        4
5    11.79        4
6     6.89        3
7    10.74        4
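A further option, not used in either approach above, may be pd.merge_asof with direction='nearest', which matches each left row to the right row with the closest key. The sketch below assumes a pandas version that provides this parameter. The names `left`, `right`, `merged` and `result`, and the renamed column N0_YLDF_b, are introduced here for illustration.

```python
import pandas as pd

df_a = pd.DataFrame({'N0_YLDF': [11.79, 7.86, 5.78, 5.35, 6.32, 11.79, 6.89, 10.74]})
df_b = pd.DataFrame({'N0_YLDF': [6.29, 2.32, 9.10, 4.89, 10.22, 3.80, 5.55, 6.36],
                     'N0_DWOC': [4, 4, 4, 4, 4, 3, 3, 3]})

# merge_asof requires both frames to be sorted by their key column.
# reset_index keeps df_a's original row labels in a column named 'index'
# so that the original order can be restored afterwards.
left = df_a.reset_index().sort_values('N0_YLDF')
right = df_b.sort_values('N0_YLDF').rename(columns={'N0_YLDF': 'N0_YLDF_b'})

# direction='nearest' picks the right row whose key is closest to the left key
merged = pd.merge_asof(left, right, left_on='N0_YLDF', right_on='N0_YLDF_b',
                       direction='nearest')

# Put the rows back in df_a's original order
result = merged.sort_values('index').set_index('index').rename_axis(None)
print(result)
```

Compared with the all-pairs subtraction, this avoids building a len(df_a) by len(df_b) array, which may matter for large inputs.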
