
How to merge 2 datasets under complex conditions with pandas

I am trying to merge two datasets using pandas.

This is the master dataset:

id  num1   num2
0   5      8
1   2      9
2   8      7
3   9      6

This is the other one:

id2  num1_min  num1_max  num2_min  num2_max
0    1         3         8         10       
1    3         6         6         10
2    7         9         6         9

The output that I expect:

id  num1   num2  id2
0   5      8     1
1   2      9     0  
2   8      7     2
3   9      6     2

I want to add id2 to the master dataset by left joining the two on the condition that num1 lies between num1_min and num1_max and num2 lies between num2_min and num2_max. Each master row matches at most one id2 (or none, in which case id2 should be null), so the join will not produce duplicate rows.

Please advise.

This can be done with boolean masking, i.e. for each row of df, find the row of df1 that satisfies the join condition.

In [1]: import pandas as pd
In [2]: df
Out[2]: 
   id  num1  num2
0   0     5     8
1   1     2     9
2   2     8     7
3   3     9     6

In [3]: df1
Out[3]: 
   id2  num1_min  num1_max  num2_min  num2_max
0    0         1         3         8        10
1    1         3         6         6        10
2    2         7         9         6         9

#find id2 based on conditions
In [4]: df['id2'] = df.apply(lambda row: (((row['num1'] >= df1['num1_min']) &
                           (row['num1'] <= df1['num1_max'])) &
                          ((row['num2'] >= df1['num2_min']) &
                           (row['num2'] <= df1['num2_max']))).idxmax(), axis=1)

In [5]: df
Out[5]: 
   id  num1  num2  id2
0   0     5     8    1
1   1     2     9    0
2   2     8     7    2
3   3     9     6    2

Above, I used apply to go through the rows of df, check each row against the condition, and find the index in df1 of the row that satisfies it. idxmax returns the index label of the first True in the mask, which equals id2 here only because df1 has the default 0, 1, 2 index.
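One caveat with idxmax: when no row of df1 matches, the mask is all False and idxmax still returns the first index label (0), so an unmatched master row would silently get id2 0 instead of null. Below is a minimal sketch of a variant that returns a missing value in that case. It assumes the frames from the question are rebuilt by hand; the helper name find_id2 and the extra unmatched row are illustrative and not part of the original answer.

```python
import pandas as pd

# Frames rebuilt from the question. The last master row (num1=20) is an
# added, hypothetical row that falls outside every range in df1.
df = pd.DataFrame({"id": [0, 1, 2, 3, 4],
                   "num1": [5, 2, 8, 9, 20],
                   "num2": [8, 9, 7, 6, 8]})
df1 = pd.DataFrame({"id2": [0, 1, 2],
                    "num1_min": [1, 3, 7], "num1_max": [3, 6, 9],
                    "num2_min": [8, 6, 6], "num2_max": [10, 10, 9]})


def find_id2(row, ranges):
    """Return id2 of the first matching range, or pd.NA if none matches."""
    mask = ((row["num1"] >= ranges["num1_min"]) &
            (row["num1"] <= ranges["num1_max"]) &
            (row["num2"] >= ranges["num2_min"]) &
            (row["num2"] <= ranges["num2_max"]))
    if not mask.any():
        return pd.NA
    # Look up the id2 column rather than relying on the index label.
    return ranges.loc[mask.idxmax(), "id2"]


# Nullable integer dtype keeps id2 as integers while allowing <NA>.
df["id2"] = df.apply(find_id2, axis=1, ranges=df1).astype("Int64")
print(df)
```

Reading id2 from the column instead of using the index label also keeps the result correct if df1 is later re-indexed or filtered.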

EDIT

Another way to find id2:

df['id2'] = df.apply(lambda row: df1.loc[(((row['num1'] >= df1['num1_min']) &
                                           (row['num1'] <= df1['num1_max'])) &
                                          ((row['num2'] >= df1['num2_min']) &
                                           (row['num2'] <= df1['num2_max']))),
                                         'id2'].values[0], axis=1)
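Both apply versions loop over the master rows in Python. For larger frames, a vectorized alternative (a different technique, not the method from the answer above) is to cross join the two frames, keep the pairs that satisfy the range conditions, and left join the result back onto the master. A sketch, assuming pandas 1.2 or later for how="cross" and frames rebuilt by hand from the question; the names pairs, matched and result are illustrative:

```python
import pandas as pd

# Frames rebuilt from the question.
df = pd.DataFrame({"id": [0, 1, 2, 3],
                   "num1": [5, 2, 8, 9],
                   "num2": [8, 9, 7, 6]})
df1 = pd.DataFrame({"id2": [0, 1, 2],
                    "num1_min": [1, 3, 7], "num1_max": [3, 6, 9],
                    "num2_min": [8, 6, 6], "num2_max": [10, 10, 9]})

# Every master row paired with every range row: len(df) * len(df1) rows.
pairs = df.merge(df1, how="cross")

# Keep only the pairs where both values fall inside their (inclusive) ranges.
in_range = (pairs["num1"].between(pairs["num1_min"], pairs["num1_max"]) &
            pairs["num2"].between(pairs["num2_min"], pairs["num2_max"]))
matched = pairs.loc[in_range, ["id", "id2"]]

# The question says each master row matches at most one range; this guard
# keeps the first match if the ranges ever overlap.
matched = matched.drop_duplicates("id")

# Left join so that unmatched master rows are kept with a missing id2.
result = df.merge(matched, on="id", how="left")
print(result)
```

The cross join materializes every pair of rows, so memory use grows with the product of the two frame sizes; this suits small or moderately sized inputs.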

