简体   繁体   English

合并/加入有条件的熊猫数据框

[英]Merge/Join pandas dataframe with condition

I have a two pandas DataFrame df1 and df2 .我有两个熊猫 DataFrame df1df2 The relationship between them is one-to-many, and in some instances it can be one-to-one.它们之间的关系是一对多的,在某些情况下可以是一对一的。 When the relationship is one-to-many, I'd like to join columns with certain conditions.当关系是一对多时,我想加入具有某些条件的列。 I'll illustrate with some data.我会用一些数据来说明。

import pandas as pd

df1 = pd.DataFrame({
                    'vid': [1, 2, 3, 4, 5],
                    'lid': [6, 7, 8, 9, 10],
                    'v': [3, 5, 6, 1, 9]
                  })

df2 = pd.DataFrame({
                    'lid': [6, 6, 8, 8, 10],
                    'av': ['$10','$5','$4','$3','$2'],
                    'cr': [0.04, 0.05, 0.03, 0.04, 0.01]
                  })

For rows where there are multiple joins in df2 ie lid 6 and 8 , I'd like to apply some function say, get the max of av and cr .对于df2中有多个连接的行,即lid 68 ,我想应用一些函数,比如获取avcrmax

Expected output:预期输出:

vid lid  v  av      cr
1    6   3  $10     0.05
2    7   5  np.nan  np.nan
3    8   6  $5      0.04
4    9   1  np.nan  np.nan
5    10  9  $2      0.01

For match by max or by min by both columns create helper column tmp and join new DataFrame created by sorting per columns lid and tmp with remove duplicates per lid :对于两列的最大匹配或最小匹配,创建帮助列tmp并加入通过对每个列lidtmp进行排序创建的新 DataFrame ,并删除每个lid的重复项:

df2['tmp'] = list(zip(df2['av'].str.strip('$').astype(int), df2['cr']))

#sorting by ascending and desceding for match by maximal of tuple in col tmp
df = (df1.merge(df2.sort_values(['lid','tmp'], ascending=[True, False])
                   .drop_duplicates('lid'), how='left', on='lid')
                   .drop('tmp', axis=1))
print (df)
   vid  lid  v   av    cr
0    1    6  3  $10  0.04
1    2    7  5  NaN   NaN
2    3    8  6   $4  0.03
3    4    9  1  NaN   NaN
4    5   10  9   $2  0.01

df2['tmp'] = list(zip(df2['av'].str.strip('$').astype(int), df2['cr']))

#sorting both ascending for match by minimal of tuple in col tmp
df = (df1.merge(df2.sort_values(['lid','tmp'])
                   .drop_duplicates('lid'), how='left', on='lid')
                   .drop('tmp', axis=1))
print (df)
   vid  lid  v   av    cr
0    1    6  3   $5  0.05
1    2    7  5  NaN   NaN
2    3    8  6   $3  0.04
3    4    9  1  NaN   NaN
4    5   10  9   $2  0.01

EDIT: If aggregate max or mean aggregation working for each column separately, so ouput is different like solutions above:编辑:如果聚合maxmean聚合分别为每一列工作,那么输出与上面的解决方案不同:

df2['tmp'] = df2['av'].str.strip('$').astype(int)

df = df1.merge(df2.groupby('lid').max(), how='left', on='lid')
print (df)
   vid  lid  v   av    cr   tmp
0    1    6  3   $5  0.05  10.0
1    2    7  5  NaN   NaN   NaN
2    3    8  6   $4  0.04   4.0
3    4    9  1  NaN   NaN   NaN
4    5   10  9   $2  0.01   2.0

df = df1.merge(df2.groupby('lid').mean(), how='left', on='lid')
print (df)
   vid  lid  v     cr  tmp
0    1    6  3  0.045  7.5
1    2    7  5    NaN  NaN
2    3    8  6  0.035  3.5
3    4    9  1    NaN  NaN
4    5   10  9  0.010  2.0

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM