Merge/Join pandas DataFrames with a condition
I have two pandas DataFrames, df1 and df2. The relationship between them is one-to-many, and in some instances it can be one-to-one. When the relationship is one-to-many, I'd like to join columns under certain conditions. I'll illustrate with some data.
import pandas as pd

df1 = pd.DataFrame({
    'vid': [1, 2, 3, 4, 5],
    'lid': [6, 7, 8, 9, 10],
    'v': [3, 5, 6, 1, 9]
})
df2 = pd.DataFrame({
    'lid': [6, 6, 8, 8, 10],
    'av': ['$10', '$5', '$4', '$3', '$2'],
    'cr': [0.04, 0.05, 0.03, 0.04, 0.01]
})
For rows that have multiple matches in df2 (i.e. lid 6 and 8), I'd like to apply some function, say, take the max of av and cr.
Expected output:
vid  lid  v  av      cr
1    6    3  $10     0.05
2    7    5  np.nan  np.nan
3    8    6  $4      0.04
4    9    1  np.nan  np.nan
5    10   9  $2      0.01
To match by the max or the min across both columns together, create a helper column tmp that holds (av, cr) tuples, sort df2 by lid and tmp, drop duplicates per lid so only the first row of each group remains, and merge the result into df1:
df2['tmp'] = list(zip(df2['av'].str.strip('$').astype(int), df2['cr']))
# sort lid ascending and tmp descending so the largest tuple per lid comes first
df = (df1.merge(df2.sort_values(['lid', 'tmp'], ascending=[True, False])
                   .drop_duplicates('lid'), how='left', on='lid')
         .drop('tmp', axis=1))
print(df)
   vid  lid  v   av    cr
0    1    6  3  $10  0.04
1    2    7  5  NaN   NaN
2    3    8  6   $4  0.03
3    4    9  1  NaN   NaN
4    5   10  9   $2  0.01
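The sort on tmp relies on how Python orders tuples: element by element, left to right, so av decides first and cr only breaks ties. A minimal illustration in plain Python, using the lid 6 pairs from df2 plus a made-up tie (not in the original data) to show the fallback:

```python
# (av, cr) pairs for lid 6, as stored in the tmp column
pairs = [(10, 0.04), (5, 0.05)]

largest = max(pairs)   # av decides: 10 > 5, cr is never looked at
smallest = min(pairs)

# Hypothetical tie on av, so the second element decides
ties = [(4, 0.03), (4, 0.04)]
tie_winner = max(ties)

print(largest, smallest, tie_winner)
```

So the row that gets picked is the one with the largest av, and it carries its own cr along, which is not necessarily the largest cr in the group.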
df2['tmp'] = list(zip(df2['av'].str.strip('$').astype(int), df2['cr']))
# sort both lid and tmp ascending so the smallest tuple per lid comes first
df = (df1.merge(df2.sort_values(['lid', 'tmp'])
                   .drop_duplicates('lid'), how='left', on='lid')
         .drop('tmp', axis=1))
print(df)
   vid  lid  v   av    cr
0    1    6  3   $5  0.05
1    2    7  5  NaN   NaN
2    3    8  6   $3  0.04
3    4    9  1  NaN   NaN
4    5   10  9   $2  0.01
EDIT: If you aggregate with max or mean instead, the aggregation works on each column separately, so the output differs from the solutions above:
df2['tmp'] = df2['av'].str.strip('$').astype(int)
df = df1.merge(df2.groupby('lid').max(), how='left', on='lid')
print (df)
   vid  lid  v   av    cr   tmp
0    1    6  3   $5  0.05  10.0
1    2    7  5  NaN   NaN   NaN
2    3    8  6   $4  0.04   4.0
3    4    9  1  NaN   NaN   NaN
4    5   10  9   $2  0.01   2.0
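Note that av comes back as $5 for lid 6, not $10. As far as I can tell, this is because av is still a string column and strings are compared character by character, so '$5' sorts after '$10'. A quick check in plain Python:

```python
values = ['$10', '$5']

# String comparison: '$' ties, then '5' > '1', so '$5' counts as the largest
string_max = max(values)

# Numeric comparison after stripping the dollar sign
numeric_max = max(int(s.strip('$')) for s in values)

print(string_max, numeric_max)
```

That is why the numeric helper column tmp is the one to trust here.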
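# numeric_only=True skips the string column av; newer pandas versions raise an error without it
df = df1.merge(df2.groupby('lid').mean(numeric_only=True), how='left', on='lid')
print(df)
   vid  lid  v     cr  tmp
0    1    6  3  0.045  7.5
1    2    7  5    NaN  NaN
2    3    8  6  0.035  3.5
3    4    9  1    NaN  NaN
4    5   10  9  0.010  2.0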
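If the goal is the per-column max with av compared as a number and shown in its original $ form, one possible sketch is named aggregation on a numeric helper column. The names av_num, agg and result are my own placeholders, not from the question:

```python
import pandas as pd

df1 = pd.DataFrame({
    'vid': [1, 2, 3, 4, 5],
    'lid': [6, 7, 8, 9, 10],
    'v': [3, 5, 6, 1, 9],
})
df2 = pd.DataFrame({
    'lid': [6, 6, 8, 8, 10],
    'av': ['$10', '$5', '$4', '$3', '$2'],
    'cr': [0.04, 0.05, 0.03, 0.04, 0.01],
})

# av_num is a hypothetical helper column holding av as an integer,
# so max compares numbers instead of strings.
agg = (df2.assign(av_num=df2['av'].str.strip('$').astype(int))
          .groupby('lid', as_index=False)
          .agg(av_num=('av_num', 'max'), cr=('cr', 'max')))

# Rebuild the '$' string from the numeric maximum.
agg['av'] = '$' + agg['av_num'].astype(str)

# Left merge keeps every row of df1; lids without a match get NaN.
result = df1.merge(agg[['lid', 'av', 'cr']], how='left', on='lid')
print(result)
```

With the sample data this gives $10 and 0.05 for lid 6 and $4 and 0.04 for lid 8, each value being the maximum of its own column. assign returns a copy, so df2 itself is left unchanged.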