Merge two dataframes on multiple columns but only merge on columns if both not NaN
I'm looking to merge two DataFrames on multiple columns, but with some additional conditions.
import pandas as pd
df1 = pd.DataFrame({
'col1': ['a','b','c', 'd'],
'optional_col2': ['X',None,'Z','V'],
'optional_col3': [None,'def', 'ghi','jkl']
})
df2 = pd.DataFrame({
'col1': ['a','b','c', 'd'],
'optional_col2': ['X','Y','Z','W'],
'optional_col3': ['abc', 'def', 'ghi','mno']
})
I would like to always join on col1, but then also try to join on optional_col2 and optional_col3. In df1, the value can be NaN for both columns, but it is always populated in df2. I would like the join to be valid when col1 plus at least one of optional_col2 or optional_col3 matches.
This would result in ['a', 'b', 'c'] joining, due to a match on optional_col2 only, a match on optional_col3 only, and an exact match on both columns, respectively.
In SQL I suppose you could write the join like this, if it helps explain further:
select
*
from
df1
inner join
df2
on df1.col1 = df2.col1
AND (df1.optional_col2 = df2.optional_col2 OR df1.optional_col3 = df2.optional_col3)
I've messed around with pd.merge but can't figure out how to do a complex operation like this. I think I can do a merge on ['col1', 'optional_col2'], then a second merge on ['col1', 'optional_col3'], then union and drop duplicates?
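A minimal sketch of that two-merge idea, assuming the result should carry df2's values for the optional columns. The names m2 and m3 are illustrative only. Rows with a missing key are dropped from df1 before each merge, because pandas matches null keys against each other.

```python
import pandas as pd

df1 = pd.DataFrame({
    'col1': ['a', 'b', 'c', 'd'],
    'optional_col2': ['X', None, 'Z', 'V'],
    'optional_col3': [None, 'def', 'ghi', 'jkl'],
})
df2 = pd.DataFrame({
    'col1': ['a', 'b', 'c', 'd'],
    'optional_col2': ['X', 'Y', 'Z', 'W'],
    'optional_col3': ['abc', 'def', 'ghi', 'mno'],
})

# First merge: col1 + optional_col2. Only the key columns are taken from df1,
# so the remaining column (optional_col3) comes from df2.
m2 = df1[['col1', 'optional_col2']].dropna().merge(
    df2, on=['col1', 'optional_col2'])

# Second merge: col1 + optional_col3, with optional_col2 coming from df2.
m3 = df1[['col1', 'optional_col3']].dropna().merge(
    df2, on=['col1', 'optional_col3'])

# Union the two results and drop rows that matched on both keys.
merged_df = (
    pd.concat([m2, m3])
    .drop_duplicates()
    .sort_values('col1')
    .reset_index(drop=True)
)
print(merged_df)
```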
The expected DataFrame would be something like:
merged_df = pd.DataFrame({
'col1': ['a', 'b', 'c'],
'optional_col2': ['X', 'Y', 'Z'],
'optional_col3': ['abc', 'def', 'ghi']
})
I think you can achieve what you want by filling in the NaNs in the columns of df1 with values from df2 before joining, i.e.:
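# Fill gaps in df1 from df2; fillna aligns on the row index, so this assumes the rows of df1 and df2 line up
df1["optional_col2"] = df1["optional_col2"].fillna(df2["optional_col2"])
df1["optional_col3"] = df1["optional_col3"].fillna(df2["optional_col3"])
# With the gaps filled, a plain merge on all three columns works
pd.merge(df1, df2, on=["col1", "optional_col2", "optional_col3"])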
This gives your expected answer of:
col1 optional_col2 optional_col3
0 a X abc
1 b Y def
2 c Z ghi
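Note that the fillna approach relies on df1 and df2 sharing the same row index, and after filling it requires every non-null optional column in df1 to match, which is stricter than the OR in the SQL. If the rows are not aligned, or a true OR is wanted, one alternative is to merge on col1 alone and then filter. The following is a sketch under those assumptions; the suffixes and the names candidates and match are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({
    'col1': ['a', 'b', 'c', 'd'],
    'optional_col2': ['X', None, 'Z', 'V'],
    'optional_col3': [None, 'def', 'ghi', 'jkl'],
})
df2 = pd.DataFrame({
    'col1': ['a', 'b', 'c', 'd'],
    'optional_col2': ['X', 'Y', 'Z', 'W'],
    'optional_col3': ['abc', 'def', 'ghi', 'mno'],
})

# Join on col1 only; the overlapping optional columns get the suffixes below.
candidates = df1.merge(df2, on='col1', suffixes=('_df1', '_df2'))

# A missing value compared with == yields False, so gaps in df1 never count
# as a match on their own.
match = (
    (candidates['optional_col2_df1'] == candidates['optional_col2_df2'])
    | (candidates['optional_col3_df1'] == candidates['optional_col3_df2'])
)

# Keep the matching rows and report df2's (always populated) values.
merged_df = (
    candidates.loc[match, ['col1', 'optional_col2_df2', 'optional_col3_df2']]
    .rename(columns={
        'optional_col2_df2': 'optional_col2',
        'optional_col3_df2': 'optional_col3',
    })
    .reset_index(drop=True)
)
print(merged_df)
```

Because the intermediate merge keeps every pair of rows that share col1, this can get large when col1 has many repeated values in both frames.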