Merge two dataframes on multiple columns but only merge on columns if both not NaN

I'm looking to merge two dataframes across multiple columns, but with some additional conditions.

import pandas as pd
df1 = pd.DataFrame({
    'col1': ['a','b','c', 'd'],
    'optional_col2': ['X',None,'Z','V'],
    'optional_col3': [None,'def', 'ghi','jkl']
})

df2 = pd.DataFrame({
    'col1': ['a','b','c', 'd'],
    'optional_col2': ['X','Y','Z','W'],
    'optional_col3': ['abc', 'def', 'ghi','mno']
})

I would like to always join on col1, but then also try to join on optional_col2 and optional_col3. In df1, the value can be NaN for either column, but it is always populated in df2. I would like the join to be valid when col1 plus at least one of optional_col2 or optional_col3 matches.

This would result in rows ['a', 'b', 'c'] joining, due to a match on optional_col2, a match on optional_col3, and an exact match on both, respectively.

In SQL I suppose you could write the join like this, if it helps explain further:

select
    *
from
    df1
        inner join
    df2
        on df1.col1 = df2.col1
        AND (df1.optional_col2 = df2.optional_col2 OR df1.optional_col3 = df2.optional_col3)

I've messed around with pd.merge but can't figure out how to do a complex operation like this. I think I can do a merge on ['col1', 'optional_col2'], then a second merge on ['col1', 'optional_col3'], then union the results and drop duplicates?
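For what it's worth, the two-merge idea could be sketched roughly as below. This is only a sketch under some assumptions: the names keys2, keys3, match2 and match3 are made up for illustration, the output values are taken from df2 (since it is always populated), and rows with a missing key in df1 are dropped before each merge because pandas matches NaN keys against each other.

```python
import pandas as pd

df1 = pd.DataFrame({
    'col1': ['a', 'b', 'c', 'd'],
    'optional_col2': ['X', None, 'Z', 'V'],
    'optional_col3': [None, 'def', 'ghi', 'jkl'],
})
df2 = pd.DataFrame({
    'col1': ['a', 'b', 'c', 'd'],
    'optional_col2': ['X', 'Y', 'Z', 'W'],
    'optional_col3': ['abc', 'def', 'ghi', 'mno'],
})

# First merge: col1 + optional_col2, using only df1 rows where that key is present.
keys2 = df1.dropna(subset=['optional_col2'])[['col1', 'optional_col2']]
match2 = df2.merge(keys2, on=['col1', 'optional_col2'])

# Second merge: col1 + optional_col3, again skipping missing keys in df1.
keys3 = df1.dropna(subset=['optional_col3'])[['col1', 'optional_col3']]
match3 = df2.merge(keys3, on=['col1', 'optional_col3'])

# Union the two results and drop rows that matched on both keys.
merged_df = (
    pd.concat([match2, match3])
    .drop_duplicates()
    .sort_values('col1')
    .reset_index(drop=True)
)
print(merged_df)
```

One caveat: drop_duplicates would also collapse rows that are genuinely duplicated in df2, so this only behaves like the SQL join if df2 has no fully identical rows.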

The expected DataFrame would be something like:

merged_df = pd.DataFrame({
    'col1': ['a', 'b', 'c'],
    'optional_col2': ['X', 'Y', 'Z'],
    'optional_col3': ['abc', 'def', 'ghi']
})

I think you can achieve what you want by filling in the NaNs in df1's optional columns with the values from df2 before joining, i.e.

# fillna aligns on the row index, so this assumes df1 and df2 rows line up by index
df1["optional_col2"] = df1["optional_col2"].fillna(df2["optional_col2"])
df1["optional_col3"] = df1["optional_col3"].fillna(df2["optional_col3"])

pd.merge(df1, df2, on=["col1", "optional_col2", "optional_col3"])

This gives your expected answer of

  col1 optional_col2 optional_col3
0    a             X           abc
1    b             Y           def
2    c             Z           ghi
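Note that the fillna approach requires every column that is populated in df1 to match, and it pairs rows by index rather than by col1. If you need the OR condition from the SQL in the question, one possible alternative is to join on col1 only and then filter. The sketch below is a suggestion, not a tested general solution: the suffixes _df1 and _df2 and the names joined, match and result are arbitrary choices for illustration, and it relies on comparisons against a missing value evaluating to False.

```python
import pandas as pd

df1 = pd.DataFrame({
    'col1': ['a', 'b', 'c', 'd'],
    'optional_col2': ['X', None, 'Z', 'V'],
    'optional_col3': [None, 'def', 'ghi', 'jkl'],
})
df2 = pd.DataFrame({
    'col1': ['a', 'b', 'c', 'd'],
    'optional_col2': ['X', 'Y', 'Z', 'W'],
    'optional_col3': ['abc', 'def', 'ghi', 'mno'],
})

# Join on col1 only; the overlapping optional columns get a suffix per side.
joined = df1.merge(df2, on='col1', suffixes=('_df1', '_df2'))

# Keep rows where at least one optional column agrees.
# A missing value in df1 compares as False, so it never counts as a match.
match = (
    (joined['optional_col2_df1'] == joined['optional_col2_df2'])
    | (joined['optional_col3_df1'] == joined['optional_col3_df2'])
)

# Report the df2 values, since df2 is always populated.
result = (
    joined.loc[match, ['col1', 'optional_col2_df2', 'optional_col3_df2']]
    .rename(columns={
        'optional_col2_df2': 'optional_col2',
        'optional_col3_df2': 'optional_col3',
    })
    .reset_index(drop=True)
)
print(result)
```

Joining on col1 alone produces every pairing of rows that share a col1 value before the filter is applied, so this may be slow if col1 has many repeated values.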
