Merge two dataframes on multiple columns, but only join on a column when its value is not NaN

Question

I'm looking to merge two dataframes across multiple columns but with some additional conditions.

import pandas as pd
df1 = pd.DataFrame({
    'col1': ['a','b','c', 'd'],
    'optional_col2': ['X',None,'Z','V'],
    'optional_col3': [None,'def', 'ghi','jkl']
})

df2 = pd.DataFrame({
    'col1': ['a','b','c', 'd'],
    'optional_col2': ['X','Y','Z','W'],
    'optional_col3': ['abc', 'def', 'ghi','mno']
})

I would like to always join on col1, but then also try to join on optional_col2 and optional_col3. In df1, the value can be NaN in either column, but it is always populated in df2. I would like the join to be valid when col1 plus at least one of optional_col2 or optional_col3 matches.

This would result in rows ['a', 'b', 'c'] joining: 'a' matches on optional_col2, 'b' matches on optional_col3, and 'c' matches on both.

In SQL I suppose you could write the join like this, if it helps explain further:

select
    *
from
    df1
        inner join
    df2
        on df1.col1 = df2.col1
        AND (df1.optional_col2 = df2.optional_col2 OR df1.optional_col3 = df2.optional_col3)

I've messed around with pd.merge but can't figure out how to do a complex operation like this. I think I can do a merge on ['col1', 'optional_col2'], then a second merge on ['col1', 'optional_col3'], then union and drop duplicates?
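The two-merge idea can be sketched roughly as follows. This is only an illustration under a few assumptions: the output should carry df2's values, rows of df1 with a missing key are dropped before each merge so that a NaN never counts as a match, and the names left2, left3, m2 and m3 are made up for this example.

```python
import pandas as pd

df1 = pd.DataFrame({
    'col1': ['a', 'b', 'c', 'd'],
    'optional_col2': ['X', None, 'Z', 'V'],
    'optional_col3': [None, 'def', 'ghi', 'jkl'],
})
df2 = pd.DataFrame({
    'col1': ['a', 'b', 'c', 'd'],
    'optional_col2': ['X', 'Y', 'Z', 'W'],
    'optional_col3': ['abc', 'def', 'ghi', 'mno'],
})

# First merge: col1 + optional_col2, ignoring df1 rows where optional_col2 is missing.
left2 = df1.dropna(subset=['optional_col2'])[['col1', 'optional_col2']]
m2 = df2.merge(left2, on=['col1', 'optional_col2'])

# Second merge: col1 + optional_col3, ignoring df1 rows where optional_col3 is missing.
left3 = df1.dropna(subset=['optional_col3'])[['col1', 'optional_col3']]
m3 = df2.merge(left3, on=['col1', 'optional_col3'])

# Union of both results; rows that matched on both columns appear twice, so deduplicate.
merged_df = (
    pd.concat([m2, m3])
    .drop_duplicates()
    .sort_values('col1')
    .reset_index(drop=True)
)
print(merged_df)
```

One caveat with this sketch: drop_duplicates would also collapse rows that are genuinely duplicated in df2, which may or may not be what you want.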

Expected DataFrame would be something like:

merged_df = pd.DataFrame({
    'col1': ['a', 'b', 'c'],
    'optional_col2': ['X', 'Y', 'Z'],
    'optional_col3': ['abc', 'def', 'ghi']
})

Answer 1

I think you can achieve what you want by filling in the NaNs in df1's columns with values from df2 before joining, i.e.

df1["optional_col2"] = df1["optional_col2"].fillna(df2["optional_col2"])
df1["optional_col3"] = df1["optional_col3"].fillna(df2["optional_col3"])

pd.merge(df1, df2, on=["col1", "optional_col2", "optional_col3"])

This gives your expected answer of

  col1 optional_col2 optional_col3
0    a             X           abc
1    b             Y           def
2    c             Z           ghi
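Two things are worth keeping in mind with the fillna approach. Series.fillna with another Series aligns on the row index, not on col1, so it relies on df1 and df2 having their rows in the same order. Also, once the gaps are filled, the merge requires every non-null optional column in df1 to match, which is stricter than the OR condition in the SQL from the question. If the OR semantics matter, one possible alternative is to merge on col1 alone and then filter. The sketch below assumes the output should keep df2's values; the suffixes and the names match2 and match3 are chosen only for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({
    'col1': ['a', 'b', 'c', 'd'],
    'optional_col2': ['X', None, 'Z', 'V'],
    'optional_col3': [None, 'def', 'ghi', 'jkl'],
})
df2 = pd.DataFrame({
    'col1': ['a', 'b', 'c', 'd'],
    'optional_col2': ['X', 'Y', 'Z', 'W'],
    'optional_col3': ['abc', 'def', 'ghi', 'mno'],
})

# Join on col1 only; the overlapping optional columns get a suffix per source frame.
merged = df1.merge(df2, on='col1', suffixes=('_df1', '_df2'))

# Element-wise equality; a missing value in df1 compares as False, so it never matches.
match2 = merged['optional_col2_df1'] == merged['optional_col2_df2']
match3 = merged['optional_col3_df1'] == merged['optional_col3_df2']

# Keep rows where at least one optional column matches, and report df2's values.
result = (
    merged.loc[match2 | match3, ['col1', 'optional_col2_df2', 'optional_col3_df2']]
    .rename(columns={
        'optional_col2_df2': 'optional_col2',
        'optional_col3_df2': 'optional_col3',
    })
    .reset_index(drop=True)
)
print(result)
```

Because the join key is only col1, the intermediate frame can grow large if col1 contains many repeated values in both frames.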
