
Merging dataframes without duplicating columns

I have 3 dataframes with different structures, where one contains the 2 keys that link it to the other two:

df1 = id1  id2    df2 = id1  a   b1  c1  c2    df3 = id2 a   b1   b2   c1
      1                 1    1a  1b1 1c1 1c2         11  11a 11b1 11b2 11c1
           11           2    2a  2b1 2c1 2c2         12  12a 12b1 12b2 12c1
           12           3    3a  3b1 3c1 3c2         13  13a 13b1 13b2 13c1
           13                                        14  14a 14b1 14b2 14c1
      2                                              21  21a 21b1 21b2 21c1
           21                                        22  22a 22b1 22b2 22c1
           22                                        23  23a 23b1 23b2 23c1
                                                     31  31a 31b1 31b2 31c1

Then I merge df1 with df2:

df1 = pd.merge(df1, df2, on='id1', how='left')

df1 = id1  id2  a   b1  c1  c2
      1         1a  1b1 1c1 1c2
           11   nan nan nan nan
           12   nan nan nan nan
           13   nan nan nan nan
      2         2a  2b1 2c1 2c2
           21   nan nan nan nan
           22   nan nan nan nan

But when I merge with df3, I get:

df1 = pd.merge(df1, df3, on='id2', how='left')

df1 = id1  id2   a_x  b1_x  c1_x  c2   a_y  b1_y  b2   c1_y  
      1          1a   1b1   1c1   1c2
           11    nan  nan   nan   nan  11a  11b1  11b2 11c1
           12    nan  nan   nan   nan  12a  12b1  12b2 12c1
           13    nan  nan   nan   nan  13a  13b1  13b2 13c1
      2          2a   2b1   2c1   2c2
           21    nan  nan   nan   nan  21a  21b1  21b2 21c1
           22    nan  nan   nan   nan  22a  22b1  22b2 22c1

In a nutshell, when there are overlapping columns between the dataframes being merged, the method creates new columns with suffixes. However, I want the values to be replaced when the columns coincide.
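A minimal sketch of that suffix behaviour, using two small hypothetical frames (`left` and `right` are illustrative names, not the data above). It only shows that the suffixes come from the shared column name, regardless of whether the left-hand values are missing:

```python
import pandas as pd

# Hypothetical frames: both carry a non-key column named "a".
left = pd.DataFrame({"id2": [11, 12], "a": [None, None]})
right = pd.DataFrame({"id2": [11, 12], "a": ["11a", "12a"]})

# Because "a" exists on both sides and is not a merge key,
# pandas keeps both copies and appends its default suffixes.
merged = pd.merge(left, right, on="id2", how="left")
print(list(merged.columns))  # ['id2', 'a_x', 'a_y']
```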

What I'm trying to get is this:

df1 = id1  id2   a    b1    c1    c2   b2   
      1          1a   1b1   1c1   1c2
           11    11a  11b1  11c1       11b2
           12    12a  12b1  12c1       12b2
           13    13a  13b1  13c1       13b2
      2          2a   2b1   2c1   2c2
           21    21a  21b1  21c1       21b2
           22    22a  22b1  22c1       22b2

I also tried fillna('') before merging the second time, but I got the same result.

Try the following:

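df1 = pd.merge(df1, df3, on='id2', how='left')
# prefer the df3 value (_y) and fall back to the df2 value (_x)
df1['a'] = df1['a_y'].fillna(df1['a_x'])
df1['b1'] = df1['b1_y'].fillna(df1['b1_x'])
df1['c1'] = df1['c1_y'].fillna(df1['c1_x'])
# drop the suffixed helper columns
df1 = df1.drop(columns=['a_x', 'a_y', 'b1_x', 'b1_y', 'c1_x', 'c1_y'])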
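The per-column fillna pattern above could be generalized so the overlapping columns don't have to be listed by hand. The following is only a sketch under assumptions: `merge_coalesce` is a hypothetical helper, not a pandas function, the sample frames are a trimmed-down version of the question's data, and it assumes no existing column already ends in `_x` or `_y`.

```python
import numpy as np
import pandas as pd

def merge_coalesce(left, right, on):
    """Hypothetical helper: left-merge `right` into `left`; for columns
    present in both, take the right-hand value and fall back to the left."""
    overlap = [c for c in right.columns if c in left.columns and c != on]
    merged = pd.merge(left, right, on=on, how="left", suffixes=("_x", "_y"))
    for col in overlap:
        merged[col] = merged[col + "_y"].fillna(merged[col + "_x"])
        merged = merged.drop(columns=[col + "_x", col + "_y"])
    return merged

# Trimmed-down sample data (assumed, for illustration only).
df1 = pd.DataFrame({"id1": [1, np.nan, 2, np.nan],
                    "id2": [np.nan, 11, np.nan, 21]})
df2 = pd.DataFrame({"id1": [1.0, 2.0], "a": ["1a", "2a"], "c2": ["1c2", "2c2"]})
df3 = pd.DataFrame({"id2": [11.0, 21.0], "a": ["11a", "21a"],
                    "b2": ["11b2", "21b2"]})

result = merge_coalesce(merge_coalesce(df1, df2, "id1"), df3, "id2")
print(result)
```

Note that the coalesced columns end up at the right-hand side of the frame; reorder them afterwards if the column order matters.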

This is a surprisingly difficult problem in pandas. I've been trying to deal with it as well. One option is to create a separate dataframe for each individual merge, and then concat them together. I don't think that's too "workaround-y":

df_m1 = pd.merge(df1, df2, on='id1', how='inner')  # note it's an inner merge
df_m2 = pd.merge(df1, df3, on='id2', how='inner')
df1 = pd.concat([df_m1, df_m2])

However, there is one problem: any rows in df1 that you wanted to keep but that couldn't be merged with df2 or df3 are dropped in the example above. You'll have to add them manually. At this point, it would be great if you could just add the rows whose indexes aren't in df_m1 or df_m2, but the problem is that merging doesn't preserve the indexes (see: here), which complicates this even further.

So you could modify the above to:

df_m1 = pd.merge(df1, df2, on='id1', how='inner')  # note it's an inner merge
df_m2 = pd.merge(df1, df3, on='id2', how='inner')
df1 = pd.concat([df_m1, df_m2, df1[~df1.id1.isin(df2.id1) & ~df1.id2.isin(df3.id2)])

It would be nice if there were a better way to do the last part. The above is also loopable if you need to merge an arbitrary number of dataframes.


EDIT: Alternatively, in the general case where you want to merge more than 3 dataframes, it helps to do the last part with indexes. You can do the following:

df1['old_index'] = df1.index  # this will let you keep the index
df_m1 = pd.merge(df1, df2, on='id1', how='inner')  # note it's an inner merge
df_m2 = pd.merge(df1, df3, on='id2', how='inner')
df_other = df1[~df1.old_index.isin(pd.concat([df_m1, df_m2]).old_index)]


df1 = pd.concat([df_m1, df_m2, df_other])

This would be much easier to put in a loop.
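As a rough sketch of what such a loop might look like (`merge_all` is a hypothetical helper name, and the sample frames are assumed for illustration), the old_index idea can be wrapped like this:

```python
import numpy as np
import pandas as pd

def merge_all(base, others):
    """Hypothetical helper. `others` is a non-empty list of (frame, key) pairs.
    Inner-merges `base` with each frame, then appends the unmatched rows."""
    base = base.copy()
    base["old_index"] = base.index  # a regular column survives the merges
    pieces = [pd.merge(base, frame, on=key, how="inner") for frame, key in others]
    matched = pd.concat(pieces)["old_index"]
    pieces.append(base[~base["old_index"].isin(matched)])
    return pd.concat(pieces, ignore_index=True).drop(columns="old_index")

# Assumed sample data: the last two rows of df1 match neither df2 nor df3.
df1 = pd.DataFrame({"id1": [1, np.nan, np.nan, 4],
                    "id2": [np.nan, 11, 99, np.nan]})
df2 = pd.DataFrame({"id1": [1.0, 2.0], "a": ["1a", "2a"]})
df3 = pd.DataFrame({"id2": [11.0, 12.0], "a": ["11a", "12a"],
                    "b2": ["11b2", "12b2"]})

result = merge_all(df1, [(df2, "id1"), (df3, "id2")])
print(result)
```

One consequence of this design: a row of the base frame that matches more than one of the other frames will appear once per match in the result, so it may need deduplicating depending on the data.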
