
Merging dataframes without duplicating columns

I have three dataframes with different structures. One of them contains the two keys that link it to the other two:

df1 = id1  id2    df2 = id1  a   b1  c1  c2    df3 = id2 a   b1   b2   c1
      1                 1    1a  1b1 1c1 1c2         11  11a 11b1 11b2 11c1
           11           2    2a  2b1 2c1 2c2         12  12a 12b1 12b2 12c1
           12           3    3a  3b1 3c1 3c2         13  13a 13b1 13b2 13c1
           13                                        14  14a 14b1 14b2 14c1
      2                                              21  21a 21b1 21b2 21c1
           21                                        22  22a 22b1 22b2 22c1
           22                                        23  23a 23b1 23b2 23c1
                                                     31  31a 31b1 31b2 31c1

Then I merge df1 with df2 :

df1 = pd.merge(df1, df2, on='id1', how='left')

df1 = id1  id2  a   b1  c1  c2
      1         1a  1b1 1c1 1c2
           11   nan nan nan nan
           12   nan nan nan nan
           13   nan nan nan nan
      2         2a  2b1 2c1 2c2
           21   nan nan nan nan
           22   nan nan nan nan

But when I merge with df3 I have:

df1 = pd.merge(df1, df3, on='id2', how='left')

df1 = id1  id2   a_x  b1_x  c1_x  c2   a_y  b1_y  b2   c1_y  
      1          1a   1b1   1c1   1c2
           11    nan  nan   nan   nan  11a  11b1  11b2 11c1
           12    nan  nan   nan   nan  12a  12b1  12b2 12c1
           13    nan  nan   nan   nan  13a  13b1  13b2 13c1
      2          2a   2b1   2c1   2c2
           21    nan  nan   nan   nan  21a  21b1  21b2 21c1
           22    nan  nan   nan   nan  22a  22b1  22b2 22c1

In a nutshell, when the dataframes being merged have overlapping columns, merge creates separate columns with the suffixes _x and _y. However, I want the values to be combined into a single column when the column names coincide.

What I'm trying to get is this:

df1 = id1  id2   a    b1    c1    c2   b2   
      1          1a   1b1   1c1   1c2
           11    11a  11b1  11c1       11b2
           12    12a  12b1  12c1       12b2
           13    13a  13b1  13c1       13b2
      2          2a   2b1   2c1   2c2
           21    21a  21b1  21c1       21b2
           22    22a  22b1  22c1       22b2

I also tried fillna('') before merging the second time, but I get the same result.

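Try it like this: do the merge, then fill each overlapping column from its suffixed pair and drop the suffixed columns:

df1 = pd.merge(df1, df3, on='id2', how='left')
# take the value from df3 (_y) and fall back to the existing one (_x)
df1['a'] = df1['a_y'].fillna(df1['a_x'])
df1['b1'] = df1['b1_y'].fillna(df1['b1_x'])
df1['c1'] = df1['c1_y'].fillna(df1['c1_x'])
df1 = df1.drop(columns=['a_x', 'a_y', 'b1_x', 'b1_y', 'c1_x', 'c1_y'])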
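
If there are many overlapping columns, the same fillna idea can be wrapped in a loop. The following is only a sketch: merge_coalesce is a hypothetical helper name (not a pandas function), the small frames are made-up sample data in the shape of the question, and it assumes the default _x/_y suffixes do not clash with existing column names.

```python
import pandas as pd

def merge_coalesce(left, right, on, how="left"):
    """Hypothetical helper: merge, then collapse each _x/_y pair into one column."""
    merged = pd.merge(left, right, on=on, how=how, suffixes=("_x", "_y"))
    keys = [on] if isinstance(on, str) else list(on)
    # Non-key columns present in both frames are the ones merge suffixes.
    overlap = [c for c in left.columns if c in right.columns and c not in keys]
    for col in overlap:
        # Prefer the value from `right`, fall back to `left` where it is missing.
        merged[col] = merged[col + "_y"].fillna(merged[col + "_x"])
        merged = merged.drop(columns=[col + "_x", col + "_y"])
    return merged

# Made-up sample data, shaped like the question (keys kept as floats because of NaN).
df1 = pd.DataFrame({"id1": [1.0, None, None], "id2": [None, 11.0, 12.0]})
df2 = pd.DataFrame({"id1": [1.0], "a": ["1a"], "b1": ["1b1"], "c2": ["1c2"]})
df3 = pd.DataFrame({"id2": [11.0, 12.0], "a": ["11a", "12a"],
                    "b1": ["11b1", "12b1"], "b2": ["11b2", "12b2"]})

result = merge_coalesce(merge_coalesce(df1, df2, on="id1"), df3, on="id2")
print(result)
```

Note that the coalesced columns end up at the right-hand side of the frame, so reorder them afterwards if the column order matters.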

This is a surprisingly difficult problem in pandas. I've been trying to deal with it as well. One option is to create a separate dataframe for each individual merge, and then concat them together. I don't think that's too "workaround-y":

df_m1 = pd.merge(df1, df2, on='id1', how='inner')  # note it's an inner merge
df_m2 = pd.merge(df1, df3, on='id2', how='inner')
df1 = pd.concat([df_m1, df_m2])

However, there is one problem: any rows in df1 that could not be merged with df2 or df3 are dropped by the inner merges above, so if you want to keep them you have to add them back manually. It would be convenient to just add the rows whose index isn't in df_m1 or df_m2, but merging doesn't preserve the index, which complicates this even further.

So you could modify the above to:

df_m1 = pd.merge(df1, df2, on='id1', how='inner')  # note it's an inner merge
df_m2 = pd.merge(df1, df3, on='id2', how='inner')
df1 = pd.concat([df_m1, df_m2, df1[~df1.id1.isin(df2.id1) & ~df1.id2.isin(df3.id2)])

It would be nice if there were a better way to do the last part. The approach above can also be put in a loop if you need to merge an arbitrary number of dataframes.


EDIT: Alternatively, in the general case where you want to merge more than 3 dataframes, it helps to do the last part with indexes. You can do the following:

df1['old_index'] = df1.index  # this will let you keep the index
df_m1 = pd.merge(df1, df2, on='id1', how='inner')  # note it's an inner merge
df_m2 = pd.merge(df1, df3, on='id2', how='inner')
df_other = df1[~df1.old_index.isin(pd.concat([df_m1, df_m2]).old_index)]


df1 = pd.concat([df_m1, df_m2, df_other])

This would be much easier to put in a loop.
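
For what it's worth, here is roughly what that loop could look like. It is a sketch under assumptions: merge_each_then_concat is a hypothetical helper name, others is assumed to be a list of (dataframe, key column) pairs, and the sample frames are made up. It also sorts by old_index at the end to restore the original row order, which the snippets above don't do.

```python
import pandas as pd

def merge_each_then_concat(base, others):
    """Hypothetical helper; `others` is a list of (frame, key_column) pairs."""
    base = base.copy()
    base["old_index"] = base.index  # a regular column survives the merge
    # One inner merge per frame, so no overlapping-column suffixes are created.
    pieces = [pd.merge(base, frame, on=key, how="inner") for frame, key in others]
    matched = pd.concat(pieces)["old_index"]
    # Rows of `base` that matched none of the frames are added back as-is.
    leftover = base[~base["old_index"].isin(matched)]
    out = pd.concat(pieces + [leftover]).sort_values("old_index")
    return out.drop(columns="old_index").reset_index(drop=True)

# Made-up sample data; the last row of df1 matches neither df2 nor df3.
df1 = pd.DataFrame({"id1": [1.0, None, None, None],
                    "id2": [None, 11.0, 12.0, 99.0]})
df2 = pd.DataFrame({"id1": [1.0], "a": ["1a"], "b1": ["1b1"], "c2": ["1c2"]})
df3 = pd.DataFrame({"id2": [11.0, 12.0], "a": ["11a", "12a"],
                    "b1": ["11b1", "12b1"], "b2": ["11b2", "12b2"]})

result = merge_each_then_concat(df1, [(df2, "id1"), (df3, "id2")])
print(result)
```

One caveat: a row of df1 that matches more than one of the frames would appear once per match, so this only fits cases like the question's, where each row carries a single key.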
