
Merging two dataframes with hierarchical columns

This is my first time using multi-indexing in pandas, and I need some help merging two dataframes with hierarchical columns. Here are my two dataframes:

import numpy as np
import pandas as pd
col_index = pd.MultiIndex.from_product([['a', 'b', 'c'], ['w', 'x']])
df1 = pd.DataFrame(np.ones([4,6]),columns=col_index, index=range(4))

     a         b         c     
     w    x    w    x    w    x
0  1.0  1.0  1.0  1.0  1.0  1.0
1  1.0  1.0  1.0  1.0  1.0  1.0
2  1.0  1.0  1.0  1.0  1.0  1.0
3  1.0  1.0  1.0  1.0  1.0  1.0

df2 = pd.DataFrame(np.zeros([2,6]),columns=col_index, index=range(2))

     a         b         c     
     w    x    w    x    w    x
0  0.0  0.0  0.0  0.0  0.0  0.0
1  0.0  0.0  0.0  0.0  0.0  0.0

When I use the merge method, I get the following result:

pd.merge(df1, df2, how='left', suffixes=('', '_2'), left_index=True, right_index=True)

     a         b         c       a_2       b_2       c_2     
     w    x    w    x    w    x    w    x    w    x    w    x
0  1.0  1.0  1.0  1.0  1.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0
1  1.0  1.0  1.0  1.0  1.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0
2  1.0  1.0  1.0  1.0  1.0  1.0  NaN  NaN  NaN  NaN  NaN  NaN
3  1.0  1.0  1.0  1.0  1.0  1.0  NaN  NaN  NaN  NaN  NaN  NaN

But I would like to merge the two dataframes on the lower level, with the suffixes applied to ['w', 'x'], as in the following:

     a                   b                   c               
     w  w_2    x  x_2    w  w_2    x  x_2    w  w_2    x  x_2
0  1.0  0.0  1.0  0.0  1.0  0.0  1.0  0.0  1.0  0.0  1.0  0.0
1  1.0  0.0  1.0  0.0  1.0  0.0  1.0  0.0  1.0  0.0  1.0  0.0
2  1.0  NaN  1.0  NaN  1.0  NaN  1.0  NaN  1.0  NaN  1.0  NaN
3  1.0  NaN  1.0  NaN  1.0  NaN  1.0  NaN  1.0  NaN  1.0  NaN

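As your output shows, the suffixes are applied to the outermost column level, so move the level you want suffixed to the front first. You can use join or merge together with swaplevel() or reorder_levels(), then swap the levels back after joining. Finally, call .sort_index() with axis=1 to sort the columns, so that each w/w_2 and x/x_2 pair ends up under its parent label.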

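  • .join() is the more concise choice when you are merging on the index, as here: it joins on the index and does a left join by default.
  • .swaplevel() is simpler when there are only two levels (as in this case), since it needs no arguments, while .reorder_levels() is better suited to 3 or more levels, where you spell out the full level order.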
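
To illustrate the second point, here is a minimal sketch of the three-level case, where .reorder_levels() moves the innermost level to the front and back again. The third level with labels 'p' and 'q', and the names cols, left, right and out, are made up for this illustration and are not part of the question:

```python
import numpy as np
import pandas as pd

# Hypothetical three-level columns; 'p' and 'q' are invented labels for the innermost level.
cols = pd.MultiIndex.from_product([['a', 'b'], ['w', 'x'], ['p', 'q']])
left = pd.DataFrame(np.ones([4, 8]), columns=cols, index=range(4))
right = pd.DataFrame(np.zeros([2, 8]), columns=cols, index=range(2))

out = (left.reorder_levels([2, 0, 1], axis=1)      # innermost level moved to the front
       .join(right.reorder_levels([2, 0, 1], axis=1), rsuffix='_2')
       .reorder_levels([1, 2, 0], axis=1)          # restore the original level order
       .sort_index(axis=1))
print(out.columns.tolist()[:4])
```

Note that the order passed to the second .reorder_levels() call is the inverse of the first one, not the same list. With only two levels the two happen to coincide ([1, 0] both times), which is why the examples below can reuse the same call.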

Below are the four combinations of these methods. For this specific example, I think .join() with .swaplevel() is the most idiomatic pandas (see the final example):

df3 = (df1.reorder_levels([1,0],axis=1)
       .join(df2.reorder_levels([1,0],axis=1), rsuffix='_2')
       .reorder_levels([1,0],axis=1).sort_index(axis=1, level=[0, 1]))
df3
Out[1]: 
     a                   b                   c               
     w  w_2    x  x_2    w  w_2    x  x_2    w  w_2    x  x_2
0  1.0  0.0  1.0  0.0  1.0  0.0  1.0  0.0  1.0  0.0  1.0  0.0
1  1.0  0.0  1.0  0.0  1.0  0.0  1.0  0.0  1.0  0.0  1.0  0.0
2  1.0  NaN  1.0  NaN  1.0  NaN  1.0  NaN  1.0  NaN  1.0  NaN
3  1.0  NaN  1.0  NaN  1.0  NaN  1.0  NaN  1.0  NaN  1.0  NaN

df3 = (pd.merge(df1.reorder_levels([1,0],axis=1),
                df2.reorder_levels([1,0],axis=1),
                how='left', left_index=True, right_index=True, suffixes = ('', '_2'))
                .reorder_levels([1,0],axis=1).sort_index(axis=1, level=[0, 1]))
df3
Out[2]: 
     a                   b                   c               
     w  w_2    x  x_2    w  w_2    x  x_2    w  w_2    x  x_2
0  1.0  0.0  1.0  0.0  1.0  0.0  1.0  0.0  1.0  0.0  1.0  0.0
1  1.0  0.0  1.0  0.0  1.0  0.0  1.0  0.0  1.0  0.0  1.0  0.0
2  1.0  NaN  1.0  NaN  1.0  NaN  1.0  NaN  1.0  NaN  1.0  NaN
3  1.0  NaN  1.0  NaN  1.0  NaN  1.0  NaN  1.0  NaN  1.0  NaN

df3 = (pd.merge(df1.swaplevel(axis=1),
                df2.swaplevel(axis=1),
                how='left', left_index=True, right_index=True, suffixes = ('', '_2'))
                .swaplevel(axis=1).sort_index(axis=1, level=[0, 1]))
df3
Out[3]: 
     a                   b                   c               
     w  w_2    x  x_2    w  w_2    x  x_2    w  w_2    x  x_2
0  1.0  0.0  1.0  0.0  1.0  0.0  1.0  0.0  1.0  0.0  1.0  0.0
1  1.0  0.0  1.0  0.0  1.0  0.0  1.0  0.0  1.0  0.0  1.0  0.0
2  1.0  NaN  1.0  NaN  1.0  NaN  1.0  NaN  1.0  NaN  1.0  NaN
3  1.0  NaN  1.0  NaN  1.0  NaN  1.0  NaN  1.0  NaN  1.0  NaN

df3 = (df1.swaplevel(i=0,j=1, axis=1)
       .join(df2.swaplevel(axis=1), rsuffix='_2')
       .swaplevel(axis=1).sort_index(axis=1, level=[0, 1]))
df3
Out[4]: 
     a                   b                   c               
     w  w_2    x  x_2    w  w_2    x  x_2    w  w_2    x  x_2
0  1.0  0.0  1.0  0.0  1.0  0.0  1.0  0.0  1.0  0.0  1.0  0.0
1  1.0  0.0  1.0  0.0  1.0  0.0  1.0  0.0  1.0  0.0  1.0  0.0
2  1.0  NaN  1.0  NaN  1.0  NaN  1.0  NaN  1.0  NaN  1.0  NaN
3  1.0  NaN  1.0  NaN  1.0  NaN  1.0  NaN  1.0  NaN  1.0  NaN
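
If you need this in more than one place, the .join() / .swaplevel() variant can be wrapped in a small helper. This is only a sketch for two-level columns; the function name join_with_inner_suffix and the variable via_merge are my own and not part of pandas:

```python
import numpy as np
import pandas as pd

def join_with_inner_suffix(left, right, rsuffix='_2'):
    """Left-join on the index, putting rsuffix on the inner column level.

    Hypothetical helper; assumes both frames have exactly two column levels.
    """
    joined = left.swaplevel(axis=1).join(right.swaplevel(axis=1), rsuffix=rsuffix)
    return joined.swaplevel(axis=1).sort_index(axis=1)

col_index = pd.MultiIndex.from_product([['a', 'b', 'c'], ['w', 'x']])
df1 = pd.DataFrame(np.ones([4, 6]), columns=col_index, index=range(4))
df2 = pd.DataFrame(np.zeros([2, 6]), columns=col_index, index=range(2))

df3 = join_with_inner_suffix(df1, df2)

# Cross-check against the merge-based variant shown earlier.
via_merge = (pd.merge(df1.swaplevel(axis=1), df2.swaplevel(axis=1),
                      how='left', left_index=True, right_index=True,
                      suffixes=('', '_2'))
             .swaplevel(axis=1).sort_index(axis=1))
print(df3.equals(via_merge))
```

The cross-check is there because .equals() treats NaN values in the same position as equal, so it is a convenient way to confirm that the join and merge variants agree on the unmatched rows as well.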
