Merge two dataframes (both have a multi-index)

Question

I have a problem similar to "Merge two dataframes with multi-index".

in:

import pandas as pd
import numpy as np
row_x1 = ['a1','b1','c1']
row_x2 = ['a2','b2','c2']
row_x3 = ['a3','b3','c3']
row_x4 = ['a4','b4','c4']
index_arrays = [np.array(['first', 'first', 'second', 'second']), np.array(['one','two','one','two'])]
df1 = pd.DataFrame([row_x1,row_x2,row_x3,row_x4], columns=list('ABC'), index=index_arrays)
print(df1)

out:

             A   B   C
first  one  a1  b1  c1
       two  a2  b2  c2
second one  a3  b3  c3
       two  a4  b4  c4

in:

row_y1 = ['d1','e1','f1']
row_y2 = ['d2','e2','f2']
row_y3 = ['d3','e3','f3']
index_arrays = [np.array(['first','first', 'second',]), np.array(['one','three','two'])]
df2 = pd.DataFrame([row_y1,row_y2,row_y3], columns=list('DEF'), index=index_arrays)
print(df2)

out:

               D   E   F
first  one    d1  e1  f1
       three  d2  e2  f2
second two    d3  e3  f3

In other words, how can I merge them to get df3 (shown below)?

in:

row_x1 = ['a1','b1','c1']
row_x2 = ['a2','b2','c2']
row_x3 = ['a3','b3','c3']
row_x4 = ['a4','b4','c4']
row_y1 = ['d1','e1','f1']
row_y2 = ['d2','e2','f2']
row_y3 = ['d3','e3','f3']

row_z1 = row_x1 + row_y1
row_z2 = row_x2 + [np.nan, np.nan, np.nan]
row_z3 = [np.nan, np.nan, np.nan] + row_y2
row_z4 = row_x3 + [np.nan, np.nan, np.nan]
row_z5 = row_x4 + row_y3
index_arrays = [np.array(['first', 'first', 'first', 'second', 'second']), np.array(['one','two','three','one','two'])]
df3 = pd.DataFrame([row_z1,row_z2,row_z3,row_z4,row_z5], columns=list('ABCDEF'), index=index_arrays)
print(df3)

out:

                A    B    C    D    E    F
first  one     a1   b1   c1   d1   e1   f1
       two     a2   b2   c2  NaN  NaN  NaN
       three  NaN  NaN  NaN   d2   e2   f2
second one     a3   b3   c3  NaN  NaN  NaN
       two     a4   b4   c4   d3   e3   f3

PS. Thanks to @Andreuccio for the original question!

Thanks @Ajay Verma and @EBDS. Those are indeed solutions for manually created data, but I am still confused about the following situation:

I have two dataframes built from statistics. I copied the corresponding slice of each one to pass to pd.merge().

in:

df1 = data1[data1.index.get_level_values(0) == 'BASIC_GZAG_TMB'].copy()

out:

                         0       1       2       3
BASIC_GZAG_TMB 1     127.0   179.0   190.0   239.0
               2      38.0    23.0    21.0    29.0
               3      37.0    27.0    32.0    37.0
               4       5.0    14.0    11.0    23.0
               5      31.0    56.0    41.0    65.0
               7     389.0   258.0   337.0   243.0
               NaN  1323.0  1388.0  1307.0  1311.0

in:

df2 = data2[data2.index.get_level_values(0) == 'BASIC_GZAG_TMB'].copy()

out:

                         0       1       2       3
BASIC_GZAG_TMB 1     207.0   232.0   252.0   223.0
               2      26.0    18.0    19.0    20.0
               3      43.0    41.0    50.0    42.0
               4      35.0    27.0    37.0    15.0
               5      54.0    52.0    78.0    64.0
               6       1.0  1306.0     1.0     4.0
               7     206.0   263.0   227.0   230.0
               NaN  1374.0  1306.0  1282.0  1348.0

Then I merged df1 and df2 by:

df1.merge(df2, left_index=True, right_index=True, how='outer')

out:

                       0_x     1_x     2_x     3_x     0_y     1_y     2_y  \
BASIC_GZAG_TMB 1     127.0   179.0   190.0   239.0   207.0   232.0   252.0   
               2      38.0    23.0    21.0    29.0    26.0    18.0    19.0   
               3      37.0    27.0    32.0    37.0    43.0    41.0    50.0   
               4       5.0    14.0    11.0    23.0    35.0    27.0    37.0   
               5      31.0    56.0    41.0    65.0    54.0    52.0    78.0   
               7     389.0   258.0   337.0   243.0   206.0   263.0   227.0   
               NaN  1323.0  1388.0  1307.0  1311.0  1374.0  1306.0  1282.0   

                       3_y  
BASIC_GZAG_TMB 1     223.0  
               2      20.0  
               3      42.0  
               4      15.0  
               5      64.0  
               7     230.0  
               NaN  1348.0

I am confused that index 6, which exists in df2, is missing from the result.

I know that using df2.merge(df1, ...) could be a workaround. But data1 and data2 are generated dynamically, so I don't know in advance which one has more index values. I just want the union of df1 and df2.

Answer 1

You can use pandas merge for this; see the pandas documentation for DataFrame.merge.

df = df1.merge(df2, left_index=True, right_index=True, how='outer')
print(df)

Output

                A    B    C    D    E    F
first  one     a1   b1   c1   d1   e1   f1
       three  NaN  NaN  NaN   d2   e2   f2
       two     a2   b2   c2  NaN  NaN  NaN
second one     a3   b3   c3  NaN  NaN  NaN
       two     a4   b4   c4   d3   e3   f3
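
As a cross-check on the merge call above, here is a minimal sketch that rebuilds the question's df1 and df2 in compact form and compares the outer merge with DataFrame.join(how="outer"). Using join here is a suggestion added for illustration, not something from the answer; for an index-on-index outer join the two calls should give the same rows.

```python
import pandas as pd

# Rebuild the question's df1 and df2 in compact form.
idx1 = pd.MultiIndex.from_tuples(
    [("first", "one"), ("first", "two"), ("second", "one"), ("second", "two")]
)
idx2 = pd.MultiIndex.from_tuples(
    [("first", "one"), ("first", "three"), ("second", "two")]
)
df1 = pd.DataFrame(
    [["a1", "b1", "c1"], ["a2", "b2", "c2"], ["a3", "b3", "c3"], ["a4", "b4", "c4"]],
    columns=list("ABC"),
    index=idx1,
)
df2 = pd.DataFrame(
    [["d1", "e1", "f1"], ["d2", "e2", "f2"], ["d3", "e3", "f3"]],
    columns=list("DEF"),
    index=idx2,
)

# Outer merge on both indexes, as in the answer.
merged = df1.merge(df2, left_index=True, right_index=True, how="outer")

# Shorter spelling of the same index-on-index outer join.
joined = df1.join(df2, how="outer")

print(joined)
```

Both results contain the union of the two indexes, with NaN where one side has no matching row.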

Answer 2

If you need to sort by number words (one, two, three, ...):

1. Use the pandas merge command.
2. Sort the index with key= set to a vectorized parse.

Code:

from number_parser import parse
dfx = (
    df1.merge(df2,left_index=True,right_index=True,how='outer')
    .sort_index(key=lambda x: np.vectorize(parse)(x).astype(float)) )

Another example:

You may need to install number_parser:

!pip install number_parser
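
If installing an extra package is not an option, a plain dictionary can stand in for the parser. The sketch below is an alternative added for illustration, not the answer's method: RANK is a hypothetical hand-written mapping that covers only the labels in the question's data, and the key= argument of sort_index needs pandas 1.1 or later.

```python
import pandas as pd

# Rebuild the question's df1 and df2 in compact form.
df1 = pd.DataFrame(
    [["a1", "b1", "c1"], ["a2", "b2", "c2"], ["a3", "b3", "c3"], ["a4", "b4", "c4"]],
    columns=list("ABC"),
    index=pd.MultiIndex.from_tuples(
        [("first", "one"), ("first", "two"), ("second", "one"), ("second", "two")]
    ),
)
df2 = pd.DataFrame(
    [["d1", "e1", "f1"], ["d2", "e2", "f2"], ["d3", "e3", "f3"]],
    columns=list("DEF"),
    index=pd.MultiIndex.from_tuples(
        [("first", "one"), ("first", "three"), ("second", "two")]
    ),
)

# Hypothetical lookup table: every label in either index level needs an entry,
# because labels missing from the dict would map to NaN.
RANK = {"first": 1, "second": 2, "one": 1, "two": 2, "three": 3}

merged = df1.merge(df2, left_index=True, right_index=True, how="outer")

# For a MultiIndex, sort_index applies the key function to each level in turn.
ordered = merged.sort_index(key=lambda level: level.map(RANK))

print(ordered)
```

With this key, the rows under "first" come out as one, two, three instead of the alphabetical one, three, two.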

Update:

As I don't have the new data, I used the original data to test the "missing 6". I also changed the column names to be the same and added a NaN index.

data1 = df1.copy(deep=True)
data2 = df2.copy(deep=True)
df1 = data1[data1.index.get_level_values(0) == 'first'].copy()
df2 = data2[data2.index.get_level_values(0) == 'first'].copy()

dfx = df1.merge(df2, left_index=True, right_index=True, how='outer').sort_index(
        key=lambda x: np.vectorize(parse)(x)
        )

As you can see, no values are missing. The problem probably does not lie in the merge itself; you need to inspect the source data that gives rise to this situation.
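
One way to inspect the source data is sketched below. The frames left and right are hypothetical, shortened stand-ins for the asker's df1 and df2 (a single made-up column named count, no NaN label), so this does not reproduce or explain the missing row. It only shows checks that can be run on the real frames: which index entries exist on one side only, what dtype each index level has, and which side each merged row came from.

```python
import pandas as pd

name = "BASIC_GZAG_TMB"

# Hypothetical, shortened stand-ins for the asker's df1 and df2.
left = pd.DataFrame(
    {"count": [127.0, 38.0, 389.0]},
    index=pd.MultiIndex.from_tuples([(name, 1), (name, 2), (name, 7)]),
)
right = pd.DataFrame(
    {"count": [207.0, 1.0, 206.0]},
    index=pd.MultiIndex.from_tuples([(name, 1), (name, 6), (name, 7)]),
)

# Index entries that appear on one side only.
only_in_right = right.index.difference(left.index)
only_in_left = left.index.difference(right.index)
print(only_in_right.tolist(), only_in_left.tolist())

# Level dtypes of each index; a mismatch between the two frames
# (for example integers on one side, strings on the other) is worth ruling out.
for frame in (left, right):
    print([level.dtype for level in frame.index.levels])

# indicator=True adds a "_merge" column naming the source of each row.
merged = left.merge(
    right, left_index=True, right_index=True, how="outer", indicator=True
)
print(merged)
```

In this stand-in data the row for 6 survives the outer merge and is flagged right_only. If the same checks on the real data behave differently, the index labels themselves are the first thing to look at.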
