简体   繁体   中英

how to merge/concat/join 2 dataframes with a non-unique multi-index to reconcile the content?

I have 2 below dataframes from 2 sources, the 3 white columns are indexes. These are from 2 reports about historical trades. the trades can only be compared when 3 columns "Trade date" "Exchange Instrument" and "Prompt date" are the same. "Trade date" is because they were report in chronological order. and the futures contracts are only the same when "Exchange Instrument" and "Prompt date" are the same.

df1

df2

I simply want to merge the 2 dfs so that there will be 7 columns with the same 3 indexes. it maybe challenging that the values are not unique for the same index on both reports: for example for the 1st August, CMX Cu contract with prompt 2020-03-01, there have been 3 trades and different prices:

I tried concat and merge, but never get a desired df.. for example while trying

df_complete= pd.concat([df_ctrm_timelined, df_broker_timelined],axis=1) 

I get

ValueError: cannot handle a non-unique multi-index!

if you need the raw data, these are the first 10 rows of both dfs, the 2 dfs all have the same number of rows

df_broker_timelined[:10]

                                           broker_Lots  broker_TradePrice  \
Trade Date Exchange Instrument PromptDate                                   
2019-08-01 CMX Cu              2020-03-01           -1             2.6840   
                               2020-03-01           -1             2.6865   
                               2020-03-01           -2             2.6870   
                               2019-09-01            1             2.6640   
                               2019-09-01            1             2.6665   
                               2019-09-01            2             2.6670   
           LME Al              2019-10-16            6          1777.5000   
                               2019-11-01           -3          1779.0000   
                               2019-11-01           -1          1779.0000   
                               2019-11-01           -2          1779.0000  



                                           broker_Quantity  
Trade Date Exchange Instrument PromptDate                   
2019-08-01 CMX Cu              2020-03-01           -25000  
                               2020-03-01           -25000  
                               2020-03-01           -50000  
                               2019-09-01            25000  
                               2019-09-01            25000  
                               2019-09-01            50000  
           LME Al              2019-10-16              150  
                               2019-11-01              -75  
                               2019-11-01              -25  
                               2019-11-01              -50  

df_ctrm_timelined[:10]

                                           ctrm_TradePrice  ctrm_Lots  \
Trade Date Exchange Instrument PromptDate                               
2019-08-01 CMX Cu              2019-09-30           2.6640          1   
                               2019-09-30           2.6665          1   
                               2019-09-30           2.6670          2   
                               2020-03-31           2.6840         -1   
                               2020-03-31           2.6865         -1   
                               2020-03-31           2.6870         -2   
           LME Al              2019-10-16        1777.5000          6   
                               2019-11-01        1792.5000          3   
                               2019-11-01        1792.5000          3   
                               2019-11-01        1781.5000         -6   

                                           ctrm_Quantity    Strategy  
Trade Date Exchange Instrument PromptDate                             
2019-08-01 CMX Cu              2019-09-30          25000  Strategy 1  
                               2019-09-30          25000  Strategy 1  
                               2019-09-30          50000  Strategy 1  
                               2020-03-31         -25000  Strategy 1  
                               2020-03-31         -25000  Strategy 1  
                               2020-03-31         -50000  Strategy 1  
           LME Al              2019-10-16            150  Strategy 2  
                               2019-11-01             75  Strategy 2  
                               2019-11-01             75  Strategy 2  
                               2019-11-01           -150  Strategy 2 

Dealing with non-unique indexing:

from seaborn import load_dataset

#Create one dataframe with unique indexes, set multiindex
df = load_dataset('tips')
df = df.set_index(['day', 'time', 'sex', 'smoker'])
#Create a unique label per inner most index
df = df.set_index(df.groupby(level=[0,1,2,3]).cumcount(), append=True)

#Create second dataframe 
df2 = df * 2

#Use join 
df.join(df2, lsuffix='_2')

Output:

                              total_bill_2  tip_2  size_2  total_bill    tip  size
day  time   sex    smoker                                                         
Sun  Dinner Female No     0          16.99   1.01       2       33.98   2.02     4
            Male   No     0          10.34   1.66       3       20.68   3.32     6
                          1          21.01   3.50       3       42.02   7.00     6
                          2          23.68   3.31       2       47.36   6.62     4
            Female No     1          24.59   3.61       4       49.18   7.22     8
...                                    ...    ...     ...         ...    ...   ...
Sat  Dinner Male   No     30         29.03   5.92       3       58.06  11.84     6
            Female Yes    14         27.18   2.00       2       54.36   4.00     4
            Male   Yes    26         22.67   2.00       2       45.34   4.00     4
                   No     31         17.82   1.75       2       35.64   3.50     4
Thur Dinner Female No     0          18.78   3.00       2       37.56   6.00     4

[244 rows x 6 columns]

Here's an example using the "tips" dataset:

from seaborn import load_dataset

#Create one dataframe with unique indexes, set multiindex
df = load_dataset('tips')
df = df.set_index(['day', 'time', 'sex', 'smoker'])
df = df.groupby(level=[0,1,2,3]).first().dropna(how='all')

#Create second dataframe 
df2 = df * 2

#Use join 
df.join(df2, lsuffix='_2')

Output:

                           total_bill_2  tip_2  size_2  total_bill    tip  size
day  time   sex    smoker                                                      
Thur Lunch  Male   Yes            19.44   3.00     2.0       38.88   6.00   4.0
                   No             27.20   4.00     4.0       54.40   8.00   8.0
            Female Yes            19.81   4.19     2.0       39.62   8.38   4.0
                   No             10.07   1.83     1.0       20.14   3.66   2.0
     Dinner Female No             18.78   3.00     2.0       37.56   6.00   4.0
Fri  Lunch  Male   Yes            12.16   2.20     2.0       24.32   4.40   4.0
            Female Yes            13.42   3.48     2.0       26.84   6.96   4.0
                   No             15.98   3.00     3.0       31.96   6.00   6.0
     Dinner Male   Yes            28.97   3.00     2.0       57.94   6.00   4.0
                   No             22.49   3.50     2.0       44.98   7.00   4.0
            Female Yes             5.75   1.00     2.0       11.50   2.00   4.0
                   No             22.75   3.25     2.0       45.50   6.50   4.0
Sat  Dinner Male   Yes            38.01   3.00     4.0       76.02   6.00   8.0
                   No             20.65   3.35     3.0       41.30   6.70   6.0
            Female Yes             3.07   1.00     1.0        6.14   2.00   2.0
                   No             20.29   2.75     2.0       40.58   5.50   4.0
Sun  Dinner Male   Yes             7.25   5.15     2.0       14.50  10.30   4.0
                   No             10.34   1.66     3.0       20.68   3.32   6.0
            Female Yes            17.51   3.00     2.0       35.02   6.00   4.0
                   No             16.99   1.01     2.0       33.98   2.02   4.0

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM