
Efficiently growing a DataFrame in pandas

On each iteration of a loop, I generate a DataFrame that looks like this:

              RIC RICRoot ISIN ExpirationDate                      Exchange           ...            OpenInterest  BlockVolume  TotalVolume2  SecurityDescription  SecurityLongDescription
closingDate                                                                           ...                                                                                                 
2018-03-15   SPH0      SP          2020-03-20  CME:Index and Options Market           ...                     NaN         None          None       SP500 IDX MAR0                     None
2018-03-16   SPH0      SP          2020-03-20  CME:Index and Options Market           ...                     NaN         None          None       SP500 IDX MAR0                     None
2018-03-19   SPH0      SP          2020-03-20  CME:Index and Options Market           ...                     NaN         None          None       SP500 IDX MAR0                     None
2018-03-20   SPH0      SP          2020-03-20  CME:Index and Options Market           ...                     NaN         None          None       SP500 IDX MAR0                     None
2018-03-21   SPH0      SP          2020-03-20  CME:Index and Options Market           ...                     NaN         None          None       SP500 IDX MAR0                     None

I then give it MultiIndex columns:

tmp.columns = pd.MultiIndex.from_arrays([[contract] * len(tmp.columns), tmp.columns.tolist()])

Here contract is just the reference name for that data, which appears in the output below as SPH0:

    SPH0                                                                     ...                                                                                            
              RIC RICRoot ISIN ExpirationDate                      Exchange           ...           OpenInterest BlockVolume TotalVolume2 SecurityDescription SecurityLongDescription
closingDate                                                                           ...                                                                                            
2018-03-15   SPH0      SP          2020-03-20  CME:Index and Options Market           ...                    NaN        None         None      SP500 IDX MAR0                    None
2018-03-16   SPH0      SP          2020-03-20  CME:Index and Options Market           ...                    NaN        None         None      SP500 IDX MAR0                    None
2018-03-19   SPH0      SP          2020-03-20  CME:Index and Options Market           ...                    NaN        None         None      SP500 IDX MAR0                    None
2018-03-20   SPH0      SP          2020-03-20  CME:Index and Options Market           ...                    NaN        None         None      SP500 IDX MAR0                    None
2018-03-21   SPH0      SP          2020-03-20  CME:Index and Options Market           ...                    NaN        None         None      SP500 IDX MAR0                    None

I currently have a very inefficient way of merging these DataFrames:

if df is None:
    df = tmp
else:
    df = df.merge(tmp, how='outer', left_index=True, right_index=True)

This is incredibly slow. I want to store all of these temporary DataFrames in a mapping keyed by their contract name, and be able to reference their data easily and in a vectorized manner. What is the optimal solution? Does it matter whether the result grows horizontally or vertically?

If I understand correctly, you can just use pd.concat(), passing it your list of DataFrames and the keys for the resulting MultiIndex DataFrame. Take the following sample DataFrames:

import pandas as pd

df1 = pd.DataFrame([                                                                                            
['2018-03-11',   'SPH0',      'SP',          '2020-03-20',  'CME:Index and Options Market'],
['2018-03-12',   'SPH0',      'SP',          '2020-03-20',  'CME:Index and Options Market'],
['2018-03-15',   'SPH0',      'SP',          '2020-03-20',  'CME:Index and Options Market'],
['2018-03-23',   'SPH0',      'SP',          '2020-03-20',  'CME:Index and Options Market'],
['2018-03-24',   'SPH0',      'SP',          '2020-03-20',  'CME:Index and Options Market']],
columns=['closingDate', 'RIC', 'RICRoot', 'ExpirationDate', 'Exchange'])

df2 = pd.DataFrame([                                                                                            
['2018-03-15',   'HAB3',      'HA',          '2020-03-20',  'CME:Index and Options Market'],
['2018-03-16',   'HAB3',      'HA',          '2020-03-20',  'CME:Index and Options Market'],
['2018-03-22',   'HAB3',      'HA',          '2020-03-20',  'CME:Index and Options Market'],
['2018-03-24',   'HAB3',      'HA',          '2020-03-20',  'CME:Index and Options Market'],
['2018-03-20',   'HAB3',      'HA',          '2020-03-20',  'CME:Index and Options Market']],
columns=['closingDate', 'RIC', 'RICRoot', 'ExpirationDate', 'Exchange'])

df3 = pd.DataFrame([                                                                                            
['2018-03-15',   'UHA6',      'UH',          '2020-03-20',  'CME:Index and Options Market'],
['2018-03-16',   'UHA6',      'UH',          '2020-03-20',  'CME:Index and Options Market'],
['2018-03-18',   'UHA6',      'UH',          '2020-03-20',  'CME:Index and Options Market'],
['2018-03-20',   'UHA6',      'UH',          '2020-03-20',  'CME:Index and Options Market'],
['2018-03-21',   'UHA6',      'UH',          '2020-03-20',  'CME:Index and Options Market']],
columns=['closingDate', 'RIC', 'RICRoot', 'ExpirationDate', 'Exchange'])

Now call pd.concat():

pd.concat([df1, df2, df3], keys=['SPH0','HAB3','UHA6'])

This yields:

       closingDate              ...                                   Exchange
SPH0 0  2018-03-11              ...               CME:Index and Options Market
     1  2018-03-12              ...               CME:Index and Options Market
     2  2018-03-15              ...               CME:Index and Options Market
     3  2018-03-23              ...               CME:Index and Options Market
     4  2018-03-24              ...               CME:Index and Options Market
HAB3 0  2018-03-15              ...               CME:Index and Options Market
     1  2018-03-16              ...               CME:Index and Options Market
     2  2018-03-22              ...               CME:Index and Options Market
     3  2018-03-24              ...               CME:Index and Options Market
     4  2018-03-20              ...               CME:Index and Options Market
UHA6 0  2018-03-15              ...               CME:Index and Options Market
     1  2018-03-16              ...               CME:Index and Options Market
     2  2018-03-18              ...               CME:Index and Options Market
     3  2018-03-20              ...               CME:Index and Options Market
     4  2018-03-21              ...               CME:Index and Options Market
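Once the frames are stacked this way, each contract can be pulled back out by its key. The following is a minimal sketch rather than part of the original example: it uses two small toy frames in place of df1, df2 and df3, and it assumes you set closingDate as the index first so the dates become the inner index level. The level names passed to names= are illustrative.

```python
import pandas as pd

# Toy frames standing in for df1/df2 above (values are illustrative only).
frames = {
    "SPH0": pd.DataFrame({"closingDate": ["2018-03-11", "2018-03-12"],
                          "RIC": ["SPH0", "SPH0"]}),
    "HAB3": pd.DataFrame({"closingDate": ["2018-03-15", "2018-03-16"],
                          "RIC": ["HAB3", "HAB3"]}),
}

# Use closingDate as the index so it becomes the inner level of the result.
stacked = pd.concat(
    [frame.set_index("closingDate") for frame in frames.values()],
    keys=list(frames.keys()),
    names=["contract", "closingDate"],
)

# All rows for one contract; the outer level is dropped from the result.
sph0 = stacked.loc["SPH0"]

# One date across every contract, via a cross-section on the inner level.
on_15th = stacked.xs("2018-03-15", level="closingDate")
```

Here sph0 is indexed by closingDate alone, while on_15th is indexed by contract.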

You can also use a list comprehension to create the list of DataFrames to pass to pd.concat(), for example:

my_keys = ['SPH0','HAB3','UHA6']
dfs = [create_df(key) for key in my_keys]
pd.concat(dfs, keys=my_keys)

Here create_df() is a function that returns a DataFrame.
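If you would rather keep the layout from the question, with the contract name as the outer column level and closingDate as the row index, pd.concat() also accepts a dict and an axis=1 argument. The sketch below is only an illustration under assumptions: create_df() here is a hypothetical stand-in that returns a tiny frame already indexed by closingDate, and the column values are made up.

```python
import pandas as pd

def create_df(contract):
    # Hypothetical stand-in for whatever builds each per-contract frame.
    dates = pd.Index(["2018-03-15", "2018-03-16"], name="closingDate")
    return pd.DataFrame(
        {"RIC": [contract] * 2, "RICRoot": [contract[:2]] * 2},
        index=dates,
    )

my_keys = ["SPH0", "HAB3", "UHA6"]

# Collect every frame first, then concatenate once instead of merging in a loop.
frames = {key: create_df(key) for key in my_keys}

# A dict supplies the keys itself; axis=1 puts them on the outer column level
# and aligns rows on the union of the indexes (join="outer" is the default).
wide = pd.concat(frames, axis=1)

# All columns for one contract.
sph0 = wide["SPH0"]

# One field across all contracts.
rics = wide.xs("RIC", axis=1, level=1)
```

Either way, the point is to build all the pieces first and combine them in a single call, rather than merging into a growing DataFrame on every iteration.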


 