简体   繁体   中英

Pandas concat list of dataframes and add identifier efficiently

Say i have a dataframe of stock prices with multi index, that i want to normalize into one dataframe, but with an extra column indicating which stock it belongs to. eg:

在此处输入图像描述

import pandas as pd
from pandas import Timestamp
tickers = ['TSLA','AAPL','MSFT']

# data below tuncated for example
new_data = pd.DataFrame({('AAPL',
  'Open'): {Timestamp('2021-01-15 13:40:00-0500', tz='America/New_York'): 128.7292938232422, Timestamp('2021-01-15 15:05:00-0500', tz='America/New_York'): 127.69000244140625, Timestamp('2021-01-15 15:10:00-0500', tz='America/New_York'): 127.54000091552734, Timestamp('2021-01-15 13:05:00-0500', tz='America/New_York'): 128.31570434570312, Timestamp('2021-01-15 11:20:00-0500', tz='America/New_York'): 128.7292938232422},
 ('AAPL',
  'High'): {Timestamp('2021-01-15 13:40:00-0500', tz='America/New_York'): 128.74000549316406, Timestamp('2021-01-15 15:05:00-0500', tz='America/New_York'): 127.69999694824219, Timestamp('2021-01-15 15:10:00-0500', tz='America/New_York'): 127.845703125, Timestamp('2021-01-15 13:05:00-0500', tz='America/New_York'): 128.4199981689453, Timestamp('2021-01-15 11:20:00-0500', tz='America/New_York'): 128.92999267578125},
 ('AAPL',
  'Low'): {Timestamp('2021-01-15 13:40:00-0500', tz='America/New_York'): 128.6199951171875, Timestamp('2021-01-15 15:05:00-0500', tz='America/New_York'): 127.48999786376953, Timestamp('2021-01-15 15:10:00-0500', tz='America/New_York'): 127.52999877929688, Timestamp('2021-01-15 13:05:00-0500', tz='America/New_York'): 128.30999755859375, Timestamp('2021-01-15 11:20:00-0500', tz='America/New_York'): 128.60000610351562},
 ('AAPL',
  'Close'): {Timestamp('2021-01-15 13:40:00-0500', tz='America/New_York'): 128.67999267578125, Timestamp('2021-01-15 15:05:00-0500', tz='America/New_York'): 127.54000091552734, Timestamp('2021-01-15 15:10:00-0500', tz='America/New_York'): 127.80999755859375, Timestamp('2021-01-15 13:05:00-0500', tz='America/New_York'): 128.37989807128906, Timestamp('2021-01-15 11:20:00-0500', tz='America/New_York'): 128.875},
 ('AAPL',
  'Adj Close'): {Timestamp('2021-01-15 13:40:00-0500', tz='America/New_York'): 128.67999267578125, Timestamp('2021-01-15 15:05:00-0500', tz='America/New_York'): 127.54000091552734, Timestamp('2021-01-15 15:10:00-0500', tz='America/New_York'): 127.80999755859375, Timestamp('2021-01-15 13:05:00-0500', tz='America/New_York'): 128.37989807128906, Timestamp('2021-01-15 11:20:00-0500', tz='America/New_York'): 128.875},
 ('AAPL',
  'Volume'): {Timestamp('2021-01-15 13:40:00-0500', tz='America/New_York'): 354071, Timestamp('2021-01-15 15:05:00-0500', tz='America/New_York'): 1278221, Timestamp('2021-01-15 15:10:00-0500', tz='America/New_York'): 1156859, Timestamp('2021-01-15 13:05:00-0500', tz='America/New_York'): 383722, Timestamp('2021-01-15 11:20:00-0500', tz='America/New_York'): 1283176},
 ('TSLA',
  'Open'): {Timestamp('2021-01-15 13:40:00-0500', tz='America/New_York'): 842.9199829101562, Timestamp('2021-01-15 15:05:00-0500', tz='America/New_York'): 824.4307861328125, Timestamp('2021-01-15 15:10:00-0500', tz='America/New_York'): 820.8300170898438, Timestamp('2021-01-15 13:05:00-0500', tz='America/New_York'): 834.3262939453125, Timestamp('2021-01-15 11:20:00-0500', tz='America/New_York'): 844.3800048828125},
 ('TSLA',
  'High'): {Timestamp('2021-01-15 13:40:00-0500', tz='America/New_York'): 844.5, Timestamp('2021-01-15 15:05:00-0500', tz='America/New_York'): 824.919921875, Timestamp('2021-01-15 15:10:00-0500', tz='America/New_York'): 824.9169921875, Timestamp('2021-01-15 13:05:00-0500', tz='America/New_York'): 835.7000122070312, Timestamp('2021-01-15 11:20:00-0500', tz='America/New_York'): 845.8494262695312},
 ('TSLA',
  'Low'): {Timestamp('2021-01-15 13:40:00-0500', tz='America/New_York'): 842.0, Timestamp('2021-01-15 15:05:00-0500', tz='America/New_York'): 819.0999755859375, Timestamp('2021-01-15 15:10:00-0500', tz='America/New_York'): 820.5632934570312, Timestamp('2021-01-15 13:05:00-0500', tz='America/New_York'): 832.0999755859375, Timestamp('2021-01-15 11:20:00-0500', tz='America/New_York'): 842.1199951171875},
 ('TSLA',
  'Close'): {Timestamp('2021-01-15 13:40:00-0500', tz='America/New_York'): 842.7081298828125, Timestamp('2021-01-15 15:05:00-0500', tz='America/New_York'): 820.8599853515625, Timestamp('2021-01-15 15:10:00-0500', tz='America/New_York'): 823.3134765625, Timestamp('2021-01-15 13:05:00-0500', tz='America/New_York'): 835.1749877929688, Timestamp('2021-01-15 11:20:00-0500', tz='America/New_York'): 843.85009765625},
 ('TSLA',
  'Adj Close'): {Timestamp('2021-01-15 13:40:00-0500', tz='America/New_York'): 842.7081298828125, Timestamp('2021-01-15 15:05:00-0500', tz='America/New_York'): 820.8599853515625, Timestamp('2021-01-15 15:10:00-0500', tz='America/New_York'): 823.3134765625, Timestamp('2021-01-15 13:05:00-0500', tz='America/New_York'): 835.1749877929688, Timestamp('2021-01-15 11:20:00-0500', tz='America/New_York'): 843.85009765625},
 ('TSLA',
  'Volume'): {Timestamp('2021-01-15 13:40:00-0500', tz='America/New_York'): 290117, Timestamp('2021-01-15 15:05:00-0500', tz='America/New_York'): 900962, Timestamp('2021-01-15 15:10:00-0500', tz='America/New_York'): 482207, Timestamp('2021-01-15 13:05:00-0500', tz='America/New_York'): 331725, Timestamp('2021-01-15 11:20:00-0500', tz='America/New_York'): 485683},
 ('MSFT',
  'Open'): {Timestamp('2021-01-15 13:40:00-0500', tz='America/New_York'): 213.8800048828125, Timestamp('2021-01-15 15:05:00-0500', tz='America/New_York'): 213.39999389648438, Timestamp('2021-01-15 15:10:00-0500', tz='America/New_York'): 213.14999389648438, Timestamp('2021-01-15 13:05:00-0500', tz='America/New_York'): 213.07000732421875, Timestamp('2021-01-15 11:20:00-0500', tz='America/New_York'): 213.55999755859375},
 ('MSFT',
  'High'): {Timestamp('2021-01-15 13:40:00-0500', tz='America/New_York'): 213.89500427246094, Timestamp('2021-01-15 15:05:00-0500', tz='America/New_York'): 213.4499969482422, Timestamp('2021-01-15 15:10:00-0500', tz='America/New_York'): 213.74000549316406, Timestamp('2021-01-15 13:05:00-0500', tz='America/New_York'): 213.5500030517578, Timestamp('2021-01-15 11:20:00-0500', tz='America/New_York'): 213.8800048828125},
 ('MSFT',
  'Low'): {Timestamp('2021-01-15 13:40:00-0500', tz='America/New_York'): 213.69009399414062, Timestamp('2021-01-15 15:05:00-0500', tz='America/New_York'): 213.11900329589844, Timestamp('2021-01-15 15:10:00-0500', tz='America/New_York'): 213.14999389648438, Timestamp('2021-01-15 13:05:00-0500', tz='America/New_York'): 213.0399932861328, Timestamp('2021-01-15 11:20:00-0500', tz='America/New_York'): 213.5},
 ('MSFT',
  'Close'): {Timestamp('2021-01-15 13:40:00-0500', tz='America/New_York'): 213.77999877929688, Timestamp('2021-01-15 15:05:00-0500', tz='America/New_York'): 213.1699981689453, Timestamp('2021-01-15 15:10:00-0500', tz='America/New_York'): 213.69000244140625, Timestamp('2021-01-15 13:05:00-0500', tz='America/New_York'): 213.49000549316406, Timestamp('2021-01-15 11:20:00-0500', tz='America/New_York'): 213.80499267578125},
 ('MSFT',
  'Adj Close'): {Timestamp('2021-01-15 13:40:00-0500', tz='America/New_York'): 213.77999877929688, Timestamp('2021-01-15 15:05:00-0500', tz='America/New_York'): 213.1699981689453, Timestamp('2021-01-15 15:10:00-0500', tz='America/New_York'): 213.69000244140625, Timestamp('2021-01-15 13:05:00-0500', tz='America/New_York'): 213.49000549316406, Timestamp('2021-01-15 11:20:00-0500', tz='America/New_York'): 213.80499267578125},
 ('MSFT',
  'Volume'): {Timestamp('2021-01-15 13:40:00-0500', tz='America/New_York'): 80589, Timestamp('2021-01-15 15:05:00-0500', tz='America/New_York'): 147625, Timestamp('2021-01-15 15:10:00-0500', tz='America/New_York'): 294613, Timestamp('2021-01-15 13:05:00-0500', tz='America/New_York'): 151667, Timestamp('2021-01-15 11:20:00-0500', tz='America/New_York'): 284443}})
In [10]: new_data.columns
Out[10]: 
MultiIndex([('AAPL',      'Open'),
            ('AAPL',      'High'),
            ('AAPL',       'Low'),
            ('AAPL',     'Close'),
            ('AAPL', 'Adj Close'),
            ('AAPL',    'Volume'),
            ('TSLA',      'Open'),
            ('TSLA',      'High'),
            ('TSLA',       'Low'),
            ('TSLA',     'Close'),
            ('TSLA', 'Adj Close'),
            ('TSLA',    'Volume'),
            ('MSFT',      'Open'),
            ('MSFT',      'High'),
            ('MSFT',       'Low'),
            ('MSFT',     'Close'),
            ('MSFT', 'Adj Close'),
            ('MSFT',    'Volume')],
           )

Intended Result

在此处输入图像描述

Code for this example below.

combined = pd.read_json('{"Datetime":{"0":1610726400000,"1":1610726700000,"2":1610727000000,"3":1610727300000,"4":1610727600000},"Open":{"0":842.3516845703,"1":837.1900024414,"2":838.5949707031,"3":840.8345947266,"4":844.3800048828},"High":{"0":843.8900146484,"1":839.7600097656,"2":841.7399902344,"3":844.8800048828,"4":845.8494262695},"Low":{"0":836.5900268555,"1":835.0,"2":838.2700195312,"3":839.3001098633,"4":842.1199951172},"Close":{"0":837.1259765625,"1":838.5020141602,"2":841.0,"3":844.25,"4":843.8500976562},"Adj Close":{"0":837.1259765625,"1":838.5020141602,"2":841.0,"3":844.25,"4":843.8500976562},"Volume":{"0":0,"1":636383,"2":401164,"3":514418,"4":485683},"ticker":{"0":"TSLA","1":"TSLA","2":"TSLA","3":"TSLA","4":"TSLA"}}')

Current Method

Currently I'm manually looping through another list tickers above, using .get() to get the df with only that tickers columns, then appending to a list and re-concatenating everything as below

temp = []
for ticker in tickers:
    stock_df = new_data.get(ticker)
    if stock_df is not None:
        stock_df['ticker'] = ticker
        temp.append(stock_df.reset_index())
combined = pd.concat(temp, ignore_index=True)

Question - How to Improve?

This is incredibly inefficient, and i know there should be some magic pandas command that can do this at one go without loops. reset_index() has not worked for me..

Thank you!

You can use .stack on level=0 to reshape the dataframe, then .sort and rename the index:

new_data[tickers].stack(level=0).sort_index(level=[1, 0])\
                 .rename_axis(['Datetime', 'Ticker']).reset_index()

                    Datetime Ticker   Adj Close       Close        High         Low        Open   Volume
0  2021-01-15 11:20:00-05:00   AAPL  128.875000  128.875000  128.929993  128.600006  128.729294  1283176
1  2021-01-15 13:05:00-05:00   AAPL  128.379898  128.379898  128.419998  128.309998  128.315704   383722
2  2021-01-15 13:40:00-05:00   AAPL  128.679993  128.679993  128.740005  128.619995  128.729294   354071
3  2021-01-15 15:05:00-05:00   AAPL  127.540001  127.540001  127.699997  127.489998  127.690002  1278221
4  2021-01-15 15:10:00-05:00   AAPL  127.809998  127.809998  127.845703  127.529999  127.540001  1156859
5  2021-01-15 11:20:00-05:00   MSFT  213.804993  213.804993  213.880005  213.500000  213.559998   284443
6  2021-01-15 13:05:00-05:00   MSFT  213.490005  213.490005  213.550003  213.039993  213.070007   151667
7  2021-01-15 13:40:00-05:00   MSFT  213.779999  213.779999  213.895004  213.690094  213.880005    80589
8  2021-01-15 15:05:00-05:00   MSFT  213.169998  213.169998  213.449997  213.119003  213.399994   147625
9  2021-01-15 15:10:00-05:00   MSFT  213.690002  213.690002  213.740005  213.149994  213.149994   294613
10 2021-01-15 11:20:00-05:00   TSLA  843.850098  843.850098  845.849426  842.119995  844.380005   485683
11 2021-01-15 13:05:00-05:00   TSLA  835.174988  835.174988  835.700012  832.099976  834.326294   331725
12 2021-01-15 13:40:00-05:00   TSLA  842.708130  842.708130  844.500000  842.000000  842.919983   290117
13 2021-01-15 15:05:00-05:00   TSLA  820.859985  820.859985  824.919922  819.099976  824.430786   900962
14 2021-01-15 15:10:00-05:00   TSLA  823.313477  823.313477  824.916992  820.563293  820.830017   482207

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM