Fastest way to turn a 2D DataFrame into a Series

I have a large DataFrame containing stock returns by symbol and date. Something like this:

      2018-10-06  2018-11-17  2018-12-29  ...  2020-09-19  2020-10-31  2020-12-12
BIOL      -15.33      -22.05       84.85  ...      -10.37       11.20      274.15
SRDX      -11.67      -16.84       12.06  ...       -4.66        4.43       17.36
LPTH       -2.65      -19.02        2.68  ...       -1.63       21.58       32.08
VHI        -4.91       -8.50       55.96  ...       -4.18       25.68        0.12
THMO       21.21      -41.98       30.01  ...      -33.89        2.99       39.29

I need to convert it into a single DataFrame column with only one data point per row. Like this:

(2018-10-06 00:00:00, BIOL)  -15.33
(2018-10-06 00:00:00, SRDX)  -11.67
(2018-10-06 00:00:00, LPTH)   -2.65
(2018-10-06 00:00:00, VHI)    -4.91
(2018-10-06 00:00:00, THMO)   21.21
...                             ...
(2020-12-12 00:00:00, BIOL)  274.15
(2020-12-12 00:00:00, SRDX)   17.36
(2020-12-12 00:00:00, LPTH)   32.08
(2020-12-12 00:00:00, VHI)     0.12
(2020-12-12 00:00:00, THMO)   39.29

My code works, but it is slow. What is the fastest way to do this?

Sample code:

import pandas as pd
from pandas import Timestamp

df = pd.DataFrame.from_dict({
    Timestamp('2018-10-06 00:00:00')   : {
        'BIOL': -15.33, 'SRDX': -11.67, 'LPTH': -2.65, 'VHI': -4.91, 'THMO': 21.21
    }, Timestamp('2018-11-17 00:00:00'): {
        'BIOL': -22.05, 'SRDX': -16.84, 'LPTH': -19.02, 'VHI': -8.5, 'THMO': -41.98
    }, Timestamp('2018-12-29 00:00:00'): {
        'BIOL': 84.85, 'SRDX': 12.06, 'LPTH': 2.68, 'VHI': 55.96, 'THMO': 30.01
    }, Timestamp('2019-02-09 00:00:00'): {
        'BIOL': 31.15, 'SRDX': -22.09, 'LPTH': -0.65, 'VHI': -23.89, 'THMO': -13.54
    }, Timestamp('2019-03-23 00:00:00'): {
        'BIOL': -11.25, 'SRDX': 8.56, 'LPTH': 1.97, 'VHI': 5.26, 'THMO': -12.0
    }, Timestamp('2019-05-04 00:00:00'): {
        'BIOL': -26.29, 'SRDX': -8.73, 'LPTH': -40.7, 'VHI': -6.99, 'THMO': 5.68
    }, Timestamp('2019-06-15 00:00:00'): {
        'BIOL': -2.55, 'SRDX': -2.47, 'LPTH': -17.32, 'VHI': -5.88, 'THMO': 3.58
    }, Timestamp('2019-07-27 00:00:00'): {
        'BIOL': -37.91, 'SRDX': 11.61, 'LPTH': -1.32, 'VHI': 1.98, 'THMO': 41.87
    }, Timestamp('2019-09-07 00:00:00'): {
        'BIOL': -27.24, 'SRDX': 0.45, 'LPTH': -3.47, 'VHI': -4.29, 'THMO': 29.51
    }, Timestamp('2019-10-19 00:00:00'): {
        'BIOL': -20.43, 'SRDX': -8.95, 'LPTH': -24.03, 'VHI': -7.5, 'THMO': -39.17
    }, Timestamp('2019-11-30 00:00:00'): {
        'BIOL': 5.47, 'SRDX': -0.74, 'LPTH': 20.0, 'VHI': -4.89, 'THMO': 21.98
    }, Timestamp('2020-01-11 00:00:00'): {
        'BIOL': 24.12, 'SRDX': -9.33, 'LPTH': 110.61, 'VHI': -14.29, 'THMO': 15.74
    }, Timestamp('2020-02-22 00:00:00'): {
        'BIOL': -68.06, 'SRDX': -8.63, 'LPTH': -25.18, 'VHI': -28.96, 'THMO': -19.74
    }, Timestamp('2020-04-04 00:00:00'): {
        'BIOL': 65.43, 'SRDX': 5.53, 'LPTH': 106.73, 'VHI': -20.26, 'THMO': 105.19
    }, Timestamp('2020-05-16 00:00:00'): {
        'BIOL': 25.47, 'SRDX': 22.79, 'LPTH': 40.93, 'VHI': 2.14, 'THMO': -27.56
    }, Timestamp('2020-06-27 00:00:00'): {
        'BIOL': -7.96, 'SRDX': 8.95, 'LPTH': -3.63, 'VHI': 7.95, 'THMO': 17.65
    }, Timestamp('2020-08-08 00:00:00'): {
        'BIOL': -32.86, 'SRDX': -18.31, 'LPTH': -16.1, 'VHI': 31.41, 'THMO': -48.59
    }, Timestamp('2020-09-19 00:00:00'): {
        'BIOL': -10.37, 'SRDX': -4.66, 'LPTH': -1.63, 'VHI': -4.18, 'THMO': -33.89
    }, Timestamp('2020-10-31 00:00:00'): {
        'BIOL': 11.2, 'SRDX': 4.43, 'LPTH': 21.58, 'VHI': 25.68, 'THMO': 2.99
    }, Timestamp('2020-12-12 00:00:00'): {
        'BIOL': 274.15, 'SRDX': 17.36, 'LPTH': 32.08, 'VHI': 0.12, 'THMO': 39.29
    }
})

print(df)

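# Slow approach: build a dict keyed by (date, symbol), one column at a time.
# DataFrame.iteritems() was removed in pandas 2.0; items() is the replacement.
d = {}
for date, col in df.items():
    d.update({(date, symbol): pct for symbol, pct in col.items()})
df2 = pd.DataFrame.from_dict(d, orient='index')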

print(df2)

Use DataFrame.unstack to get a Series with a MultiIndex whose first level holds the datetimes:

s = df.unstack()
print(s)

2018-10-06  BIOL    -15.33
            SRDX    -11.67
            LPTH     -2.65
            VHI      -4.91
            THMO     21.21
                     ...
2020-12-12  BIOL    274.15
            SRDX     17.36
            LPTH     32.08
            VHI       0.12
            THMO     39.29
Length: 100, dtype: float64
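
Because the question asks for a single-column DataFrame rather than a Series, the result can be wrapped with Series.to_frame. A minimal sketch on a tiny made-up frame; the symbols 'AAA' and 'BBB', the values, and the column name 'pct' are illustrative assumptions, not taken from the question:

```python
import pandas as pd

# Tiny illustrative frame: symbols as rows, dates as columns (made-up values).
small = pd.DataFrame(
    {pd.Timestamp('2020-01-01'): {'AAA': 1.0, 'BBB': 2.0},
     pd.Timestamp('2020-02-01'): {'AAA': 3.0, 'BBB': 4.0}}
)

# With flat row and column indexes, unstack() returns a Series indexed by
# (column label, row label), which here means (date, symbol).
s_small = small.unstack()

# Wrap the Series in a one-column DataFrame; 'pct' is a hypothetical name.
df_small = s_small.to_frame('pct')
print(df_small)
```

The MultiIndex is kept as the row index of the new frame, so each row is still identified by its (date, symbol) pair.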

Or, if you need the datetimes in the second level, use DataFrame.stack:

s = df.stack()
print(s)

BIOL  2018-10-06   -15.33
      2018-11-17   -22.05
      2018-12-29    84.85
      2019-02-09    31.15
      2019-03-23   -11.25
                    ...
THMO  2020-06-27    17.65
      2020-08-08   -48.59
      2020-09-19   -33.89
      2020-10-31     2.99
      2020-12-12    39.29
Length: 100, dtype: float64
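
The stack and unstack results hold the same values and differ only in level order and row order. A small sketch of that relationship, on made-up data (the symbols and values are assumptions for illustration):

```python
import pandas as pd

# Tiny illustrative frame: symbols as rows, dates as columns (made-up values).
small = pd.DataFrame(
    {pd.Timestamp('2020-01-01'): {'AAA': 1.0, 'BBB': 2.0},
     pd.Timestamp('2020-02-01'): {'AAA': 3.0, 'BBB': 4.0}}
)

by_date = small.unstack()    # index levels: (date, symbol)
by_symbol = small.stack()    # index levels: (symbol, date)

# Swapping the levels of one result and sorting both gives identical Series.
same = by_symbol.swaplevel().sort_index().equals(by_date.sort_index())
print(same)
```

Picking between the two is therefore mostly a question of which level you want to slice by first.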

A NumPy alternative: flatten the values with numpy.ravel, build the two index levels with numpy.repeat and numpy.tile, then pass everything to the Series constructor with MultiIndex.from_arrays:

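import numpy as np

# Row labels (symbols): each one repeated once per date column
r = np.repeat(df.index, len(df.columns))
# Column labels (dates): the whole sequence repeated once per row
c = np.tile(df.columns, len(df))
# Row-major (C order) flattening, so the values line up with r and c;
# order='F' would pair the values with the wrong labels
v = np.ravel(df.to_numpy())

s = pd.Series(v, index=pd.MultiIndex.from_arrays([r, c]))
print(s)

BIOL  2018-10-06    -15.33
      2018-11-17    -22.05
      2018-12-29     84.85
      2019-02-09     31.15
      2019-03-23    -11.25
                     ...
THMO  2020-06-27     17.65
      2020-08-08    -48.59
      2020-09-19    -33.89
      2020-10-31      2.99
      2020-12-12     39.29
Length: 100, dtype: float64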
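
Which variant is fastest depends on the shape of the data and the pandas version, so it is worth measuring. A rough sketch using the standard library's timeit on a randomly generated frame; the frame size (500 symbols by 200 dates), the symbol names, and the repeat count are arbitrary assumptions, and no timings are claimed here:

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical benchmark data: 500 symbols x 200 dates of random returns.
rng = np.random.default_rng(0)
dates = pd.date_range('2018-10-06', periods=200, freq='D')
symbols = [f'SYM{i}' for i in range(500)]
big = pd.DataFrame(rng.normal(size=(500, 200)), index=symbols, columns=dates)


def via_stack(frame):
    # Built-in reshape: index levels are (symbol, date).
    return frame.stack()


def via_numpy(frame):
    # Manual reshape in row-major order, matching the layout of stack().
    rows = np.repeat(frame.index, len(frame.columns))
    cols = np.tile(frame.columns, len(frame))
    values = frame.to_numpy().ravel()
    return pd.Series(values, index=pd.MultiIndex.from_arrays([rows, cols]))


# Check that both variants agree before comparing their speed.
agree = np.array_equal(via_stack(big).to_numpy(), via_numpy(big).to_numpy())
print('same values:', agree)

for func in (via_stack, via_numpy):
    seconds = timeit.timeit(lambda: func(big), number=10)
    print(f'{func.__name__}: {seconds / 10:.4f} s per call')
```

Checking that the outputs agree before timing guards against comparing a fast but misaligned result with a correct one.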
