
Pandas - multiple new columns based on cumprod and condition in other column

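I have a dataframe df with two columns, df["Period"] and df["Returns"]. The values in df["Period"] are 1, 2, 3, ..., n and are sorted in increasing order. I want to compute new columns using df["Returns"].cumprod, restricted to the rows where df["Period"] >= 1, 2, 3, and so on. Note that the number of rows per unique period varies and follows no pattern.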

So I end up with n new columns:

  • df["M_1"]: the cumprod of df["Returns"] over the rows where df["Period"] >= 1
  • df["M_2"]: the cumprod of df["Returns"] over the rows where df["Period"] >= 2
  • ...
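To make the definition concrete, here is a small sketch on made-up numbers (the `toy` frame is hypothetical, not data from the question). It uses the compounding form `(1 + r).cumprod() - 1` that the question's own code applies, and shows that rows outside the selected periods end up as NaN after assignment.

```python
import pandas as pd

# Hypothetical toy data, chosen so the arithmetic is easy to follow
toy = pd.DataFrame({"Period": [1, 1, 2, 3], "Returns": [0.10, 0.20, -0.50, 0.10]})

# M_2: compound only the rows where Period >= 2
mask = toy["Period"] >= 2
m_2 = (1 + toy.loc[mask, "Returns"]).cumprod() - 1

# Assignment aligns on the index, so rows with Period < 2 become NaN
toy["M_2"] = m_2
print(toy)
```

The two rows with Period 1 have no value in M_2, and the compounding restarts at the first row with Period 2.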

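Below is my working example. The implementation has two drawbacks:

  1. It is very slow when there are many unique periods
  2. It does not work with pandas method chaining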
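The second drawback can probably be worked around by wrapping the loop in a function and passing it to `DataFrame.pipe`. The following is only a sketch: the function name `add_cumulative_returns` and its parameters are made up for illustration, and it does nothing about the speed issue.

```python
import numpy as np
import pandas as pd

def add_cumulative_returns(frame, period_col="Period", return_col="Returns"):
    """Return a copy of `frame` with one M_<period> column per unique period."""
    new_cols = {
        # Each Series covers only the rows with period >= p;
        # assign() aligns on the index and fills the rest with NaN.
        f"M_{p}": (1 + frame.loc[frame[period_col] >= p, return_col]).cumprod() - 1
        for p in frame[period_col].unique()
    }
    return frame.assign(**new_cols)

# Sample data in the same shape as the question's, with a fixed seed
rng = np.random.default_rng(0)
sample = pd.DataFrame({
    "Period": np.sort(rng.integers(1, 5, 10)),
    "Returns": rng.standard_normal(10) / 100,
})

# The helper now fits into a method chain and leaves `sample` untouched
result = sample.pipe(add_cumulative_returns).round(6)
print(result)
```

Because `assign` returns a new frame, the input is not mutated, which is usually what a method chain expects.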

Any hints on how to speed this up and/or vectorize it are appreciated.

import numpy as np 
import pandas as pd

# Create sample data
n = 10
data = {"Period": np.sort(np.random.randint(1,5,n)),
        "Returns": np.random.randn(n)/100, }
df = pd.DataFrame(data)

# Slow implementation
periods = set(df["Period"])
for period in periods:
    cumret = (1 + df.query("Period >= @period")["Returns"]).cumprod() - 1
    df[f"M_{period}"] = cumret  # the loop variable is `period`; `month` was undefined
df.head()

This is the expected output:

   Period      Returns          M_1          M_2          M_3          M_4
0       1   -0.0268917   -0.0268917
1       1     0.018205  -0.00917625
2       2   0.00505662  -0.00416604   0.00505662
3       2 -8.28544e-05  -0.00424855   0.00497334
4       2   0.00127519  -0.00297878   0.00625488
5       3  -0.00224315  -0.00521524    0.0039977  -0.00224315
6       3   -0.0197291   -0.0248414   -0.0158103    -0.021928
7       3   0.00136592   -0.0235094   -0.0144659    -0.020592
8       4   0.00582897   -0.0178175  -0.00872129   -0.0148831   0.00582897
9       4   0.00260425   -0.0152597  -0.00613975   -0.0123176    0.0084484

Here is how your code performs on average over 10,000 iterations on my machine (Python 3.10.7, Pandas 1.4.3):

import statistics
import time

import numpy as np
import pandas as pd

elapsed_time = []
for _ in range(10_000):
    start_time = time.time()

    periods = set(df["Period"])
    for period in periods:
        cumret = (1 + df.query("Period >= @period")["Returns"]).cumprod() - 1
        df[f"M_{period}"] = cumret

    elapsed_time.append(time.time() - start_time)

print(f"--- {round(statistics.mean(elapsed_time), 6):2} seconds in average ---")
print(df)

Output:

--- 0.00298 seconds in average ---

   Period   Returns       M_1       M_2       M_4
0       1 -0.008427 -0.008427       NaN       NaN
1       1  0.019699  0.011106       NaN       NaN
2       2  0.012661  0.023908  0.012661       NaN
3       2 -0.005059  0.018728  0.007538       NaN
4       4  0.025452  0.044657  0.033182  0.025452
5       4  0.010808  0.055948  0.044349  0.036535
6       4  0.004843  0.061062  0.049407  0.041555
7       4  0.005791  0.067207  0.055484  0.047587
8       4 -0.001816  0.065269  0.053568  0.045685
9       4  0.014102  0.080291  0.068425  0.060431

With a few small modifications, you can get a roughly 3x speedup:

elapsed_time = []
for _ in range(10_000):
    start_time = time.time()

    for period in df["Period"].unique():
        df[f"M_{period}"] = (
            1 + df.loc[df["Period"].ge(period), "Returns"]
        ).cumprod() - 1

    elapsed_time.append(time.time() - start_time)

print(f"--- {round(statistics.mean(elapsed_time), 6):2} seconds in average ---")
print(df)

Output:

--- 0.001052 seconds in average ---

   Period   Returns       M_1       M_2       M_4
0       1 -0.008427 -0.008427       NaN       NaN
1       1  0.019699  0.011106       NaN       NaN
2       2  0.012661  0.023908  0.012661       NaN
3       2 -0.005059  0.018728  0.007538       NaN
4       4  0.025452  0.044657  0.033182  0.025452
5       4  0.010808  0.055948  0.044349  0.036535
6       4  0.004843  0.061062  0.049407  0.041555
7       4  0.005791  0.067207  0.055484  0.047587
8       4 -0.001816  0.065269  0.053568  0.045685
9       4  0.014102  0.080291  0.068425  0.060431
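Both versions above still loop once per unique period. If that loop becomes the bottleneck, one possible way to drop it (a sketch, not part of the answer above) is to compute a single cumulative product and rescale it per period with NumPy broadcasting. This assumes `Period` is sorted in non-decreasing order, as the question states, and that no return equals -1, which would make the division undefined. The function name `cumulative_returns_by_period` is made up for illustration. Results can differ from the loop in the last floating-point digits, and the intermediate array has one row per input row and one column per unique period.

```python
import numpy as np
import pandas as pd

def cumulative_returns_by_period(frame, period_col="Period", return_col="Returns"):
    """Add one M_<period> column per unique period without a Python loop.

    Assumes `period_col` is sorted in non-decreasing order.
    """
    periods = frame[period_col].to_numpy()
    # Growth factor compounded from the first row onward
    growth = (1 + frame[return_col]).cumprod().to_numpy()

    unique_periods = np.unique(periods)
    # Position of the first row of each period (valid because periods are sorted)
    starts = np.searchsorted(periods, unique_periods, side="left")

    # Growth accumulated just before each period starts; 1.0 if it starts at row 0
    base = np.where(starts > 0, growth[starts - 1], 1.0)

    # Broadcast rows against periods: keep rows at or after each start, NaN before
    rows = np.arange(len(frame))[:, None]
    values = np.where(rows >= starts, growth[:, None] / base - 1, np.nan)

    new_cols = pd.DataFrame(
        values, index=frame.index, columns=[f"M_{p}" for p in unique_periods]
    )
    return frame.join(new_cols)

# Sample data in the same shape as the question's, with a fixed seed
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "Period": np.sort(rng.integers(1, 5, 50)),
    "Returns": rng.standard_normal(50) / 100,
})
demo_out = demo.pipe(cumulative_returns_by_period)
print(demo_out.head())
```

Since the function returns a new frame, it can also be used in a method chain via `pipe`, as in the last lines of the sketch.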

