Pandas - multiple new columns based on cumprod and a condition on another column

Question

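I have a dataframe df with two columns, df["Period"] and df["Returns"]. df["Period"] takes the values 1, 2, 3, ... n and is sorted in increasing order. I want to compute new columns using .cumprod on df["Returns"], restricted to the rows where df["Period"] >= 1, 2, 3, and so on. Note that the number of rows per unique period varies and follows no regular pattern.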

So I end up with n new columns:

df["M_1"]: the cumprod of df["Returns"] over the rows where df["Period"] >= 1
df["M_2"]: the cumprod of df["Returns"] over the rows where df["Period"] >= 2
...
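To make the definition concrete, here is a minimal sketch on hand-picked toy values (the numbers and the name toy are illustrative only, not my real data). It assumes the same "1 + return, cumprod, minus 1" convention as my implementation further down.

```python
import numpy as np
import pandas as pd

# Hand-picked toy values (illustrative only)
toy = pd.DataFrame({"Period": [1, 2, 2], "Returns": [0.10, 0.20, 0.50]})

# Each M_k compounds only the rows where Period >= k.
# Rows outside the filter are left as NaN by index alignment on assignment.
toy["M_1"] = (1 + toy.loc[toy["Period"] >= 1, "Returns"]).cumprod() - 1
toy["M_2"] = (1 + toy.loc[toy["Period"] >= 2, "Returns"]).cumprod() - 1

print(toy.round(4))
# M_1 is roughly 0.10, 0.32, 0.98
# M_2 is NaN, then roughly 0.20, 0.80
```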

Below is my working example. The implementation has two drawbacks:

It is very slow for a large number of unique periods
It does not work with pandas method chaining

Any hints on how to speed it up and/or vectorize it are appreciated.

import numpy as np 
import pandas as pd

# Create sample data
n = 10
data = {"Period": np.sort(np.random.randint(1,5,n)),
        "Returns": np.random.randn(n)/100, }
df = pd.DataFrame(data)

# Slow implementation
periods = set(df["Period"])
for period in periods:
    cumret = (1 + df.query("Period >= @period")["Returns"]).cumprod() - 1
    df[f"M_{period}"] = cumret  # fixed: the loop variable is period, not month
df.head()

This is the expected output:

   Period       Returns           M_1           M_2           M_3           M_4
0       1    -0.0268917    -0.0268917           NaN           NaN           NaN
1       1      0.018205   -0.00917625           NaN           NaN           NaN
2       2    0.00505662   -0.00416604    0.00505662           NaN           NaN
3       2  -8.28544e-05   -0.00424855    0.00497334           NaN           NaN
4       2    0.00127519   -0.00297878    0.00625488           NaN           NaN
5       3   -0.00224315   -0.00521524     0.0039977   -0.00224315           NaN
6       3    -0.0197291    -0.0248414    -0.0158103     -0.021928           NaN
7       3    0.00136592    -0.0235094    -0.0144659     -0.020592           NaN
8       4    0.00582897    -0.0178175   -0.00872129    -0.0148831    0.00582897
9       4    0.00260425    -0.0152597   -0.00613975    -0.0123176     0.0084484

Answer 1

Here is how your code performs on average on my machine (Python 3.10.7, Pandas 1.4.3) over 10,000 iterations:

import statistics
import time

import numpy as np
import pandas as pd

elapsed_time = []
for _ in range(10_000):
    start_time = time.time()

    periods = set(df["Period"])
    for period in periods:
        cumret = (1 + df.query("Period >= @period")["Returns"]).cumprod() - 1
        df[f"M_{period}"] = cumret

    elapsed_time.append(time.time() - start_time)

print(f"--- {round(statistics.mean(elapsed_time), 6):2} seconds in average ---")
print(df)

Output:

--- 0.00298 seconds in average ---

   Period   Returns       M_1       M_2       M_4
0       1 -0.008427 -0.008427       NaN       NaN
1       1  0.019699  0.011106       NaN       NaN
2       2  0.012661  0.023908  0.012661       NaN
3       2 -0.005059  0.018728  0.007538       NaN
4       4  0.025452  0.044657  0.033182  0.025452
5       4  0.010808  0.055948  0.044349  0.036535
6       4  0.004843  0.061062  0.049407  0.041555
7       4  0.005791  0.067207  0.055484  0.047587
8       4 -0.001816  0.065269  0.053568  0.045685
9       4  0.014102  0.080291  0.068425  0.060431
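One caveat about the measurement above: the loop reuses the same df, so after the first iteration the M_ columns already exist and are simply overwritten. Below is a minimal sketch of an alternative measurement, assuming the standard-library timeit module and a fresh copy of the data on every call. The names base and original_loop are mine, and the copy adds a small overhead of its own, so the numbers are not directly comparable to the ones printed above.

```python
import timeit

import numpy as np
import pandas as pd

# Sample data shaped like the question's, with a fixed seed for repeatability
rng = np.random.default_rng(0)
n = 10
base = pd.DataFrame(
    {"Period": np.sort(rng.integers(1, 5, n)), "Returns": rng.standard_normal(n) / 100}
)


def original_loop(frame):
    """The loop from the question, applied to a fresh copy on each call."""
    out = frame.copy()
    for period in set(out["Period"]):
        out[f"M_{period}"] = (1 + out.query("Period >= @period")["Returns"]).cumprod() - 1
    return out


# 5 repeats of 50 calls each; the minimum is less sensitive to noise than the mean
calls = 50
per_call = min(timeit.repeat(lambda: original_loop(base), number=calls, repeat=5)) / calls
print(f"--- {per_call:.6f} seconds per call (best of 5) ---")
```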

With a few small modifications, you can get a speedup of about 3x:

elapsed_time = []
for _ in range(10_000):
    start_time = time.time()

    for period in df["Period"].unique():
        df[f"M_{period}"] = (
            1 + df.loc[df["Period"].ge(period), "Returns"]
        ).cumprod() - 1

    elapsed_time.append(time.time() - start_time)

print(f"--- {round(statistics.mean(elapsed_time), 6):2} seconds in average ---")
print(df)

Output:

--- 0.001052 seconds in average ---

   Period   Returns       M_1       M_2       M_4
0       1 -0.008427 -0.008427       NaN       NaN
1       1  0.019699  0.011106       NaN       NaN
2       2  0.012661  0.023908  0.012661       NaN
3       2 -0.005059  0.018728  0.007538       NaN
4       4  0.025452  0.044657  0.033182  0.025452
5       4  0.010808  0.055948  0.044349  0.036535
6       4  0.004843  0.061062  0.049407  0.041555
7       4  0.005791  0.067207  0.055484  0.047587
8       4 -0.001816  0.065269  0.053568  0.045685
9       4  0.014102  0.080291  0.068425  0.060431
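If you want to drop the Python loop entirely and keep method chaining, here is a minimal sketch of a vectorized variant based on NumPy broadcasting. The helper name add_cum_returns and its parameters are my own (hypothetical), not a pandas API. It builds an n_rows x n_periods array, so memory grows with the number of unique periods, and I have not benchmarked it against the loops above.

```python
import numpy as np
import pandas as pd


def add_cum_returns(df, period_col="Period", return_col="Returns", prefix="M_"):
    """Return a new frame with one cumulative-return column per unique period."""
    periods = df[period_col].to_numpy()
    growth = 1 + df[return_col].to_numpy()
    thresholds = np.unique(periods)  # sorted unique periods

    # mask[i, j] is True when row i takes part in column j (Period >= threshold j)
    mask = periods[:, None] >= thresholds[None, :]
    # Excluded rows contribute a neutral factor of 1 to the running product
    factors = np.where(mask, growth[:, None], 1.0)
    # Blank out the excluded rows again so they show NaN, as in the loop version
    cum = np.where(mask, factors.cumprod(axis=0) - 1, np.nan)

    new_cols = pd.DataFrame(
        cum, index=df.index, columns=[f"{prefix}{p}" for p in thresholds]
    )
    return df.join(new_cols)


# Usage inside a method chain, on small hand-picked data
sample = pd.DataFrame(
    {"Period": [1, 1, 2, 2, 4], "Returns": [0.01, -0.02, 0.03, 0.005, -0.01]}
)
result = sample.pipe(add_cum_returns)
print(result)
```

Because the helper returns a new frame instead of mutating df, it can sit anywhere in a .pipe chain. Note that df.join raises if the M_ columns already exist, so call it on a frame that does not have them yet.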
