Pandas - multiple new columns based on cumprod and condition in other column
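I have a DataFrame df with two columns, df["Period"] and df["Returns"]. df["Period"] holds the numbers 1, 2, 3, ... n in increasing order. I want to compute new columns using .cumprod on df["Returns"] for the rows where df["Period"] >= 1, 2, 3, etc. Note that the number of rows per unique period varies and follows no regular pattern.

So I end up with n new columns:

df["M_1"]: the cumprod of df["Returns"] over the rows where df["Period"] >= 1
df["M_2"]: the cumprod of df["Returns"] over the rows where df["Period"] >= 2

Below is my working example. The implementation has two drawbacks:

Any hints on how to speed it up and/or vectorize it are appreciated.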
import numpy as np
import pandas as pd

# Create sample data
n = 10
data = {"Period": np.sort(np.random.randint(1, 5, n)),
        "Returns": np.random.randn(n) / 100}
df = pd.DataFrame(data)

# Slow implementation: one filtered cumprod per unique period
periods = set(df["Period"])
for period in periods:
    # Rows with Period < period are absent from cumret and become NaN on assignment
    cumret = (1 + df.query("Period >= @period")["Returns"]).cumprod() - 1
    df[f"M_{period}"] = cumret

df.head()
This is the expected output:
| | Period | Returns | M_1 | M_2 | M_3 | M_4 |
|---|---|---|---|---|---|---|
| 0 | 1 | -0.0268917 | -0.0268917 | NaN | NaN | NaN |
| 1 | 1 | 0.018205 | -0.00917625 | NaN | NaN | NaN |
| 2 | 2 | 0.00505662 | -0.00416604 | 0.00505662 | NaN | NaN |
| 3 | 2 | -8.28544e-05 | -0.00424855 | 0.00497334 | NaN | NaN |
| 4 | 2 | 0.00127519 | -0.00297878 | 0.00625488 | NaN | NaN |
| 5 | 3 | -0.00224315 | -0.00521524 | 0.0039977 | -0.00224315 | NaN |
| 6 | 3 | -0.0197291 | -0.0248414 | -0.0158103 | -0.021928 | NaN |
| 7 | 3 | 0.00136592 | -0.0235094 | -0.0144659 | -0.020592 | NaN |
| 8 | 4 | 0.00582897 | -0.0178175 | -0.00872129 | -0.0148831 | 0.00582897 |
| 9 | 4 | 0.00260425 | -0.0152597 | -0.00613975 | -0.0123176 | 0.0084484 |
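To make the definition of the M_k columns concrete, here is a minimal sketch that recomputes M_2 for the first five rows of the table. The Returns values are copied from the table as printed, so the results can only match to the displayed precision.

```python
import numpy as np
import pandas as pd

# First five rows of the expected-output table (values as printed there).
df = pd.DataFrame({
    "Period": [1, 1, 2, 2, 2],
    "Returns": [-0.0268917, 0.018205, 0.00505662, -8.28544e-05, 0.00127519],
})

# M_2 compounds only the rows where Period >= 2. The filtered Series keeps
# its original index, so rows 0 and 1 are left as NaN on assignment.
df["M_2"] = (1 + df.loc[df["Period"] >= 2, "Returns"]).cumprod() - 1
print(df)
```

Each M_k column therefore starts compounding at the first row of period k and leaves all earlier rows empty.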
Here is how your code performs on average on my machine (Python 3.10.7, Pandas 1.4.3) over 10,000 iterations:
import statistics
import time

import numpy as np
import pandas as pd

# df is the sample DataFrame built in the question
elapsed_time = []
for _ in range(10_000):
    start_time = time.time()
    periods = set(df["Period"])
    for period in periods:
        cumret = (1 + df.query("Period >= @period")["Returns"]).cumprod() - 1
        df[f"M_{period}"] = cumret
    elapsed_time.append(time.time() - start_time)

print(f"--- {round(statistics.mean(elapsed_time), 6):2} seconds in average ---")
print(df)
Output:
--- 0.00298 seconds in average ---
Period Returns M_1 M_2 M_4
0 1 -0.008427 -0.008427 NaN NaN
1 1 0.019699 0.011106 NaN NaN
2 2 0.012661 0.023908 0.012661 NaN
3 2 -0.005059 0.018728 0.007538 NaN
4 4 0.025452 0.044657 0.033182 0.025452
5 4 0.010808 0.055948 0.044349 0.036535
6 4 0.004843 0.061062 0.049407 0.041555
7 4 0.005791 0.067207 0.055484 0.047587
8 4 -0.001816 0.065269 0.053568 0.045685
9 4 0.014102 0.080291 0.068425 0.060431
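As a side note on methodology, the same measurement can be sketched with the standard library's timeit module, which uses a high-resolution clock and makes repeated runs easy to compare. The helper name add_columns_with_query, the seeded sample data, and the repeat/number values are assumptions made for this illustration; they are not part of the original benchmark.

```python
import timeit

import numpy as np
import pandas as pd

# Seeded sample data, same shape as in the question (seed is an assumption).
rng = np.random.default_rng(0)
n = 10
df = pd.DataFrame({
    "Period": np.sort(rng.integers(1, 5, n)),
    "Returns": rng.standard_normal(n) / 100,
})


def add_columns_with_query(frame):
    """The question's loop, applied to a copy so every run starts clean."""
    frame = frame.copy()
    for period in frame["Period"].unique():
        frame[f"M_{period}"] = (
            1 + frame.query("Period >= @period")["Returns"]
        ).cumprod() - 1
    return frame


# Three runs of 50 calls each; the minimum is usually the least noisy figure.
number = 50
runs = timeit.repeat(lambda: add_columns_with_query(df), repeat=3, number=number)
per_call = min(runs) / number
print(f"best of 3: {per_call:.6f} seconds per call")
```

Because the helper copies the frame on every call, the figure includes a small copy overhead and is not directly comparable to the averages reported here.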
With a few small modifications, you can get a speedup of roughly 3x:
elapsed_time = []
for _ in range(10_000):
    start_time = time.time()
    for period in df["Period"].unique():
        df[f"M_{period}"] = (
            1 + df.loc[df["Period"].ge(period), "Returns"]
        ).cumprod() - 1
    elapsed_time.append(time.time() - start_time)

print(f"--- {round(statistics.mean(elapsed_time), 6):2} seconds in average ---")
print(df)
Output:
--- 0.001052 seconds in average ---
Period Returns M_1 M_2 M_4
0 1 -0.008427 -0.008427 NaN NaN
1 1 0.019699 0.011106 NaN NaN
2 2 0.012661 0.023908 0.012661 NaN
3 2 -0.005059 0.018728 0.007538 NaN
4 4 0.025452 0.044657 0.033182 0.025452
5 4 0.010808 0.055948 0.044349 0.036535
6 4 0.004843 0.061062 0.049407 0.041555
7 4 0.005791 0.067207 0.055484 0.047587
8 4 -0.001816 0.065269 0.053568 0.045685
9 4 0.014102 0.080291 0.068425 0.060431
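Since the question also asks about vectorization, here is one possible sketch that removes the Python loop entirely. It is a different technique from the loop above: it computes a single cumprod over all rows and divides by the growth accumulated before each period starts. It assumes df["Period"] is sorted in ascending order (as in the sample data) and that no return equals -1, which would cause a division by zero. The function name cumulative_returns_by_period is hypothetical.

```python
import numpy as np
import pandas as pd


def cumulative_returns_by_period(df):
    """Return a copy of df with one M_<period> column per unique period.

    Assumes df["Period"] is sorted ascending and no return equals -1.
    """
    period = df["Period"].to_numpy()
    growth = (1.0 + df["Returns"].to_numpy()).cumprod()

    unique_periods = np.unique(period)
    # Position of the first row of each period (valid because Period is sorted).
    starts = np.searchsorted(period, unique_periods, side="left")
    # Growth accumulated strictly before each start; 1.0 when the start is row 0.
    base = np.concatenate(([1.0], growth))[starts]

    # Broadcast to shape (n_rows, n_periods), then blank out the earlier rows.
    values = growth[:, None] / base[None, :] - 1.0
    mask = period[:, None] >= unique_periods[None, :]
    values = np.where(mask, values, np.nan)

    columns = [f"M_{p}" for p in unique_periods]
    new_cols = pd.DataFrame(values, index=df.index, columns=columns)
    return pd.concat([df, new_cols], axis=1)


# Usage on seeded sample data (the seed is an assumption for reproducibility).
rng = np.random.default_rng(0)
n = 10
sample = pd.DataFrame({
    "Period": np.sort(rng.integers(1, 5, n)),
    "Returns": rng.standard_normal(n) / 100,
})
result = cumulative_returns_by_period(sample)
print(result)
```

The division replaces the per-period cumprod, so results may differ from the loop in the last few floating-point digits. The intermediate array has one entry per row and period, the same size as the output columns.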