[英]Numpy, Pandas: what is the fastest way to calculate dataset row value basing on previous N values?
I have a dataset and I want to enrich it. 我有一个数据集,我想丰富它。 I need to calculate some new dataset column which is some function of previous N rows of another column.
我需要计算一些新的数据集列,该列是另一列的前N行的某些函数。
As an example, given I want to calculate binary column which shows if current day temperature is higher than average in previous N days. 例如,给定我要计算二进制列,该列显示当前日温度是否高于前N天的平均温度。
At the moment I just iterate through all the pandas dataset values using df.iterrows() and do appropriate calculations. 目前,我只是使用df.iterrows()遍历所有熊猫数据集值并进行适当的计算。 This takes some time.
这需要一些时间。 Is there any better option?
有没有更好的选择?
use rolling/moving window functions . 使用滚动/移动窗口功能 。
Sample DF: 样本DF:
In [46]: df = pd.DataFrame({'date':pd.date_range('2000-01-01', freq='D', periods=15), 'temp':np.random.rand(15)*20})
In [47]: df
Out[47]:
date temp
0 2000-01-01 17.246616
1 2000-01-02 18.228468
2 2000-01-03 6.245991
3 2000-01-04 8.890069
4 2000-01-05 6.837285
5 2000-01-06 1.555924
6 2000-01-07 18.641918
7 2000-01-08 6.308174
8 2000-01-09 13.601203
9 2000-01-10 6.482098
10 2000-01-11 15.711497
11 2000-01-12 18.690925
12 2000-01-13 2.493110
13 2000-01-14 17.626622
14 2000-01-15 6.982129
Solution: 解:
In [48]: df['higher_3avg'] = df.rolling(3)['temp'].mean().diff().gt(0)
In [49]: df
Out[49]:
date temp higher_3avg
0 2000-01-01 17.246616 False
1 2000-01-02 18.228468 False
2 2000-01-03 6.245991 False
3 2000-01-04 8.890069 False
4 2000-01-05 6.837285 False
5 2000-01-06 1.555924 False
6 2000-01-07 18.641918 True
7 2000-01-08 6.308174 False
8 2000-01-09 13.601203 True
9 2000-01-10 6.482098 False
10 2000-01-11 15.711497 True
11 2000-01-12 18.690925 True
12 2000-01-13 2.493110 False
13 2000-01-14 17.626622 True
14 2000-01-15 6.982129 False
Explanation: 说明:
In [50]: df.rolling(3)['temp'].mean()
Out[50]:
0 NaN
1 NaN
2 13.907025
3 11.121509
4 7.324448
5 5.761093
6 9.011709
7 8.835339
8 12.850431
9 8.797158
10 11.931599
11 13.628173
12 12.298511
13 12.936886
14 9.033954
Name: temp, dtype: float64
for huge data, Numpy solutions are 30x faster. 对于海量数据,Numpy解决方案的速度提高了30倍。 from Here :
从这里 :
def moving_average(a, n=3) :
ret = a.cumsum()
ret[n:] -= ret[:-n]
return ret[n - 1:] / n
In [419]: %timeit moving_average(df.values)
38.2 µs ± 1.97 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [420]: %timeit df.rolling(3).mean()
1.42 ms ± 11.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.