
Numpy, Pandas: what is the fastest way to calculate a dataset row value based on the previous N values?

I have a dataset that I want to enrich. I need to compute a new column that is a function of the previous N rows of another column.

For example, I want to compute a binary column that shows whether the current day's temperature is higher than the average over the previous N days.

At the moment I iterate over all rows of the pandas DataFrame with df.iterrows() and do the calculation for each row. This takes some time. Is there a better option?

Use rolling/moving window functions.

Sample DF:

In [46]: df = pd.DataFrame({'date':pd.date_range('2000-01-01', freq='D', periods=15), 'temp':np.random.rand(15)*20})

In [47]: df
Out[47]:
         date       temp
0  2000-01-01  17.246616
1  2000-01-02  18.228468
2  2000-01-03   6.245991
3  2000-01-04   8.890069
4  2000-01-05   6.837285
5  2000-01-06   1.555924
6  2000-01-07  18.641918
7  2000-01-08   6.308174
8  2000-01-09  13.601203
9  2000-01-10   6.482098
10 2000-01-11  15.711497
11 2000-01-12  18.690925
12 2000-01-13   2.493110
13 2000-01-14  17.626622
14 2000-01-15   6.982129

Solution:

In [48]: df['higher_3avg'] = df.rolling(3)['temp'].mean().diff().gt(0)

In [49]: df
Out[49]:
         date       temp  higher_3avg
0  2000-01-01  17.246616        False
1  2000-01-02  18.228468        False
2  2000-01-03   6.245991        False
3  2000-01-04   8.890069        False
4  2000-01-05   6.837285        False
5  2000-01-06   1.555924        False
6  2000-01-07  18.641918         True
7  2000-01-08   6.308174        False
8  2000-01-09  13.601203         True
9  2000-01-10   6.482098        False
10 2000-01-11  15.711497         True
11 2000-01-12  18.690925         True
12 2000-01-13   2.493110        False
13 2000-01-14  17.626622         True
14 2000-01-15   6.982129        False

Explanation:

In [50]: df.rolling(3)['temp'].mean()
Out[50]:
0           NaN
1           NaN
2     13.907025
3     11.121509
4      7.324448
5      5.761093
6      9.011709
7      8.835339
8     12.850431
9      8.797158
10    11.931599
11    13.628173
12    12.298511
13    12.936886
14     9.033954
Name: temp, dtype: float64
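
Consecutive 3-row window means differ by (temp[i] - temp[i-3]) / 3, so the diff().gt(0) expression above is True exactly when the current value exceeds the value three rows earlier. If the goal is to compare each day with the mean of the strictly preceding N days, one possible variant shifts the rolling mean down by one row. This is a minimal sketch; the toy data and the column name `higher_prev3avg` are assumptions made for illustration:

```python
import pandas as pd

# Toy data chosen so that the two flags disagree (an assumption, not the sample DF).
df = pd.DataFrame({'temp': [1.0, 10.0, 10.0, 2.0, 9.0]})
n = 3

# rolling(n).mean() ends at the current row, so shifting it by one row
# gives the mean of the previous n rows, excluding the current one.
prev_mean = df['temp'].rolling(n).mean().shift(1)

# Comparisons against NaN (the first n rows) evaluate to False.
df['higher_prev3avg'] = df['temp'] > prev_mean

# For contrast: the diff-based flag from the solution above.
df['higher_3avg'] = df['temp'].rolling(n).mean().diff().gt(0)

print(df)
```

On the 15-row sample above both flags happen to agree, but on this toy data they differ in rows 3 and 4.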

For huge data, NumPy solutions are roughly 30x faster. The function below is from a linked answer:

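import numpy as np

def moving_average(a, n=3):
    # The difference of cumulative sums n apart is the sum of each window.
    ret = np.cumsum(a, dtype=float)
    ret[n:] = ret[n:] - ret[:-n]
    return ret[n - 1:] / n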

In [419]: %timeit moving_average(df.values)
38.2 µs ± 1.97 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [420]: %timeit df.rolling(3).mean()
1.42 ms ± 11.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
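
moving_average returns len(a) - n + 1 values with no leading NaN, so its output has to be re-aligned before it can be stored as a column. The following is a hedged sketch of one way to do that for the "higher than the mean of the previous n rows" flag; the helper name `higher_than_prev_mean` and the toy data are hypothetical, not part of the original answer:

```python
import numpy as np
import pandas as pd

def moving_average(a, n=3):
    # Cumulative-sum moving average, as in the answer above.
    ret = np.cumsum(a, dtype=float)
    ret[n:] = ret[n:] - ret[:-n]
    return ret[n - 1:] / n

def higher_than_prev_mean(values, n=3):
    """Hypothetical helper: True where values[i] > mean(values[i-n:i])."""
    values = np.asarray(values, dtype=float)
    out = np.zeros(len(values), dtype=bool)  # first n rows stay False
    if len(values) > n:
        # ma[k] is the mean of values[k:k+n], so the window that
        # precedes row i is ma[i - n]; the last mean is not needed.
        ma = moving_average(values, n)
        out[n:] = values[n:] > ma[:-1]
    return out

df = pd.DataFrame({'temp': [1.0, 10.0, 10.0, 2.0, 9.0]})
df['higher_prev3avg'] = higher_than_prev_mean(df['temp'].to_numpy(), n=3)
print(df)
```

Because the averages come from differences of a running sum, floating-point error can accumulate on very long or large-magnitude series, so the values may differ slightly from rolling(n).mean().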

