Fastest way to get rolling averages in pandas?

I have a list of nodes (about 2300 of them) that have hourly price data for about a year. I have a script that, for each node, loops through the times of the day to create a 4-hour trailing average, then groups the averages by month and hour. Finally, these hours in a month are averaged to give, for each month, a typical day of prices. I'm wondering if there is a faster way to do this because what I have seems to take a significant amount of time (about an hour). I also save the dataframes as csv files for later visualization (that's not the slow part).

df (before anything is done to it)
        Price_Node_Name      Local_Datetime_HourEnding   Price      Irrelevant_column
0       My-node              2016-08-17 01:00:00         20.95      EST
1       My-node              2016-08-17 02:00:00         21.45      EST
2       My-node              2016-08-17 03:00:00         25.60      EST

df_node (after the groupby as it looks going to csv)
Month        Hour             MA
1            0                23.55
1            1                23.45
1            2                21.63


for node in node_names:
    df_node = df[df['Price_Node_Name'] == node]
    df_node['MA'] = df_node['Price'].rolling(4).mean()
    df_node = df_node.groupby([df_node['Local_Datetime_HourEnding'].dt.month,
                               df_node['Local_Datetime_HourEnding'].dt.hour]).mean()
    df_node.to_csv('%s_rollingavg.csv' % node)

I get a SettingWithCopyWarning, but I haven't quite figured out how to use .loc here, since the column ['MA'] doesn't exist until I create it in this snippet, and any way I can think of to create it beforehand and fill it seems slower than what I have. I could be totally wrong, though. Any help would be great.

python 3.6

edit: I might have misread the question here, hopefully this at least sparks some ideas for the solution.

I think it is useful to have the index as the datetime column when working with time series data in Pandas.

Here is some sample data:

Out[3]:
                          price
date
2015-01-14 00:00:00  155.427361
2015-01-14 01:00:00  205.285202
2015-01-14 02:00:00  205.305021
2015-01-14 03:00:00  195.000000
2015-01-14 04:00:00  213.102000
2015-01-14 05:00:00  214.500000
2015-01-14 06:00:00  222.544375
2015-01-14 07:00:00  227.090251
2015-01-14 08:00:00  227.700000
2015-01-14 09:00:00  243.456190
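A frame of this shape can be built from the question's data by parsing the timestamp column and making it the index. This is a minimal sketch, assuming the question's column names; the three rows are the ones shown in the question, and the lowercase `price` column and `date` index name are only there to match the sample above.

```python
import pandas as pd

# The three rows shown in the question, with timestamps still as strings.
raw = pd.DataFrame({
    "Price_Node_Name": ["My-node"] * 3,
    "Local_Datetime_HourEnding": [
        "2016-08-17 01:00:00",
        "2016-08-17 02:00:00",
        "2016-08-17 03:00:00",
    ],
    "Price": [20.95, 21.45, 25.60],
})

# Parse the strings so that .month / .hour are available later.
raw["Local_Datetime_HourEnding"] = pd.to_datetime(raw["Local_Datetime_HourEnding"])

# Use the timestamp as the index and keep only the price column.
df = (
    raw.set_index("Local_Datetime_HourEnding")
       .rename_axis("date")
       .rename(columns={"Price": "price"})[["price"]]
)
```

With a DatetimeIndex in place, attributes such as `df.index.month` and `df.index.hour` work directly.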

We use Series.rolling to create an MA column, i.e. we apply the method to the price column with a two-period window and call mean on the resulting rolling object:

In [4]: df['MA'] = df.price.rolling(window=2).mean()

In [5]: df
Out[5]:
                          price          MA
date
2015-01-14 00:00:00  155.427361         NaN
2015-01-14 01:00:00  205.285202  180.356281
2015-01-14 02:00:00  205.305021  205.295111
2015-01-14 03:00:00  195.000000  200.152510
2015-01-14 04:00:00  213.102000  204.051000
2015-01-14 05:00:00  214.500000  213.801000
2015-01-14 06:00:00  222.544375  218.522187
2015-01-14 07:00:00  227.090251  224.817313
2015-01-14 08:00:00  227.700000  227.395125
2015-01-14 09:00:00  243.456190  235.578095

If you want month and hour columns, you can extract those from the index:

In [7]: df['month'] = df.index.month  

In [8]: df['hour'] = df.index.hour

In [9]: df
Out[9]:
                          price          MA  month  hour
date
2015-01-14 00:00:00  155.427361         NaN      1     0
2015-01-14 01:00:00  205.285202  180.356281      1     1
2015-01-14 02:00:00  205.305021  205.295111      1     2
2015-01-14 03:00:00  195.000000  200.152510      1     3
2015-01-14 04:00:00  213.102000  204.051000      1     4
2015-01-14 05:00:00  214.500000  213.801000      1     5
2015-01-14 06:00:00  222.544375  218.522187      1     6
2015-01-14 07:00:00  227.090251  224.817313      1     7
2015-01-14 08:00:00  227.700000  227.395125      1     8
2015-01-14 09:00:00  243.456190  235.578095      1     9

Then we can use groupby to average MA for each (month, hour) pair:

In [11]: df.groupby([
    ...:     df['month'],
    ...:     df['hour']
    ...: ]).mean()[['MA']]
Out[11]:
                    MA
month hour
1     0            NaN
      1     180.356281
      2     205.295111
      3     200.152510
      4     204.051000
      5     213.801000
      6     218.522187
      7     224.817313
      8     227.395125
      9     235.578095
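The steps above cover a single node. For a frame holding many nodes, the same idea might be applied without a Python-level loop by grouping on the node name. This is a sketch under assumptions: the column names come from the question, while the synthetic prices, the node labels and the `typical_day` name are made up for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the question's frame: two nodes, 60 days of hourly prices.
hours = pd.date_range("2016-08-17 01:00", periods=24 * 60, freq=pd.Timedelta(hours=1))
nodes = ["node_a", "node_b"]
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Price_Node_Name": np.repeat(nodes, len(hours)),
    "Local_Datetime_HourEnding": np.tile(hours.values, len(nodes)),
    "Price": rng.uniform(15, 40, size=len(hours) * len(nodes)),
})

# Rolling windows depend on row order, so sort by node and time first.
df = df.sort_values(["Price_Node_Name", "Local_Datetime_HourEnding"])

# 4-hour trailing mean computed within each node, so no window spans two nodes.
df["MA"] = (
    df.groupby("Price_Node_Name")["Price"]
      .transform(lambda s: s.rolling(4).mean())
)

# Average MA per node, month and hour in a single groupby.
ts = df["Local_Datetime_HourEnding"]
typical_day = df.groupby(
    ["Price_Node_Name", ts.dt.month.rename("Month"), ts.dt.hour.rename("Hour")]
)["MA"].mean()

# One CSV per node could then be written from the combined result, e.g.:
# for node, part in typical_day.groupby(level="Price_Node_Name"):
#     part.droplevel("Price_Node_Name").to_csv(f"{node}_rollingavg.csv")
```

Selecting only the `MA` column before calling `mean` also keeps non-numeric columns out of the aggregation.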

Here are a few things to try:

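set 'Price_Node_Name' as the index before the loop, then select each node's rows by label with .loc

df.set_index('Price_Node_Name', inplace=True)
for node in node_names:
    # .loc looks up rows by index label; df[node] would look for a column named node
    df_node = df.loc[node]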

use sort=False as a kwarg in the groupby

df_node.groupby(..., sort=False).mean()

Perform the rolling average AFTER the groupby, or don't do it at all--I don't think you need it in your case. Averaging the hourly totals for a month will give you the expected values for a typical day, which is what you desire. If you still want the rolling average, perform it on the averaged hourly totals for each month.
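The last suggestion can be sketched as follows for a single node. The data here is synthetic and the names `hourly_mean` and `smoothed` are hypothetical; the column names are taken from the question. Note that the two orderings are not interchangeable everywhere: when the window is applied after averaging, the first three hours of each month's typical day have no full 4-hour window and come out as NaN, whereas the original loop reaches back into the previous day.

```python
import numpy as np
import pandas as pd

# Synthetic single-node data: every hour of September 2016.
hours = pd.date_range("2016-09-01 00:00", periods=24 * 30, freq=pd.Timedelta(hours=1))
rng = np.random.default_rng(0)
df_node = pd.DataFrame({
    "Local_Datetime_HourEnding": hours,
    "Price": rng.uniform(15, 40, size=len(hours)),
})

ts = df_node["Local_Datetime_HourEnding"]

# Step 1: average the raw prices for each (month, hour) pair.
hourly_mean = df_node.groupby(
    [ts.dt.month.rename("Month"), ts.dt.hour.rename("Hour")]
)["Price"].mean()

# Step 2: 4-hour trailing mean over each month's 24 averaged hours.
smoothed = (
    hourly_mean.groupby(level="Month")
               .transform(lambda s: s.rolling(4).mean())
)
```

The rolling step now runs over 24 values per month instead of every hourly row, which is where the saving would come from.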
