I have a list of nodes (about 2300 of them) that have hourly price data for about a year. I have a script that, for each node, loops through the times of the day to create a 4-hour trailing average, then groups the averages by month and hour. Finally, these hours in a month are averaged to give, for each month, a typical day of prices. I'm wondering if there is a faster way to do this because what I have seems to take a significant amount of time (about an hour). I also save the dataframes as csv files for later visualization (that's not the slow part).
df (before anything is done to it)
   Price_Node_Name Local_Datetime_HourEnding  Price Irrelevant_column
0          My-node       2016-08-17 01:00:00  20.95               EST
1          My-node       2016-08-17 02:00:00  21.45               EST
2          My-node       2016-08-17 03:00:00  25.60               EST
df_node (after the groupby as it looks going to csv)
Month  Hour     MA
    1     0  23.55
    1     1  23.45
    1     2  21.63
for node in node_names:
    df_node = df[df['Price_Node_Name'] == node]
    df_node['MA'] = df_node['Price'].rolling(4).mean()
    df_node = df_node.groupby([df_node['Local_Datetime_HourEnding'].dt.month,
                               df_node['Local_Datetime_HourEnding'].dt.hour]).mean()
    df_node.to_csv('%s_rollingavg.csv' % node)
I get a SettingWithCopyWarning, but I haven't quite figured out how to use .loc here, since the 'MA' column doesn't exist until I create it in this snippet, and any way I can think of to create it beforehand and fill it seems slower than what I have. I could be totally wrong though. Any help would be great.
python 3.6
Edit: I might have misread the question here; hopefully this at least sparks some ideas for a solution.
I think it is useful to have the index as the datetime column when working with time series data in Pandas.
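For the frame in the question, setting up that index might look like the sketch below. The column names and the three sample rows are copied from the question; the variable names `months` and `hours` are only illustrative.

```python
import pandas as pd

# Three rows copied from the question's sample frame.
df = pd.DataFrame({
    "Price_Node_Name": ["My-node"] * 3,
    "Local_Datetime_HourEnding": ["2016-08-17 01:00:00",
                                  "2016-08-17 02:00:00",
                                  "2016-08-17 03:00:00"],
    "Price": [20.95, 21.45, 25.60],
})

# Parse the timestamps (harmless if the column is already datetime64)
# and move them into the index.
df["Local_Datetime_HourEnding"] = pd.to_datetime(df["Local_Datetime_HourEnding"])
df = df.set_index("Local_Datetime_HourEnding").sort_index()

# With a DatetimeIndex, month and hour are available directly from the index.
months = df.index.month
hours = df.index.hour
```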
Here is some sample data:
Out[3]:
price
date
2015-01-14 00:00:00 155.427361
2015-01-14 01:00:00 205.285202
2015-01-14 02:00:00 205.305021
2015-01-14 03:00:00 195.000000
2015-01-14 04:00:00 213.102000
2015-01-14 05:00:00 214.500000
2015-01-14 06:00:00 222.544375
2015-01-14 07:00:00 227.090251
2015-01-14 08:00:00 227.700000
2015-01-14 09:00:00 243.456190
We use Series.rolling to create an MA column, i.e. we apply the method to the price column with a two-period window, and call mean on the resulting rolling object:
In [4]: df['MA'] = df.price.rolling(window=2).mean()
In [5]: df
Out[5]:
price MA
date
2015-01-14 00:00:00 155.427361 NaN
2015-01-14 01:00:00 205.285202 180.356281
2015-01-14 02:00:00 205.305021 205.295111
2015-01-14 03:00:00 195.000000 200.152510
2015-01-14 04:00:00 213.102000 204.051000
2015-01-14 05:00:00 214.500000 213.801000
2015-01-14 06:00:00 222.544375 218.522187
2015-01-14 07:00:00 227.090251 224.817313
2015-01-14 08:00:00 227.700000 227.395125
2015-01-14 09:00:00 243.456190 235.578095
And if you want month and hour columns, you can extract those from the index:
In [7]: df['month'] = df.index.month
In [8]: df['hour'] = df.index.hour
In [9]: df
Out[9]:
price MA month hour
date
2015-01-14 00:00:00 155.427361 NaN 1 0
2015-01-14 01:00:00 205.285202 180.356281 1 1
2015-01-14 02:00:00 205.305021 205.295111 1 2
2015-01-14 03:00:00 195.000000 200.152510 1 3
2015-01-14 04:00:00 213.102000 204.051000 1 4
2015-01-14 05:00:00 214.500000 213.801000 1 5
2015-01-14 06:00:00 222.544375 218.522187 1 6
2015-01-14 07:00:00 227.090251 224.817313 1 7
2015-01-14 08:00:00 227.700000 227.395125 1 8
2015-01-14 09:00:00 243.456190 235.578095 1 9
Then we can use groupby:
In [11]: df.groupby([
    ...:     df['month'],
    ...:     df['hour']
    ...: ]).mean()[['MA']]
Out[11]:
                    MA
month hour
1     0            NaN
      1     180.356281
      2     205.295111
      3     200.152510
      4     204.051000
      5     213.801000
      6     218.522187
      7     224.817313
      8     227.395125
      9     235.578095
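Applied to the question's frame, the same steps can run for every node at once by adding the node name to the grouping, which avoids filtering the full frame once per node. This is only a sketch and has not been timed on the real data: the column names come from the question, while `make_profiles` and the synthetic demo frame are made up for illustration.

```python
import numpy as np
import pandas as pd


def make_profiles(df, window=4):
    """Mean trailing-average price per (node, month, hour)."""
    df = (df.sort_values(["Price_Node_Name", "Local_Datetime_HourEnding"])
            .reset_index(drop=True))
    # Trailing mean within each node, so a window never spans two nodes.
    ma = (df.groupby("Price_Node_Name")["Price"]
            .transform(lambda s: s.rolling(window).mean()))
    ts = df["Local_Datetime_HourEnding"]
    out = df.assign(MA=ma, Month=ts.dt.month, Hour=ts.dt.hour)
    # One groupby for all nodes; mean() skips the leading NaNs of each window.
    return out.groupby(["Price_Node_Name", "Month", "Hour"])["MA"].mean()


# Synthetic demo data (made up): two nodes, 48 hourly prices each.
times = pd.date_range("2016-01-01 00:00", periods=48,
                      freq=pd.Timedelta(hours=1))
demo = pd.concat([
    pd.DataFrame({"Price_Node_Name": name,
                  "Local_Datetime_HourEnding": times,
                  "Price": start + np.arange(48.0)})
    for name, start in [("A", 0.0), ("B", 100.0)]
], ignore_index=True)

profiles = make_profiles(demo)
```

The result is a Series indexed by node, month and hour. Each node's table can still be written to its own CSV file by iterating over `profiles.groupby(level="Price_Node_Name")`.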
Here are a few things to try:
Set 'Price_Node_Name' as the index before the loop and select each node's rows with .loc:
df.set_index('Price_Node_Name', inplace=True)
for node in node_names:
    df_node = df.loc[node]
Pass sort=False as a kwarg to the groupby:
df_node.groupby(..., sort=False).mean()
Perform the rolling average AFTER the groupby, or don't do it at all; I don't think you need it in your case. Averaging each hour's prices over a month already gives the expected values for a typical day, which is what you want. If you still want the rolling average, perform it on the averaged hourly values for each month.
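The first and last suggestions could be combined roughly as follows. This is a sketch on made-up data, not the asker's real frame: the column names are from the question, while the node names, prices and the `profiles` dict are assumptions. Note that rolling over a 24-row profile leaves hours 0 to 2 as NaN, and the numbers can differ slightly from averaging the original trailing means, because the original windows reach back into the previous day.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the question's frame: two nodes, 72 hourly rows each,
# with price = base + hour of day so the expected averages are easy to check.
times = pd.date_range("2016-08-17 01:00", periods=72,
                      freq=pd.Timedelta(hours=1))
df = pd.concat([
    pd.DataFrame({"Price_Node_Name": name,
                  "Local_Datetime_HourEnding": times,
                  "Price": base + times.hour.to_numpy()})
    for name, base in [("node-a", 20.0), ("node-b", 50.0)]
], ignore_index=True)

# Index by node once, so each lookup is a label selection, not a full scan.
df = df.set_index("Price_Node_Name").sort_index()

profiles = {}
for node in df.index.unique():
    # assign() returns a new frame, so no SettingWithCopyWarning is raised.
    df_node = df.loc[[node]].assign(
        Month=lambda d: d["Local_Datetime_HourEnding"].dt.month,
        Hour=lambda d: d["Local_Datetime_HourEnding"].dt.hour,
    )
    # Average each hour over the month first ...
    hourly = df_node.groupby(["Month", "Hour"])["Price"].mean()
    # ... then take the rolling mean over each month's 24-hour profile.
    ma = hourly.groupby(level="Month").transform(lambda s: s.rolling(4).mean())
    profiles[node] = pd.DataFrame({"Price": hourly, "MA": ma})
```

Here the rolling window runs over 24 rows per month instead of every hourly row, which is where any saving from this ordering would come from.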