
Calculating the mean for each time of day with a rolling window with pandas

I have a pandas dataframe with a datetime index and four columns: Phase 1, Phase 2, Phase 3 and Sum. The data was preprocessed to one row every 15 minutes and spans a few months. It is very cyclic, almost repeating every day, but it changes slowly over time. The goal is to produce, for each day, the mean of the value at each time of day over the last week (or some other timeframe), for a machine learning task.

I've managed to calculate the mean for each time of day using this code, which produces a one-day-long dataframe:

df.groupby(df.index.hour * 60 + df.index.minute).mean()
        Phase 1    Phase 2    Phase 3        Sum
Time                                            
0     10.105782  10.235237   9.990037  30.331055
15    10.106374  10.116440   9.991424  30.214238
30    10.106517  10.086310  10.003420  30.196246
45    10.128441  10.249100  10.032895  30.410436
...
1410  10.112582  10.643766   9.971592  30.727939
1425  10.102739  10.372299   9.969986  30.445025

This mean over all days together isn't very good, though, since the data changes gradually. It would be better to calculate this kind of mean using only data from the last week for each day.

What I've tried so far is this:

(df
    .groupby(df.index.hour * 60 + df.index.minute)
    .rolling("7D", closed="left")
    .mean())

It produces the correct data, but the date information is missing (it needs to be preserved for future calculations) and the rows are in the wrong order.

        Phase 1    Phase 2    Phase 3        Sum
Time                                            
0           NaN        NaN        NaN        NaN
0     10.064458  10.051470  10.177814  30.293742
0     10.043804   9.983143  10.062019  30.088965
0     10.020861   9.917236  10.000181  29.938278
...
0     10.224965  10.507418  10.030670  30.763053
0     10.155706  10.396408   9.919538  30.471651
0     10.149112  10.352153   9.894257  30.395522
0     10.144540  10.349998   9.902504  30.397042
15          NaN        NaN        NaN        NaN
15    10.061673   9.967295  10.143008  30.171976
15    10.059581  10.158814  10.051835  30.270230
15     9.995112  10.024808   9.999054  30.018974
...

There's also the issue of NaNs appearing when the first day is not fully present. Do incomplete days need to be removed first, or can they be incorporated into the mean?
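For what it's worth, the date information is not actually lost: `groupby(...).rolling(...)` returns a MultiIndex whose second level holds the original timestamps, so they can be recovered by dropping the group level and re-sorting. A sketch on synthetic data (column name and values are made up):

```python
import numpy as np
import pandas as pd

ix = pd.date_range("2021-02-01", "2021-02-20", freq="15min")
df = pd.DataFrame({"Sum": np.random.uniform(28, 32, len(ix))}, index=ix)

out = (
    df.groupby(df.index.hour * 60 + df.index.minute)
      .rolling("7D", closed="left")
      .mean()
      .droplevel(0)   # drop the minute-of-day group key, keep the timestamps
      .sort_index()   # restore chronological order
)
```

The leading NaNs come from `closed="left"` excluding the current row when no earlier data exists; they can be removed with `.dropna()`, or `min_periods` can be used to require a minimum number of observations per window.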

I've also tried this:

(df
    .groupby([
        pd.Grouper(freq="1D"),
        df.index.hour * 60 + df.index.minute
    ])
    .rolling("7D", closed="left")
    .mean())

But it produces a dataframe consisting only of NaNs, so something must be going very wrong.
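A likely explanation (my reading, not stated in the thread): grouping by both the day and the minute-of-day leaves exactly one row in each group, and `closed="left"` excludes that row from its own window, so every window is empty. A small demonstration with made-up data:

```python
import numpy as np
import pandas as pd

ix = pd.date_range("2021-02-01", periods=8, freq="15min")
df = pd.DataFrame({"Sum": np.arange(8.0)}, index=ix)

# Grouping by day AND minute-of-day leaves one row per group;
# closed="left" then excludes that single row from its own window,
# so every window is empty and every mean is NaN.
out = (
    df.groupby([pd.Grouper(freq="1D"), df.index.hour * 60 + df.index.minute])
      .rolling("7D", closed="left")
      .mean()
)
print(out["Sum"].isna().all())  # True
```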

The result is supposed to look something like this:

                       Phase 1    Phase 2    Phase 3        Sum
Time                                                           
2021-02-13 00:00:00  11.882597  12.779326  12.458625  37.120549
2021-02-13 00:15:00  11.866148  12.871785  12.509614  37.247547
2021-02-13 00:30:00  11.713676  12.730861  12.525868  36.970405
2021-02-13 00:45:00  11.742079  12.697406  12.592411  37.031897
2021-02-13 01:00:00  11.765234  12.848741  12.622687  37.236662
...
2021-05-01 10:30:00  11.842673  12.190760  12.572203  36.605636
2021-05-01 10:45:00  11.837964  12.118095  12.611271  36.567331
2021-05-01 11:00:00  11.827275  12.220564  12.588131  36.635970

In this example, the second row contains the average of the values at 2021-02-13 00:15:00, 2021-02-12 00:15:00, ..., 2021-02-07 00:15:00. I'm not new to programming, but I'm relatively new to Python and pandas, so any help and hints are very much appreciated.

You can pre-filter the dataset to the 13 days preceding the cut-off date dt, then group by time of day, take a 7-day rolling mean with min_periods=7, and dropna to remove dates whose windows accumulated values for fewer than 7 of the preceding days:

import numpy as np
import pandas as pd

# generate sample dataset
ix = pd.date_range('2021-01-01', '2021-05-01', freq='15min')
df = pd.DataFrame({
        'Phase1': np.random.uniform(0, 1, len(ix)),
        'Phase2': np.random.uniform(0, 1, len(ix)),
        'Phase3': np.random.uniform(0, 1, len(ix)),
    }, index=ix)
df['Sum'] = df.sum(axis=1)

# set max date
dt = pd.to_datetime('2021-02-14')

# filter out values in [dt - 13 days, dt)
z = df.loc[(df.index >= dt - pd.DateOffset(days=13)) & (df.index < dt)]

# calculate 7-day rolling average for the same time of the day
# for 7 days preceding `dt`
(z
     .groupby(z.index.time)
     .rolling('7d', min_periods=7)
     .mean()
     .dropna()        # drop windows that span fewer than 7 days
     .droplevel(0)    # drop the time-of-day group level, keep the timestamps
     .sort_index())   # restore chronological order

Output:

                       Phase1    Phase2    Phase3       Sum
2021-02-07 00:00:00  0.479466  0.731746  0.503017  1.714229
2021-02-07 00:15:00  0.443550  0.423135  0.543204  1.409889
2021-02-07 00:30:00  0.465272  0.626117  0.454462  1.545851
2021-02-07 00:45:00  0.528733  0.433475  0.386822  1.349029
2021-02-07 01:00:00  0.425309  0.360065  0.488509  1.273884
...                       ...       ...       ...       ...
2021-02-13 22:45:00  0.519717  0.490549  0.524330  1.534596
2021-02-13 23:00:00  0.367935  0.460093  0.373338  1.201366
2021-02-13 23:15:00  0.597424  0.438130  0.478259  1.513813
2021-02-13 23:30:00  0.675142  0.443580  0.330791  1.449514
2021-02-13 23:45:00  0.474604  0.355723  0.596467  1.426794
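If the rolling profile is needed for every day in the history rather than for a single cut-off date, the same recipe should work without the pre-filter; this is only a sketch along the same lines, again on synthetic data:

```python
import numpy as np
import pandas as pd

ix = pd.date_range("2021-01-01", "2021-05-01", freq="15min")
df = pd.DataFrame({"Sum": np.random.uniform(28, 32, len(ix))}, index=ix)

result = (
    df.groupby(df.index.time)       # one group per 15-minute slot
      .rolling("7d", min_periods=7) # mean over the current and 6 preceding days
      .mean()
      .dropna()                     # drop the first 6 days (incomplete windows)
      .droplevel(0)
      .sort_index()
)
```

With min_periods=7 the first full window for each time of day closes on the seventh day of data, so the result starts on 2021-01-07 here.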
