Pandas timeseries: groupby and rolling average of irregularly spaced data over regular 10-minute windows

I have a dataframe that looks like this:

|-----------------------------------------------------|
|                        | category   | pct_formation |
|-----------------------------------------------------|
|ts_timestamp            |            |               |
|-----------------------------------------------------|
|2018-10-22 10:13:44.043 | in_petr    | 37.07         |
|2018-10-22 10:17:09.527 | in_petr    | 36.97         |
|2018-10-22 10:17:43.977 | in_dsh     | 36.95         |
|2018-10-22 10:17:43.963 | in_dsh     | 36.96         |
|2018-10-22 10:17:09.527 | in_petr    | 32.96         |
|2018-10-22 10:19:44.040 | out_petr   | 36.89         |
|2018-10-23 10:19:44.043 | out_petr   | 36.90         |
|2018-10-23 10:19:37.267 | sync       | 33.91         |
|2018-10-23 10:19:44.057 | sync       | 36.96         |
|2018-10-23 10:19:16.750 | out_petr   | 36.88         |
|2018-10-23 10:20:03.160 | sync       | 36.98         |
|2018-10-23 10:20:32.350 | sync       | 37.00         |
|2018-10-23 10:23:03.150 | sync       | 34.58         |
|2018-10-23 10:22:18.633 | in_dsh     | 36.98         |
|2018-10-23 10:25:39.557 | in_dsh     | 36.97         |
|-----------------------------------------------------|

The data contains pct_formation values for various categories, collected at different times every day (irregular frequency, unevenly spaced).

I want to compare the average pct_formation of each category over a 10-minute rolling window between 9am and 11am, either for each day or averaged over a week.

The problem is that the data for each category does not always start coming in at 9am. For some categories it starts at 9:10am, for some at 9:15am, for some at 10am, and so on. The data also does not arrive at regular intervals. How can I get the 10-minute rolling average for each day and each category between 9am and 11am?

Initially, I converted the ts_timestamp column to an index:

df = df.set_index('ts_timestamp')

Then I can use groupby and rolling() like this:

df.groupby('category').rolling('10T').agg({'pct_formation': 'mean'})

However, this does not give me regular 10-minute intervals; the result is still indexed by the timestamps from the dataframe.
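A minimal sketch of that behaviour, assuming a small hand-typed sample shaped like the table above (the values are copied from it, the variable names are illustrative). A time-based rolling window is evaluated once per existing row, so the result has exactly as many rows as the input and keeps the original timestamps. The index is sorted first because time-based rolling expects a monotonic index within each group.

```python
import pandas as pd

# Small sample copied from the table above, for illustration only.
df = pd.DataFrame(
    {
        "ts_timestamp": pd.to_datetime([
            "2018-10-22 10:13:44.043",
            "2018-10-22 10:17:09.527",
            "2018-10-22 10:17:43.977",
            "2018-10-22 10:17:43.963",
            "2018-10-22 10:19:44.040",
        ]),
        "category": ["in_petr", "in_petr", "in_dsh", "in_dsh", "out_petr"],
        "pct_formation": [37.07, 36.97, 36.95, 36.96, 36.89],
    }
).set_index("ts_timestamp")

# Sort by time, then take a trailing 10-minute mean within each category.
rolled = (
    df.sort_index()
      .groupby("category")["pct_formation"]
      .rolling("10min")
      .mean()
)

# One output row per input row, indexed by (category, original timestamp).
print(rolled)
```

Each value is the mean of that category's rows in the 10 minutes up to and including that row's own timestamp, which is why no regular grid appears.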

I realize that I would need to create a date range like this to use as the index:

pd.date_range(start=df.index.min().replace(hour=9, minute=0, second=0, microsecond=0),
              end=df.index.max().replace(hour=11, minute=0, second=0, microsecond=0),
              freq='10T')
#
# or should I use freq='1T' so that rolling() can do 10 minute intervals?

But how can I align my dataframe with this range? How can I average multiple values that fall within the same interval of the range?
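One possible way to do this alignment, sketched under some assumptions: the sample rows are copied from the table above, the names `window`, `binned`, `grid` and `regular` are illustrative, and the grid covers only 09:00 to 11:00 on each day present in the data rather than one continuous range across nights. Rows are first averaged into fixed 10-minute bins per category, then reindexed onto the regular grid so that empty bins show up as NaN.

```python
import pandas as pd

# Small sample copied from the table above, for illustration only.
df = pd.DataFrame(
    {
        "ts_timestamp": pd.to_datetime([
            "2018-10-22 10:13:44.043",
            "2018-10-22 10:17:09.527",
            "2018-10-22 10:17:43.977",
            "2018-10-22 10:17:43.963",
            "2018-10-23 10:20:03.160",
            "2018-10-23 10:23:03.150",
        ]),
        "category": ["in_petr", "in_petr", "in_dsh", "in_dsh", "sync", "sync"],
        "pct_formation": [37.07, 36.97, 36.95, 36.96, 36.98, 34.58],
    }
).set_index("ts_timestamp").sort_index()

# Keep only rows whose time of day lies between 09:00 and 11:00.
window = df.between_time("09:00", "11:00")

# Average all values that fall into the same 10-minute bin, per category.
# Bins are labelled by their left edge, e.g. 10:10 covers 10:10:00-10:19:59.
binned = (
    window.groupby(["category", pd.Grouper(freq="10min")])["pct_formation"]
          .mean()
          .unstack("category")
)

# Regular grid: twelve 10-minute slots starting at 09:00 for each day seen.
days = window.index.normalize().unique()
grid = pd.DatetimeIndex(
    [slot
     for day in days
     for slot in pd.date_range(day + pd.Timedelta(hours=9), periods=12, freq="10min")]
)

# Align the binned means to the grid; slots without data become NaN.
regular = binned.reindex(grid)
print(regular)
```

These are fixed (non-overlapping) 10-minute bins rather than a sliding window; categories that start late in the morning simply have NaN in the earlier slots.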

I am new to working with time series data and would appreciate any help. Please feel free to ask if anything is not clear.

Using pd.Grouper:

# Assumes ts_timestamp is a regular column; if it was set as the index, call df.reset_index() first.
df.groupby(['category', pd.Grouper(key='ts_timestamp', freq='10Min')]).agg({'pct_formation': 'mean'})

Output (this matches the first nine rows of the table above; the column and index names are abbreviated):

                                    pct
cat      ts                            
in_dsh   2018-10-22 10:10:00  36.955000
in_petr  2018-10-22 10:10:00  35.666667
out_petr 2018-10-22 10:10:00  36.890000
         2018-10-23 10:10:00  36.900000
sync     2018-10-23 10:10:00  35.435000
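For the "average over a week" part of the question, one option is to collapse the per-day bins by time of day. The following is a sketch only: the sample rows are copied from the table above, `slot` and `across_days` are illustrative names, and the result is an unweighted mean of the daily bin means (days with more readings do not count more).

```python
import pandas as pd

# Small sample copied from the table above; ts_timestamp is kept as a column.
df = pd.DataFrame(
    {
        "ts_timestamp": pd.to_datetime([
            "2018-10-22 10:19:44.040",
            "2018-10-23 10:19:44.043",
            "2018-10-23 10:19:16.750",
            "2018-10-23 10:19:37.267",
            "2018-10-23 10:19:44.057",
            "2018-10-23 10:20:03.160",
        ]),
        "category": ["out_petr", "out_petr", "out_petr", "sync", "sync", "sync"],
        "pct_formation": [36.89, 36.90, 36.88, 33.91, 36.96, 36.98],
    }
)

# Step 1: mean per category and 10-minute bin, as in the pd.Grouper answer.
binned = (
    df.groupby(["category", pd.Grouper(key="ts_timestamp", freq="10min")])["pct_formation"]
      .mean()
      .reset_index()
)

# Step 2: label each bin by its time of day, dropping the date.
binned["slot"] = binned["ts_timestamp"].dt.strftime("%H:%M")

# Step 3: average the daily bin means that share the same slot.
across_days = binned.groupby(["category", "slot"])["pct_formation"].mean()
print(across_days)
```

To weight every reading equally instead, the slot label could be computed on the raw rows and the mean taken in a single groupby over category and slot.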
