
Pandas groupby aggregation to truncate the earliest date group instead of the latest date group

I'm trying to aggregate from the end of a date range instead of from the beginning. I expected that adding closed='right' to the grouper would solve the issue, but it doesn't. Please let me know how I can achieve the desired output shown at the bottom, thanks.

import pandas as pd
df = pd.DataFrame(columns=['date','number'])
df['date'] = pd.date_range('1/1/2000', periods=8, freq='min')  # 'min' replaces the deprecated alias 'T'
df['number'] = pd.Series(range(8))
df

    date                number
0   2000-01-01 00:00:00 0
1   2000-01-01 00:01:00 1
2   2000-01-01 00:02:00 2
3   2000-01-01 00:03:00 3
4   2000-01-01 00:04:00 4
5   2000-01-01 00:05:00 5
6   2000-01-01 00:06:00 6
7   2000-01-01 00:07:00 7

With the groupby and aggregation on the date I get the following. Since I have 8 dates and I'm grouping by periods of 3, one group has to be short. Pandas must choose whether to truncate the earliest date group or the latest date group, and it chooses the latest one (the last group has a count of 2):

df.groupby(pd.Grouper(key='date', freq='3min')).agg('count')

date                number
2000-01-01 00:00:00 3
2000-01-01 00:03:00 3
2000-01-01 00:06:00 2

My desired output is to instead truncate the earliest date group:

date                number
2000-01-01 00:00:00 2
2000-01-01 00:02:00 3
2000-01-01 00:05:00 3

Please let me know how this can be achieved. I'm hopeful there's just a parameter I've overlooked. Note that this is similar to this question, but my question is specific to the date truncation.

EDIT: To reframe the question (thanks Alexdor): the default behavior in pandas is to bin the minutes as [0, 3), [3, 6), [6, 9), but instead I'd like to bin them as (-2, 1], (1, 4], (4, 7], so that the bins are anchored at the last timestamp rather than the first.
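To make the closed='right' behavior concrete, here is a minimal sketch on the same 8 rows (the variable name right_closed is just for illustration). It suggests that closed='right' only changes which edge of each bin is inclusive, while the bin edges stay aligned to the default origin (midnight of the first day), so the short groups end up at both ends instead of only at the start.

```python
import pandas as pd

# Same 8 one-minute rows as in the question.
df = pd.DataFrame({
    'date': pd.date_range('2000-01-01', periods=8, freq='min'),
    'number': range(8),
})

# closed='right' makes each bin (left, right], but the edges are still
# multiples of 3 minutes counted from midnight.
right_closed = df.groupby(
    pd.Grouper(key='date', freq='3min', closed='right')
)['number'].count()

print(right_closed)
# Counts per bin: 1, 3, 3, 1 (the first bin is labelled 1999-12-31 23:57:00)
```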

It seems like the grouper function builds up the bins starting from the oldest time in the series that you pass to it. I couldn't see a way to make it build up the bins from the newest time, but it's fairly easy to construct the bins from scratch.

freq = '3min'

minTime = df.date.min()
maxTime = df.date.max()
deltaT = pd.Timedelta(freq)
minTime -= deltaT - (maxTime - minTime) % deltaT  # shift the start back so the bin edges end exactly at maxTime
r = pd.date_range(start=minTime, end=maxTime, freq=freq)

# pd.cut uses right-closed intervals by default, so each bin is (edge, next edge]
df.groupby(pd.cut(df["date"], r), observed=False).agg('count')

Gives

                                            date  number
date
(1999-12-31 23:58:00, 2000-01-01 00:01:00]     2       2
(2000-01-01 00:01:00, 2000-01-01 00:04:00]     3       3
(2000-01-01 00:04:00, 2000-01-01 00:07:00]     3       3
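As a possible alternative to building the bin edges by hand, newer pandas versions accept an origin argument on pd.Grouper and resample. The sketch below assumes pandas 1.3 or later, where origin='end' is documented; the variable name from_end is only illustrative. With this origin the bins are anchored at the last timestamp and, as far as I can tell, are closed and labelled on the right by default.

```python
import pandas as pd

# Same 8 one-minute rows as in the question.
df = pd.DataFrame({
    'date': pd.date_range('2000-01-01', periods=8, freq='min'),
    'number': range(8),
})

# Assumption: pandas >= 1.3, which supports origin='end'.
# Bins are built backwards from the last timestamp (00:07), so any
# short group falls at the start of the range.
from_end = df.groupby(
    pd.Grouper(key='date', freq='3min', origin='end')
)['number'].count()

print(from_end)
# Counts per bin: 2, 3, 3 (labels are the right bin edges: 00:01, 00:04, 00:07)
```

Note that the labels are the right edges of the bins rather than the first timestamp in each group, so they differ from the labels in the desired output even though the counts match.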

This is one hack, which lets you group by a constant group size, counting from the bottom up. It groups by row position, so it assumes the rows are sorted by date and evenly spaced.

from itertools import chain

def grouper(x, k=3):
    n = len(x.index)
    # the first n % k rows form the short leading group 0; the rest fill full groups of k
    return list(chain.from_iterable([[0]*(n % k)] + [[i]*k for i in range(1, n//k + 1)]))

df['grouper'] = grouper(df, 3)

res = df.groupby('grouper', as_index=False)\
        .agg({'date': 'first', 'number': 'count'})\
        .drop(columns='grouper')

#                  date  number
# 0 2000-01-01 00:00:00       2
# 1 2000-01-01 00:02:00       3
# 2 2000-01-01 00:05:00       3
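The same bottom-up grouping can also be sketched with NumPy integer arithmetic instead of building lists. This is only an illustration under the same assumptions (rows sorted by date and evenly spaced); the names k, n and group_id are hypothetical and not part of the answer above.

```python
import numpy as np
import pandas as pd

# Same 8 one-minute rows as in the question.
df = pd.DataFrame({
    'date': pd.date_range('2000-01-01', periods=8, freq='min'),
    'number': range(8),
})

k = 3          # group size
n = len(df)

# np.arange(n)[::-1] is each row's distance from the last row.
# Integer-dividing by k numbers the groups from the end; subtracting from
# the highest group number flips them so ids increase with time.
group_id = (n - 1) // k - (np.arange(n)[::-1] // k)

res = df.groupby(group_id).agg(
    date=('date', 'first'),      # label each group by its first timestamp
    number=('number', 'count'),  # number of rows in the group
)
print(res)
```

For n = 8 and k = 3 the group ids are 0, 0, 1, 1, 1, 2, 2, 2, so the short group is the first one.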
