
Pandas get average time interval within groups

I have a DataFrame containing an EffectiveDate column. I want to group the DataFrame by a Key value and then calculate the average time interval between the date values in the EffectiveDate column within each group.

For example, for the DataFrame:

    EffectiveDate
1   2015-08-17 07:00:00
1   2015-08-18 07:00:00
1   2015-08-19 07:00:00
2   2015-08-20 07:00:00
2   2015-08-21 07:00:00
2   2015-09-16 07:00:00
2   2015-10-15 07:00:00
2   2015-11-16 08:00:00

I want to group by the index and calculate the average interval between the date values in the EffectiveDate column.

15199   2015-08-17 07:00:00
15214   2015-08-18 07:00:00
15219   2015-08-19 07:00:00
15233   2015-08-20 07:00:00
15254   2015-08-21 07:00:00
15687   2015-09-16 07:00:00
199     2015-10-15 07:00:00
1123    2015-11-16 08:00:00
Name: EffectiveDate, dtype: datetime64[ns]

On a single Series this seems to work fine:

EffectiveDate.diff().astype('timedelta64[s]').mean()
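A minimal, self-contained sketch of the Series case. On newer pandas versions, `.dt.total_seconds()` is an alternative to the `astype('timedelta64[s]')` cast used above; both turn the consecutive differences into seconds before averaging:

```python
import pandas as pd

# A small Series of timestamps, one day apart.
effective_date = pd.Series(pd.to_datetime([
    "2015-08-17 07:00:00",
    "2015-08-18 07:00:00",
    "2015-08-19 07:00:00",
]))

# Consecutive differences (first entry is NaT), converted to seconds,
# then averaged; NaT is skipped by mean().
avg_seconds = effective_date.diff().dt.total_seconds().mean()
print(avg_seconds)  # 86400.0 (one day)
```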

However, when I use the same function as a groupby aggregate on a pandas DataFrame:

df.groupby('Key').agg({
    'EffectiveDate': lambda x: x.diff().astype('timedelta64[s]').mean()
})

the results are:

                  EffectiveDate                               
1 1970-01-01 00:00:00.016747425
2 1970-01-01 00:00:00.017765280
3 1970-01-01 00:00:00.034776096
4 1970-01-01 00:00:00.002052450
5 1970-01-01 00:00:00.018238800
6 1970-01-01 00:00:00.024005438 
7 1970-01-01 00:00:00.012330000

I would expect an integer field in each column. I am using Pandas 0.19.2.

GroupBy.agg seems to attempt to cast back to the original dtype of the EffectiveDate column in 0.19.2. This might make sense generally, I think, as we would expect an aggregation down the column to have the same dtype.

To fix this issue, you could use GroupBy.apply instead in 0.19.2, which doesn't perform the same cast afterwards.

df.groupby(df.index).apply(
    lambda x: x.diff().astype('timedelta64[s]').mean()
)
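A runnable sketch of the question's data with the apply-based fix. It uses `.dt.total_seconds()` rather than the `astype` cast, which also behaves consistently on recent pandas versions; the group means match the 0.18.1 output shown below (86400.0 and 1901700.0):

```python
import pandas as pd

# Rebuild the question's DataFrame: EffectiveDate column, Key as the index.
df = pd.DataFrame(
    {"EffectiveDate": pd.to_datetime([
        "2015-08-17 07:00:00",
        "2015-08-18 07:00:00",
        "2015-08-19 07:00:00",
        "2015-08-20 07:00:00",
        "2015-08-21 07:00:00",
        "2015-09-16 07:00:00",
        "2015-10-15 07:00:00",
        "2015-11-16 08:00:00",
    ])},
    index=[1, 1, 1, 2, 2, 2, 2, 2],
)

# Group on the index and average the gaps in seconds per group.
# apply avoids the cast back to datetime64 that agg performed in 0.19.2.
result = df.groupby(level=0)["EffectiveDate"].apply(
    lambda s: s.diff().dt.total_seconds().mean()
)
print(result)
# 1      86400.0
# 2    1901700.0
```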

Seemingly this didn't used to be the case, as I can reproduce your behavior in 0.18.1 only after casting back to the original dtype of the EffectiveDate column.

In 0.18.1:

>>> df
        EffectiveDate
1 2015-08-17 07:00:00
1 2015-08-18 07:00:00
1 2015-08-19 07:00:00
2 2015-08-20 07:00:00
2 2015-08-21 07:00:00
2 2015-09-16 07:00:00
2 2015-10-15 07:00:00
2 2015-11-16 08:00:00

>>> df.groupby(df.index).agg({
        'EffectiveDate': lambda x: x.diff().astype('timedelta64[s]').mean()
})

   EffectiveDate
1        86400.0
2      1901700.0

>>> df.groupby(df.index).agg({
        'EffectiveDate': lambda x: x.diff().astype('timedelta64[s]').mean()
}).astype(df.EffectiveDate.dtype)

                  EffectiveDate
1 1970-01-01 00:00:00.000086400
2 1970-01-01 00:00:00.001901700
