How to sum timedeltas with resample or groupby in Pandas?

I have a DataFrame with TIME_IN and TIME_OUT columns (datetimes with second resolution). I want a new DataFrame with the sum of the durations (TIME_OUT - TIME_IN) by date. Each day runs from 5AM to 5AM, so I adjust for that as well.

This is part of a mini-project to teach myself Pandas, but my next application will be much more involved, so efficiency is key for me.

I've tried two approaches (resample and groupby), but both have the same issue: the timedelta DURATION column is not summed.

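import datetime as dt
import pandas as pd

hrEnd = 5  # each day runs 5AM to 5AM, per the description above

# Shift by hrEnd hours so early-morning entries count toward the previous day
df["DATE"] = pd.to_datetime((df["TIME_IN"]
             - dt.timedelta(hours=hrEnd)).dt.date)
df["DURATION"] = df["TIME_OUT"] - df["TIME_IN"]

# Attempt 1: groupby
dfGroupBy = df.groupby("DATE").sum()

# Attempt 2: resample (requires a DatetimeIndex)
df.set_index("DATE", inplace=True)
dfResample = df.resample("D").sum()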

It seems Pandas does not sum timedelta64 columns the way I attempted, so the returned DataFrame simply does not include the DURATION column. What is the most efficient way to do this?

EDIT: Here is a sample of the raw data in df: [image]

You can use the agg function of the grouped object to sum the durations, like below:

import pandas as pd
import numpy as np

np.random.seed(10)

## Generate dummy data for testing
dt_range = pd.date_range("oct-12-2019", "oct-14-2019", freq="H")

arr = []
while len(arr)<10:
    i,j = np.random.choice(len(dt_range), 2)
    g = np.random.choice(4)
    if j>i:
        arr.append([g, dt_range[i], dt_range[j]])

df = pd.DataFrame(arr, columns=["group", "time_in", "time_out"])


## Solution
df["duration"] = df["time_out"] - df["time_in"]
# Name the column explicitly; the string "sum" selects pandas' built-in group sum
df.groupby(df["time_in"].dt.date).agg({"duration": "sum"})
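
The snippet above groups by calendar date. To honor the 5AM-to-5AM day from the question, the grouping key can be shifted back by five hours first. What follows is only a minimal sketch: the three sample rows and the `DAY_START_HOUR` constant are made up for illustration and do not come from the original post.

```python
import pandas as pd

DAY_START_HOUR = 5  # assumption: the working day starts at 05:00, as in the question

# Hypothetical rows, for illustration only
df = pd.DataFrame(
    {
        "time_in": pd.to_datetime(
            ["2019-10-12 06:00:00", "2019-10-13 03:00:00", "2019-10-13 09:30:00"]
        ),
        "time_out": pd.to_datetime(
            ["2019-10-12 08:00:00", "2019-10-13 04:30:00", "2019-10-13 10:00:00"]
        ),
    }
)

df["duration"] = df["time_out"] - df["time_in"]

# Shift back by the day-start hour so 03:00 on Oct 13 counts toward Oct 12,
# then truncate to midnight to get one key per working day
day_key = (df["time_in"] - pd.Timedelta(hours=DAY_START_HOUR)).dt.normalize()

# Select the timedelta column explicitly before summing
daily = df.groupby(day_key)["duration"].sum()
print(daily)
```

Using `.dt.normalize()` keeps the key as datetime64 rather than Python `date` objects, which is usually the cheaper option on larger frames.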

I think your code works as expected?

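import datetime
import pandas as pd

df['TIME_IN'] = pd.to_datetime(df['TIME_IN'])
df['TIME_OUT'] = pd.to_datetime(df['TIME_OUT'])
# Shift by 5 hours so each 5AM-to-5AM day maps to a single date
df['DATE'] = (df['TIME_IN'] - datetime.timedelta(hours=5)).dt.date
df["DURATION"] = df["TIME_OUT"] - df["TIME_IN"]
# Select DURATION explicitly, then sum per date
df.groupby("DATE")['DURATION'].sum()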

Input into groupby:

    TIME_IN             TIME_OUT            DATE        DURATION
0   2019-05-06 11:46:51 2019-05-06 11:50:36 2019-05-06  00:03:45
1   2019-05-02 20:47:54 2019-05-02 20:52:22 2019-05-02  00:04:28
2   2019-05-05 07:39:02 2019-05-05 07:46:34 2019-05-05  00:07:32
3   2019-05-04 17:28:52 2019-05-04 17:32:57 2019-05-04  00:04:05
4   2019-05-05 14:08:26 2019-05-05 14:14:30 2019-05-05  00:06:04

Output after groupby:

DATE
2019-05-02   00:04:28
2019-05-04   00:04:05
2019-05-05   00:13:36
2019-05-06   00:03:45

Seems to work as expected.
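
For the resample half of the question, the same idea appears to carry over: put the shifted timestamps in the index and pick the DURATION column before summing. Below is a rough sketch that reuses the five sample rows shown above; it is one possible way to do it, not necessarily the fastest.

```python
import pandas as pd

# The five sample rows from the groupby input shown above
df = pd.DataFrame(
    {
        "TIME_IN": pd.to_datetime(
            [
                "2019-05-06 11:46:51",
                "2019-05-02 20:47:54",
                "2019-05-05 07:39:02",
                "2019-05-04 17:28:52",
                "2019-05-05 14:08:26",
            ]
        ),
        "TIME_OUT": pd.to_datetime(
            [
                "2019-05-06 11:50:36",
                "2019-05-02 20:52:22",
                "2019-05-05 07:46:34",
                "2019-05-04 17:32:57",
                "2019-05-05 14:14:30",
            ]
        ),
    }
)
df["DURATION"] = df["TIME_OUT"] - df["TIME_IN"]

# resample needs a DatetimeIndex; shifting by 5 hours makes each daily bin
# cover 05:00 to 05:00 of the original clock time
shifted = df.set_index(df["TIME_IN"] - pd.Timedelta(hours=5)).sort_index()

# Selecting DURATION keeps the datetime columns out of the sum
daily = shifted["DURATION"].resample("D").sum()
print(daily)
```

Unlike groupby, resample also emits a row for days with no entries (2019-05-03 here), so the result has one row per calendar day in the covered range.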
