Aggregating Pandas DataFrame rows based on multiple conditions (user ID, end date = start date, etc.)

Question

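I have read many answers on how to aggregate rows in a pandas dataframe, but I am having a hard time working out how to apply them to my case. I have a dataframe containing trip data for vehicles, so each vehicle can make several trips within a given day. Here is a sample: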

vehicleID	start pos time	end pos time	duration (seconds)	meters travelled
XXXXXX	2021-10-26 06:01:12+00:00	2021-10-26 06:25:06+00:00	1434	2000
XXXXXX	2021-10-19 13:49:09+00:00	2021-10-19 13:59:29+00:00	620	5000
XXXXXX	2021-10-19 13:20:36+00:00	2021-10-19 13:26:40+00:00	364	70000
YYYYYY	2022-09-10 15:14:07+00:00	2022-09-10 15:29:39+00:00	932	8000
YYYYYY	2022-08-28 15:16:35+00:00	2022-08-28 15:28:43+00:00	728	90000

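It often happens that, on a given day, one trip starts only a few minutes after the previous trip ended, which means the two can be linked into a single trip.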

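I would like to aggregate these rows so that, if a trip's start pos time overlaps with the previous trip's end pos time, or the gap between the two is less than 30 minutes, the rows become a single row in which the duration in seconds and the meters travelled are summed, per vehicle ID of course. The new df should also contain the trips that did not need aggregating (edited for clarity). So this is the output I would like to get: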

vehicleID	start pos time	end pos time	duration (seconds)	meters travelled
XXXXXX	2021-10-26 06:01:12+00:00	2021-10-26 06:25:06+00:00	1434	2000
XXXXXX	2021-10-19 13:20:36+00:00	2021-10-19 13:59:29+00:00	984	75000
YYYYYY	2022-09-10 15:14:07+00:00	2022-09-10 15:29:39+00:00	932	8000
YYYYYY	2022-08-28 15:16:35+00:00	2022-08-28 15:28:43+00:00	728	90000

I feel a groupby and an agg will be involved, but I don't know how to go about this. Any help would be appreciated! Thanks!

Answer 1

There is probably a more efficient way to write this, but something like this should work (new_df has what you are looking for):

Note: the code below assumes that the start and end times are in datetime format.


import pandas as pd

df = pd.DataFrame({'vehicleID': {0: 'XXXXX', 1: 'XXXXX', 2: 'XXXXX', 3: 'YYYYY',
                      4: 'YYYYY'},
        'start pos time': {0: '2021-10-26 06:01:12+00:00',
                           1: '2021-10-19 13:49:09+00:00',
                           2: '2021-10-19 13:20:36+00:00',
                           3: '2022-09-10 15:14:07+00:00',
                           4: '2022-08-28 15:16:35+00:00'},
        'end pos time': {0: '2021-10-26 06:25:06+00:00',
                         1: '2021-10-19 13:59:29+00:00',
                         2: '2021-10-19 13:26:40+00:00',
                         3: '2022-09-10 15:29:39+00:00',
                         4: '2022-08-28 15:28:43+00:00'},
        'duration (seconds)': {0: 1434, 1: 620, 2: 364, 3: 932, 4: 728},
        'meters travelled': {0: 2000, 1: 5000, 2: 70000, 3: 8000, 4: 90000}
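        })

# the sample data above holds strings, so convert them to timezone-aware datetimes first
df['start pos time'] = pd.to_datetime(df['start pos time'])
df['end pos time'] = pd.to_datetime(df['end pos time'])

# sort dataframe by ID and then start time of trip
df = df.sort_values(by=['vehicleID', 'start pos time'])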

# create a new column with the end time of the previous ride
df.loc[:, 'prev end'] = df['end pos time'].shift(1)

# create a new column with the difference between the start time of the current trip and the end time of the prior one
df.loc[:, 'diff'] = df.loc[:, 'start pos time'] - df.loc[:, 'prev end']


# convert difference column to seconds (the .dt accessor replaces a per-row helper function)
df['diff'] = df['diff'].dt.total_seconds()

# where vehicle IDs are the same and the difference between the start time of the current trip and end time of the
# prior trip is less than or equal to 30 minutes, change the start time of the current trip to the start time of the 
# prior one
df.loc[((df['vehicleID'] == df['vehicleID'].shift(1)) & (df['diff'] <= 30*60)), 'start pos time'] = df['start pos time'].shift(1)

# create a new dataframe, grouped by vehicle ID and trip start time, using the maximum end time for each group
new_df = df.groupby(['vehicleID', 'start pos time'], as_index=False).agg({'end pos time':'max',
                                                                          'duration (seconds)':'sum',
                                                                          'meters travelled':'sum'})

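Edit: if more than 2 trips may need to be aggregated (as @ouroboros1 pointed out), you can replace everything after the "convert difference column to seconds" code with: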

# [based on @ouroboros1 solution] where vehicle IDs are the same and the difference between the start time of the current
# trip and end time of the prior trip is less than or equal to 30 minutes, put trips in the same "group"
df.loc[:, 'group'] = ((df['vehicleID'] != df['vehicleID'].shift(1)) | (df['diff'] > 30*60)).cumsum()

# create a new dataframe, grouped by vehicle ID and group, using the minimum start time and maximum end time for each group
new_df = df.groupby(['vehicleID', 'group'], as_index=False).agg({'start pos time':'min',
                                                                 'end pos time':'max',
                                                                 'duration (seconds)':'sum',
                                                                 'meters travelled':'sum'})
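
To see how the cumulative-sum grouping handles a chain of more than two trips, here is a minimal, self-contained sketch. The data is made up, and the `merge_trips` helper and its `max_gap_seconds` parameter are names introduced only for illustration. It also differs slightly from the code above: it shifts the end time within each vehicle via groupby instead of comparing vehicleID with its own shift.

```python
import pandas as pd

def merge_trips(df, max_gap_seconds=30 * 60):
    """Merge consecutive trips of one vehicle whose gap is <= max_gap_seconds."""
    df = df.sort_values(['vehicleID', 'start pos time']).copy()
    # end time of the previous trip of the same vehicle (NaT for a vehicle's first trip)
    prev_end = df.groupby('vehicleID')['end pos time'].shift(1)
    gap = (df['start pos time'] - prev_end).dt.total_seconds()
    # a new group starts at a vehicle's first trip (gap is NaN) or after a long gap
    new_group = gap.isna() | (gap > max_gap_seconds)
    df['group'] = new_group.cumsum()
    return df.groupby(['vehicleID', 'group'], as_index=False).agg(
        {'start pos time': 'min', 'end pos time': 'max',
         'duration (seconds)': 'sum', 'meters travelled': 'sum'})

# hypothetical data: three chained trips (10-minute gaps), then one trip hours later
trips = pd.DataFrame({
    'vehicleID': ['A', 'A', 'A', 'A'],
    'start pos time': pd.to_datetime(['2021-10-19 13:00:00+00:00', '2021-10-19 13:20:00+00:00',
                                      '2021-10-19 13:40:00+00:00', '2021-10-19 18:00:00+00:00']),
    'end pos time': pd.to_datetime(['2021-10-19 13:10:00+00:00', '2021-10-19 13:30:00+00:00',
                                    '2021-10-19 13:50:00+00:00', '2021-10-19 18:05:00+00:00']),
    'duration (seconds)': [600, 600, 600, 300],
    'meters travelled': [1000, 2000, 3000, 500],
})
merged = merge_trips(trips)
print(merged)
```

The first three trips share one group number because the flag is False for the second and third, so the running sum does not increase until the long gap before the fourth trip.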

Answer 2

I believe I have found a solution.

Setup

import pandas as pd
from datetime import timedelta

data = {'vehicleID': {0: 'XXXXX', 1: 'XXXXX', 2: 'XXXXX', 3: 'YYYYY', 
                      4: 'YYYYY'}, 
        'start pos time': {0: '2021-10-26 06:01:12+00:00', 
                           1: '2021-10-19 13:49:09+00:00', 
                           2: '2021-10-19 13:20:36+00:00', 
                           3: '2022-09-10 15:14:07+00:00', 
                           4: '2022-08-28 15:16:35+00:00'}, 
        'end pos time': {0: '2021-10-26 06:25:06+00:00', 
                         1: '2021-10-19 13:59:29+00:00', 
                         2: '2021-10-19 13:26:40+00:00', 
                         3: '2022-09-10 15:29:39+00:00', 
                         4: '2022-08-28 15:28:43+00:00'}, 
        'duration (seconds)': {0: 1434, 1: 620, 2: 364, 3: 932, 4: 728}, 
        'meters travelled': {0: 2000, 1: 5000, 2: 70000, 3: 8000, 4: 90000}
        }

df = pd.DataFrame(data)

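Assumptions:

All groups (unique values) in col vehicleID are consecutive.
Within each group in col vehicleID, the associated timestamps in col start pos time are sorted in descending order.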
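
If the data does not already satisfy these two assumptions, a sort can establish them. A minimal sketch, using a small made-up frame (`df_unsorted` and `df_sorted` are illustrative names, not part of the answer):

```python
import pandas as pd

# hypothetical unsorted data: vehicle IDs interleaved, times in no particular order
df_unsorted = pd.DataFrame({
    'vehicleID': ['YYYYY', 'XXXXX', 'YYYYY', 'XXXXX'],
    'start pos time': pd.to_datetime(['2022-08-28 15:16:35+00:00', '2021-10-19 13:20:36+00:00',
                                      '2022-09-10 15:14:07+00:00', '2021-10-26 06:01:12+00:00']),
})

# keep each vehicle's rows together and order its trips from latest to earliest
df_sorted = df_unsorted.sort_values(
    ['vehicleID', 'start pos time'], ascending=[True, False]
).reset_index(drop=True)

print(df_sorted)
```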

Problem

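Within each group in col vehicleID, if start pos time is earlier than the end pos time of the previous trip (i.e. the one in the next row), or less than 30 minutes later, those rows should become a single row, with the min of start pos time, the max of end pos time, and the sum of duration and meters travelled.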

Solution

# if still needed, change date time strings into timestamps
# (infer_datetime_format is deprecated as of pandas 2.0 and no longer needed)
df[['start pos time','end pos time']] = df[['start pos time','end pos time']].\
    apply(pd.to_datetime)

# check (end time + timedelta 29m+59s) < (start time shifted)
cond1 = (df.loc[:,'end pos time']+timedelta(minutes=29, seconds=59))\
    .lt(df.loc[:,'start pos time'].shift(1))

# check `vehicleID` != its own shift (this means a new group is starting)
# i.e. a new group should always get `True`
cond2 = (df.loc[:,'vehicleID'] != df.loc[:,'vehicleID'].shift(1))

# cumsum result of OR check conds
cond = (cond1 | cond2).cumsum()

# apply groupby on ['vehicleID' & cond] and aggregate appropriate functions
# (adding vehicleID is now unnecessary, but this keeps the col in the data)
res = df.groupby(['vehicleID', cond], as_index=False).agg(
    {'start pos time':'min',
     'end pos time':'max',
     'duration (seconds)':'sum',
     'meters travelled':'sum'}
    )

print(res)

  vehicleID            start pos time              end pos time  \
0     XXXXX 2021-10-26 06:01:12+00:00 2021-10-26 06:25:06+00:00   
1     XXXXX 2021-10-19 13:20:36+00:00 2021-10-19 13:59:29+00:00   
2     YYYYY 2022-09-10 15:14:07+00:00 2022-09-10 15:29:39+00:00   
3     YYYYY 2022-08-28 15:16:35+00:00 2022-08-28 15:28:43+00:00   

   duration (seconds)  meters travelled  
0                1434              2000  
1                 984             75000  
2                 932              8000  
3                 728             90000

I checked: the solution should also work if more than two consecutive trips stay within the defined range.

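Update: in the answer by @BeRT2me, the duration (seconds) values of the original rows merged into a new row are not summed; instead, the duration is recalculated from the new start and end times. That makes a lot of sense. If you want to do the same with my method, just adjust the last part of the code as follows: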

# cut out `duration` here:
res = df.groupby(['vehicleID', cond], as_index=False).agg(
    {'start pos time':'min',
     'end pos time':'max',
     # 'duration (seconds)':'sum',
     'meters travelled':'sum'}
    )

# and recalculate the duration
res['duration (seconds)'] = res['end pos time'].\
    sub(res['start pos time']).dt.total_seconds()
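
To make the difference between the two approaches concrete, here is a small sketch of the arithmetic for the two XXXXX trips on 2021-10-19 from the sample data (the variable names are only for illustration):

```python
import pandas as pd

# the two XXXXX trips on 2021-10-19 from the sample data
start_1 = pd.Timestamp('2021-10-19 13:20:36+00:00')
end_1 = pd.Timestamp('2021-10-19 13:26:40+00:00')
start_2 = pd.Timestamp('2021-10-19 13:49:09+00:00')
end_2 = pd.Timestamp('2021-10-19 13:59:29+00:00')

# summing the recorded durations counts driving time only
summed = 364 + 620

# recomputing from the merged start and end also counts the idle gap between the trips
recomputed = (end_2 - start_1).total_seconds()
idle_gap = (start_2 - end_1).total_seconds()

print(summed, recomputed, idle_gap)  # 984 2333.0 1349.0
```

Which number is the right one depends on whether "duration" should mean time spent driving or the full span of the merged trip.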

Answer 3

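# appears to assume the columns have been renamed with underscores (e.g. start_pos_time),
# hold datetimes, and that each vehicle's trips are sorted by start time in descending order
def func(d):
    # True where this trip starts less than 30 minutes after the earlier trip (next row) ended
    mask = d.start_pos_time.sub(d.end_pos_time.shift(-1)).lt('30m')
    # give such trips the start time of the earlier trip so that both share a group key
    d.loc[mask, 'start_pos_time'] = d.start_pos_time.shift(-1)
    d = d.groupby('start_pos_time', as_index=False).agg({'end_pos_time': 'max', 'meters_travelled': 'sum'})
    return d

df = df.groupby('vehicleID').apply(func).reset_index('vehicleID').reset_index(drop=True)

# recalculate the duration from the merged start and end times
df['duration_(seconds)'] = (df.end_pos_time - df.start_pos_time).dt.total_seconds()
print(df)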

Output:

  vehicleID            start_pos_time              end_pos_time  meters_travelled  duration_(seconds)
0     XXXXX 2021-10-19 13:20:36+00:00 2021-10-19 13:59:29+00:00             75000              2333.0
1     XXXXX 2021-10-26 06:01:12+00:00 2021-10-26 06:25:06+00:00              2000              1434.0
2     YYYYY 2022-08-28 15:16:35+00:00 2022-08-28 15:28:43+00:00             90000               728.0
3     YYYYY 2022-09-10 15:14:07+00:00 2022-09-10 15:29:39+00:00              8000               932.0
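
The snippet in this answer uses attribute access such as `d.start_pos_time`, so it presumably expects column names with underscores instead of spaces and time columns that are already datetimes; the answer does not show that step. One possible preparation, stated here as an assumption rather than as the author's code:

```python
import pandas as pd

# two rows of the sample data, still with the original column names and string times
df = pd.DataFrame({
    'vehicleID': ['XXXXX', 'XXXXX'],
    'start pos time': ['2021-10-19 13:49:09+00:00', '2021-10-19 13:20:36+00:00'],
    'end pos time': ['2021-10-19 13:59:29+00:00', '2021-10-19 13:26:40+00:00'],
    'duration (seconds)': [620, 364],
    'meters travelled': [5000, 70000],
})

# replace spaces with underscores so the columns work with attribute access
df.columns = df.columns.str.replace(' ', '_', regex=False)

# parse the time strings into timezone-aware datetimes
for col in ['start_pos_time', 'end_pos_time']:
    df[col] = pd.to_datetime(df[col])

print(df.columns.tolist())
```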

Answer 1
1 accepted 2022-09-15 19:35:44

Answer 2
1 2022-09-15 23:16:00

Answer 3
0 2022-09-16 00:39:47
