[英]Calculating row-wise time difference in python
我想根据他们第一次上车的时刻和他们离开的时刻之间的差异来计算我的数据框中每个乘客的旅行时间。
这是数据框
my_df = pd.DataFrame({
'id': ['a', 'b', 'b', 'b', 'b', 'b', 'c','d'],
'date': ['2020/02/03', '2020/04/05', '2020/04/05', '2020/04/05','2020/04/06', '2020/04/06', '2020/12/15', '2020/06/23'],
'arriving_time': ['14:36:06', '08:52:02', '08:53:02', '08:55:24', '18:58:03', '19:03:05', '17:04:28', '21:31:23'],
'leaving_time': ['14:40:05', '08:52:41', '08:54:33', '08:57:14', '19:01:07', '19:04:08', '17:09:48', '21:50:12']
})
print(my_df)
output:
id date arriving_time leaving_time
0 a 2020/02/03 14:36:06 14:40:05
1 b 2020/04/05 08:52:02 08:52:41
2 b 2020/04/05 08:53:02 08:54:33
3 b 2020/04/05 08:55:24 08:57:14
4 b 2020/04/06 18:58:03 19:01:07
5 b 2020/04/06 19:03:05 19:04:08
6 c 2020/12/15 17:04:28 17:09:48
7 d 2020/06/23 21:31:23 21:50:12
但是有两个问题(我无法自己解决):
这是我想要获得的结果
id date arriving_time leaving_time travelTime
0 a 2020/02/03 14:36:06 14:40:05 00:03:59
1 b 2020/04/05 08:52:02 08:52:41 00:05:12
2 b 2020/04/05 08:53:02 08:54:33 00:05:12
3 b 2020/04/05 08:55:24 08:57:14 00:05:12
4 b 2020/04/06 18:58:03 19:01:07 00:06:05
5 b 2020/04/06 19:03:05 19:04:08 00:06:05
6 c 2020/12/15 17:04:28 17:09:48 00:05:20
7 d 2020/06/23 21:31:23 21:50:12 00:18:49
如您所见,乘客 b 在同一天进行了两次不同的旅行,我想知道计算每一次旅行的持续时间。
我已经尝试了下面的代码,它似乎有效,但它真的很慢(我认为这是由于 my_df 的行数很大)
for user_id in set(my_df.id):
for day in set(my_df.loc[my_df.id == user_id, 'date']):
my_df.loc[(my_df.id == user_id) & (my_df.date == day), 'travelTime'] = max(my_df.loc[(my_df.id == user_id) & (my_df.date == day), 'leaving_time'].apply(pd.to_datetime)) - min(my_df.loc[(my_df.id == user_id) & (my_df.date == day), 'arriving_time'].apply(pd.to_datetime))
我认为正确最大和最小值转换列,日期时间,然后减去Series
创建由GroupBy.transform
:
my_df['s'] = pd.to_datetime(my_df['date'] + ' ' + my_df['arriving_time'])
my_df['e'] = pd.to_datetime(my_df['date'] + ' ' + my_df['leaving_time'])
g = my_df.groupby(['id', 'date'])
my_df['travelTime'] = g['e'].transform('max').sub(g['s'].transform('min'))
print (my_df)
id date arriving_time leaving_time s \
0 a 2020/02/03 14:36:06 14:40:05 2020-02-03 14:36:06
1 b 2020/04/05 08:52:02 08:52:41 2020-04-05 08:52:02
2 b 2020/04/05 08:53:02 08:54:33 2020-04-05 08:53:02
3 b 2020/04/05 08:55:24 08:57:14 2020-04-05 08:55:24
4 b 2020/04/06 18:58:03 19:01:07 2020-04-06 18:58:03
5 b 2020/04/06 19:03:05 19:04:08 2020-04-06 19:03:05
6 c 2020/12/15 17:04:28 17:09:48 2020-12-15 17:04:28
7 d 2020/06/23 21:31:23 21:50:12 2020-06-23 21:31:23
e travelTime
0 2020-02-03 14:40:05 00:03:59
1 2020-04-05 08:52:41 00:05:12
2 2020-04-05 08:54:33 00:05:12
3 2020-04-05 08:57:14 00:05:12
4 2020-04-06 19:01:07 00:06:05
5 2020-04-06 19:04:08 00:06:05
6 2020-12-15 17:09:48 00:05:20
7 2020-06-23 21:50:12 00:18:49
为了避免新列,可以使用DataFrame.assign
Series with datetimes
:
s = pd.to_datetime(my_df['date'] + ' ' + my_df['arriving_time'])
e = pd.to_datetime(my_df['date'] + ' ' + my_df['leaving_time'])
g = my_df.assign(s=s, e=e).groupby(['id', 'date'])
my_df['travelTime'] = g['e'].transform('max').sub(g['s'].transform('min'))
print (my_df)
id date arriving_time leaving_time travelTime
0 a 2020/02/03 14:36:06 14:40:05 00:03:59
1 b 2020/04/05 08:52:02 08:52:41 00:05:12
2 b 2020/04/05 08:53:02 08:54:33 00:05:12
3 b 2020/04/05 08:55:24 08:57:14 00:05:12
4 b 2020/04/06 18:58:03 19:01:07 00:06:05
5 b 2020/04/06 19:03:05 19:04:08 00:06:05
6 c 2020/12/15 17:04:28 17:09:48 00:05:20
7 d 2020/06/23 21:31:23 21:50:12 00:18:49
IIUC我们首先groupby
id
和date
,以获得最大和最小休假及到达时间。
然后是一个简单的减法。
df2 = df.groupby(['id','date']).agg(min_arrival=('arriving_time','min'),
max_leave=('leaving_time','max'))
df2['travelTime'] = pd.to_datetime(df2['max_leave']) - pd.to_datetime(df2['min_arrival'])
print(df2)
min_arrival max_leave travelTime
id date
a 2020-02-03 14:36:06 14:40:05 00:03:59
b 2020-04-05 08:52:02 08:57:14 00:05:12
2020-04-06 18:58:03 19:04:08 00:06:05
c 2020-12-15 17:04:28 17:09:48 00:05:20
d 2020-06-23 21:31:23 21:50:12 00:18:49
如果您希望将其恢复到原始 df 上,您可以使用transform
或将新增量中的值合并到您的原始值上:
df_new = (pd.merge(df,df2[['travelTime']],on=['date','id'],how='left')
id date arriving_time leaving_time travelTime
0 a 2020-02-03 14:36:06 14:40:05 00:03:59
1 b 2020-04-05 08:52:02 08:52:41 00:05:12
2 b 2020-04-05 08:53:02 08:54:33 00:05:12
3 b 2020-04-05 08:55:24 08:57:14 00:05:12
4 b 2020-04-06 18:58:03 19:01:07 00:06:05
5 b 2020-04-06 19:03:05 19:04:08 00:06:05
6 c 2020-12-15 17:04:28 17:09:48 00:05:20
7 d 2020-06-23 21:31:23 21:50:12 00:18:49
你可以试试这个——
my_df['arriving_time'] = pd.to_datetime(my_df['arriving_time'])
my_df['leaving_time'] = pd.to_datetime(my_df['leaving_time'])
my_df['travel_time'] = my_df.groupby(['id', 'date'])['leaving_time'].transform('max') - my_df.groupby(['id', 'date'])['arriving_time'].transform('min')
my_df
id date arriving_time leaving_time travel_time
0 a 2020/02/03 2020-03-19 14:36:06 2020-03-19 14:40:05 00:03:59
1 b 2020/04/05 2020-03-19 08:52:02 2020-03-19 08:52:41 00:05:12
2 b 2020/04/05 2020-03-19 08:53:02 2020-03-19 08:54:33 00:05:12
3 b 2020/04/05 2020-03-19 08:55:24 2020-03-19 08:57:14 00:05:12
4 b 2020/04/06 2020-03-19 18:58:03 2020-03-19 19:01:07 00:06:05
5 b 2020/04/06 2020-03-19 19:03:05 2020-03-19 19:04:08 00:06:05
6 c 2020/12/15 2020-03-19 17:04:28 2020-03-19 17:09:48 00:05:20
7 d 2020/06/23 2020-03-19 21:31:23 2020-03-19 21:50:12 00:18:49
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.