Calculating row-wise time difference in Python

I want to calculate the travel time of each passenger in my data frame, based on the difference between the moment they first get on the bus and the moment they leave.

Here is the data frame:

import pandas as pd

my_df = pd.DataFrame({
    'id': ['a', 'b', 'b', 'b', 'b', 'b', 'c','d'],
    'date': ['2020/02/03', '2020/04/05', '2020/04/05', '2020/04/05','2020/04/06', '2020/04/06', '2020/12/15', '2020/06/23'],
    'arriving_time': ['14:36:06', '08:52:02', '08:53:02', '08:55:24', '18:58:03', '19:03:05', '17:04:28', '21:31:23'],
    'leaving_time': ['14:40:05', '08:52:41', '08:54:33', '08:57:14', '19:01:07', '19:04:08', '17:09:48', '21:50:12']
})
print(my_df)

output:

  id        date arriving_time leaving_time
0  a  2020/02/03      14:36:06     14:40:05
1  b  2020/04/05      08:52:02     08:52:41
2  b  2020/04/05      08:53:02     08:54:33
3  b  2020/04/05      08:55:24     08:57:14
4  b  2020/04/06      18:58:03     19:01:07
5  b  2020/04/06      19:03:05     19:04:08
6  c  2020/12/15      17:04:28     17:09:48
7  d  2020/06/23      21:31:23     21:50:12

However, there are two problems that I can't manage to solve myself:

  • Passengers are detected via their phone signal, but the signal is often unstable, which is why the same person can have many rows (like passenger b in the data set above). "arriving_time" is the time when the signal is detected and "leaving_time" is the time when the signal is lost.
  • To compute the travel time, I need to subtract, for each unique ID and for each trip, the earliest arriving_time from the latest leaving_time.
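To make the rule concrete, here is a minimal sketch for a single trip, using only the standard library. The `detections` list is just the three rows of passenger b on 2020/04/05 copied from the data frame above; the variable names are illustrative and not part of my real code.

from datetime import datetime

# Detections for passenger "b" on 2020/04/05 (arriving_time, leaving_time).
detections = [
    ("08:52:02", "08:52:41"),
    ("08:53:02", "08:54:33"),
    ("08:55:24", "08:57:14"),
]

fmt = "%H:%M:%S"
# Earliest moment the signal was detected during the trip.
first_arrival = min(datetime.strptime(a, fmt) for a, _ in detections)
# Latest moment the signal was lost during the trip.
last_leaving = max(datetime.strptime(l, fmt) for _, l in detections)

travel_time = last_leaving - first_arrival
print(travel_time)  # 0:05:12

The gaps between detections are ignored on purpose: only the first arrival and the last departure of the trip matter.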

Here is the result I want to obtain:

  id        date arriving_time leaving_time travelTime
0  a  2020/02/03      14:36:06     14:40:05   00:03:59
1  b  2020/04/05      08:52:02     08:52:41   00:05:12
2  b  2020/04/05      08:53:02     08:54:33   00:05:12
3  b  2020/04/05      08:55:24     08:57:14   00:05:12
4  b  2020/04/06      18:58:03     19:01:07   00:06:05
5  b  2020/04/06      19:03:05     19:04:08   00:06:05
6  c  2020/12/15      17:04:28     17:09:48   00:05:20
7  d  2020/06/23      21:31:23     21:50:12   00:18:49

As you can see, passenger b made two separate trips, and I want to compute how long each of them lasted.

I already tried the following code, which seems to work, but it is really slow (I think because my_df has a large number of rows):

for user_id in set(my_df.id):
    for day in set(my_df.loc[my_df.id == user_id, 'date']):
        my_df.loc[(my_df.id == user_id) & (my_df.date == day), 'travelTime'] = max(my_df.loc[(my_df.id == user_id) & (my_df.date == day), 'leaving_time'].apply(pd.to_datetime)) - min(my_df.loc[(my_df.id == user_id) & (my_df.date == day), 'arriving_time'].apply(pd.to_datetime))

To get correct maximum and minimum values, I think you should first convert the columns to datetimes, then subtract the Series created by GroupBy.transform:

my_df['s'] = pd.to_datetime(my_df['date'] + ' ' + my_df['arriving_time'])
my_df['e'] = pd.to_datetime(my_df['date'] + ' ' + my_df['leaving_time'])

g = my_df.groupby(['id', 'date'])
my_df['travelTime'] = g['e'].transform('max').sub(g['s'].transform('min'))
print(my_df)
  id        date arriving_time leaving_time                   s  \
0  a  2020/02/03      14:36:06     14:40:05 2020-02-03 14:36:06   
1  b  2020/04/05      08:52:02     08:52:41 2020-04-05 08:52:02   
2  b  2020/04/05      08:53:02     08:54:33 2020-04-05 08:53:02   
3  b  2020/04/05      08:55:24     08:57:14 2020-04-05 08:55:24   
4  b  2020/04/06      18:58:03     19:01:07 2020-04-06 18:58:03   
5  b  2020/04/06      19:03:05     19:04:08 2020-04-06 19:03:05   
6  c  2020/12/15      17:04:28     17:09:48 2020-12-15 17:04:28   
7  d  2020/06/23      21:31:23     21:50:12 2020-06-23 21:31:23   

                    e travelTime  
0 2020-02-03 14:40:05   00:03:59  
1 2020-04-05 08:52:41   00:05:12  
2 2020-04-05 08:54:33   00:05:12  
3 2020-04-05 08:57:14   00:05:12  
4 2020-04-06 19:01:07   00:06:05  
5 2020-04-06 19:04:08   00:06:05  
6 2020-12-15 17:09:48   00:05:20  
7 2020-06-23 21:50:12   00:18:49  
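Depending on the pandas version, the travelTime column may print as `0 days 00:03:59` rather than `00:03:59`. If the plain HH:MM:SS text from the question is wanted, a small formatter can be applied afterwards. This is only a sketch: `format_hms` is a hypothetical helper (not a pandas function), and it assumes non-negative durations.

import pandas as pd

def format_hms(td):
    """Format a Timedelta as HH:MM:SS (hypothetical helper; assumes td >= 0)."""
    total_seconds = int(td.total_seconds())
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"

# Stand-in for my_df['travelTime']: a few of the durations computed above.
travel = pd.Series(pd.to_timedelta(["00:03:59", "00:05:12", "00:18:49"]))
formatted = travel.apply(format_hms)
print(formatted.tolist())  # ['00:03:59', '00:05:12', '00:18:49']

The result is a column of strings, so it is best kept separate from the timedelta column if further arithmetic is needed.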

To avoid adding new columns to my_df, you can pass the datetime Series to DataFrame.assign instead:

s = pd.to_datetime(my_df['date'] + ' ' + my_df['arriving_time'])
e = pd.to_datetime(my_df['date'] + ' ' + my_df['leaving_time'])

g = my_df.assign(s=s, e=e).groupby(['id', 'date'])
my_df['travelTime'] = g['e'].transform('max').sub(g['s'].transform('min'))
print(my_df)
  id        date arriving_time leaving_time travelTime
0  a  2020/02/03      14:36:06     14:40:05   00:03:59
1  b  2020/04/05      08:52:02     08:52:41   00:05:12
2  b  2020/04/05      08:53:02     08:54:33   00:05:12
3  b  2020/04/05      08:55:24     08:57:14   00:05:12
4  b  2020/04/06      18:58:03     19:01:07   00:06:05
5  b  2020/04/06      19:03:05     19:04:08   00:06:05
6  c  2020/12/15      17:04:28     17:09:48   00:05:20
7  d  2020/06/23      21:31:23     21:50:12   00:18:49
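Since the question mentions a large number of rows, it may also help to pass an explicit `format` to `pd.to_datetime`, so pandas does not have to infer the layout of the strings. The sketch below repeats the approach above on a shortened copy of the sample data; the format string is an assumption based on how the dates and times look in the question, and any speed-up would need to be measured on the real data.

import pandas as pd

# Shortened copy of the sample data from the question.
my_df = pd.DataFrame({
    'id': ['a', 'b', 'b', 'b', 'b'],
    'date': ['2020/02/03', '2020/04/05', '2020/04/05', '2020/04/06', '2020/04/06'],
    'arriving_time': ['14:36:06', '08:52:02', '08:55:24', '18:58:03', '19:03:05'],
    'leaving_time': ['14:40:05', '08:52:41', '08:57:14', '19:01:07', '19:04:08'],
})

# Assumed layout of "date time" after concatenation.
fmt = '%Y/%m/%d %H:%M:%S'
s = pd.to_datetime(my_df['date'] + ' ' + my_df['arriving_time'], format=fmt)
e = pd.to_datetime(my_df['date'] + ' ' + my_df['leaving_time'], format=fmt)

# Latest leaving time minus earliest arriving time per (id, date) group.
g = my_df.assign(s=s, e=e).groupby(['id', 'date'])
my_df['travelTime'] = g['e'].transform('max') - g['s'].transform('min')

print(my_df['travelTime'].dt.total_seconds().tolist())
# [239.0, 312.0, 312.0, 365.0, 365.0]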

IIUC, we first group by id and date to get the earliest arrival time and the latest leaving time,

then do a simple subtraction.

df2 = df.groupby(['id','date']).agg(min_arrival=('arriving_time','min'),
                             max_leave=('leaving_time','max'))


df2['travelTime'] =  pd.to_datetime(df2['max_leave']) - pd.to_datetime(df2['min_arrival']) 


print(df2)

              min_arrival max_leave travelTime
id date                                       
a  2020-02-03    14:36:06  14:40:05   00:03:59
b  2020-04-05    08:52:02  08:57:14   00:05:12
   2020-04-06    18:58:03  19:04:08   00:06:05
c  2020-12-15    17:04:28  17:09:48   00:05:20
d  2020-06-23    21:31:23  21:50:12   00:18:49

If you want this back on your original df, you could use transform, or merge the new time delta values onto your original:

df_new = pd.merge(df, df2[['travelTime']], on=['date', 'id'], how='left')
print(df_new)

  id       date arriving_time leaving_time   travelTime
0  a 2020-02-03      14:36:06     14:40:05     00:03:59
1  b 2020-04-05      08:52:02     08:52:41     00:05:12
2  b 2020-04-05      08:53:02     08:54:33     00:05:12
3  b 2020-04-05      08:55:24     08:57:14     00:05:12
4  b 2020-04-06      18:58:03     19:01:07     00:06:05
5  b 2020-04-06      19:03:05     19:04:08     00:06:05
6  c 2020-12-15      17:04:28     17:09:48     00:05:20
7  d 2020-06-23      21:31:23     21:50:12     00:18:49
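As a variation on the merge, `DataFrame.join` with `on=` can look the values up directly in the MultiIndex of the aggregated frame. The sketch below is self-contained and uses a shortened copy of the sample data; the explicit `format='%H:%M:%S'` is an assumption added here so the time strings parse consistently.

import pandas as pd

df = pd.DataFrame({
    'id': ['a', 'b', 'b'],
    'date': ['2020/02/03', '2020/04/05', '2020/04/05'],
    'arriving_time': ['14:36:06', '08:52:02', '08:55:24'],
    'leaving_time': ['14:40:05', '08:52:41', '08:57:14'],
})

# String min/max is safe here because the times are zero-padded HH:MM:SS.
df2 = df.groupby(['id', 'date']).agg(min_arrival=('arriving_time', 'min'),
                                     max_leave=('leaving_time', 'max'))
df2['travelTime'] = (pd.to_datetime(df2['max_leave'], format='%H:%M:%S')
                     - pd.to_datetime(df2['min_arrival'], format='%H:%M:%S'))

# The order of `on` must match the index levels of df2: (id, date).
df_new = df.join(df2['travelTime'], on=['id', 'date'])
print(df_new['travelTime'].dt.total_seconds().tolist())  # [239.0, 312.0, 312.0]

Unlike the merge, the join keeps the original index of df.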

You could try this:

my_df['arriving_time'] = pd.to_datetime(my_df['arriving_time'])
my_df['leaving_time'] = pd.to_datetime(my_df['leaving_time'])
my_df['travel_time'] = my_df.groupby(['id', 'date'])['leaving_time'].transform('max') - my_df.groupby(['id', 'date'])['arriving_time'].transform('min')
my_df
    id        date       arriving_time        leaving_time travel_time
0  a  2020/02/03 2020-03-19 14:36:06 2020-03-19 14:40:05    00:03:59
1  b  2020/04/05 2020-03-19 08:52:02 2020-03-19 08:52:41    00:05:12
2  b  2020/04/05 2020-03-19 08:53:02 2020-03-19 08:54:33    00:05:12
3  b  2020/04/05 2020-03-19 08:55:24 2020-03-19 08:57:14    00:05:12
4  b  2020/04/06 2020-03-19 18:58:03 2020-03-19 19:01:07    00:06:05
5  b  2020/04/06 2020-03-19 19:03:05 2020-03-19 19:04:08    00:06:05
6  c  2020/12/15 2020-03-19 17:04:28 2020-03-19 17:09:48    00:05:20
7  d  2020/06/23 2020-03-19 21:31:23 2020-03-19 21:50:12    00:18:49
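Note that `pd.to_datetime` on a time-only string fills in the current date, which is why every row above shows 2020-03-19. The differences still come out right because both columns get the same date. If that placeholder date is unwanted, one possible alternative is to parse the times as durations with `pd.to_timedelta`. This is a sketch on a shortened copy of the sample data, and the `arrive` and `leave` names are illustrative.

import pandas as pd

my_df = pd.DataFrame({
    'id': ['a', 'b', 'b', 'b', 'b'],
    'date': ['2020/02/03', '2020/04/05', '2020/04/05', '2020/04/06', '2020/04/06'],
    'arriving_time': ['14:36:06', '08:52:02', '08:55:24', '18:58:03', '19:03:05'],
    'leaving_time': ['14:40:05', '08:52:41', '08:57:14', '19:01:07', '19:04:08'],
})

# Parse HH:MM:SS strings as offsets from midnight instead of full timestamps.
arrive = pd.to_timedelta(my_df['arriving_time'])
leave = pd.to_timedelta(my_df['leaving_time'])

g = my_df.assign(arrive=arrive, leave=leave).groupby(['id', 'date'])
my_df['travel_time'] = g['leave'].transform('max') - g['arrive'].transform('min')

print(my_df['travel_time'].dt.total_seconds().tolist())
# [239.0, 312.0, 312.0, 365.0, 365.0]

The original arriving_time and leaving_time columns stay as strings, since the parsed values only live in the temporary frame created by assign.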
