Python: Identifying consecutive zeros in a column, delete their row and start new numbering
I have the following df and I want to split the data into trips.
In: df = pd.DataFrame([[1001,0.054012973,0],[1001,0.44923679,12],[1001,0,1],[1001,0,1],[1001,0.44676617,2],
[1001,1.8310822,1],[1001,0,1],[1001,0,11],[1001,0,1],[1001,0,20],[1001,0,1],[1001,0,54],[1001,10.0604029,2],
[1001,11.642113,0],[1001,0,1],[1002,0,2],[1002,1.23463449,23],[1002,1.8310822,1],[1002,0,1]],
columns=['Dev_ID','Speed','Duration'])
out: Dev_ID Speed Duration
0 1001 0.054013 0
1 1001 0.449237 12
2 1001 0.000000 1
3 1001 0.000000 1
4 1001 0.446766 2
5 1001 1.831082 1
6 1001 0.000000 1
7 1001 0.000000 11
8 1001 0.000000 1
9 1001 0.000000 20
10 1001 0.000000 1
11 1001 0.000000 54
12 1001 10.060403 2
13 1001 11.642113 0
14 1001 0.000000 1
15 1002 0.000000 2
16 1002 1.234634 23
17 1002 1.831082 1
18 1002 0.000000 1
The criterion for splitting is a speed value of 0 lasting longer than 120 sec. So for each Dev_ID I have to check whether there are consecutive zeros that last more than 120 sec. If the condition is true, I want to delete these rows (where zeros last more than 120 sec) and start a new ID in the Trip_ID column. So the results should look like this:
Dev_ID Speed Duration Trip_ID
0 1001 0.054013 0 10
1 1001 0.449237 12 10
2 1001 0.000000 1 10
3 1001 0.000000 1 10
4 1001 0.446766 2 10
5 1001 1.831082 1 10
6 1001 10.060403 2 11
7 1001 11.642113 0 11
8 1001 0.000000 1 11
9 1002 0.000000 2 12
10 1002 1.234634 23 12
11 1002 1.831082 1 12
12 1002 0.000000 1 12
I'm not totally sure I understood the condition, but I wrote some generic code that should be close enough for you to adapt.
The key ideas are: use Series.shift() to get the difference between consecutive speeds, use np.where to get the positions where that speed difference is 0, split those positions into contiguous groups with get_contigous_index, and then, for every contiguous group, change 'Trip_ID' if the sum of the durations is > 120.
I assumed your duration is in minutes, otherwise none of the intervals would be greater than 120.
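The contiguous-grouping step mentioned above can be looked at in isolation. This is only a minimal sketch of the idea; the function name `contiguous_ranges` is made up for illustration and is not part of the answer's code. It relies on the fact that, for a sorted list of integers, position minus value stays constant within a run of consecutive values.

```python
from itertools import groupby


def contiguous_ranges(positions):
    """Collapse sorted integers into (first, last) pairs of contiguous runs."""
    ranges = []
    # enumerate index minus value is constant inside a run of consecutive integers
    for _, run in groupby(enumerate(positions), key=lambda pair: pair[0] - pair[1]):
        values = [value for _, value in run]
        ranges.append((values[0], values[-1]))
    return ranges


print(contiguous_ranges([3, 7, 8, 9, 10, 11]))  # [(3, 3), (7, 11)]
```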
import pandas as pd
import numpy as np
from itertools import groupby
from operator import itemgetter
df = pd.DataFrame([[1001,0.054012973,0],[1001,0.44923679,12],[1001,0,1],[1001,0,1],[1001,0.44676617,2],
[1001,1.8310822,1],[1001,0,1],[1001,0,11],[1001,0,1],[1001,0,20],[1001,0,1],[1001,0,54],[1001,10.0604029,2],
[1001,11.642113,0],[1001,0,1],[1002,0,2],[1002,1.23463449,23],[1002,1.8310822,1],[1002,0,1]],
columns=['Dev_ID','Speed','Duration'])
df['Duration'] = df['Duration']*60
df['Trip_ID'] = df['Dev_ID'].astype(object)  # object dtype so a string label can be stored later

def get_contigous_index(indexes):
    # Collapse sorted integer positions into (first, last) pairs of contiguous runs
    ranges = []
    for k, g in groupby(enumerate(indexes), lambda x: x[0] - x[1]):
        group = list(map(int, map(itemgetter(1), g)))
        ranges.append((group[0], group[-1]))
    return ranges

for Dev_ID, data in df.groupby("Dev_ID"):
    # A difference of 0 means the speed equals the previous row's speed
    speed_diff = data['Speed'] - data['Speed'].shift(1)
    diff_0 = np.where(speed_diff == 0)[0]  # positions within this group
    for fst_idx, lst_idx in get_contigous_index(diff_0):
        labels = data.index[fst_idx:lst_idx + 1]  # map positions back to index labels
        subgroup = data.loc[labels]
        if not subgroup.empty and subgroup['Duration'].sum() > 120:
            df.loc[labels, 'Trip_ID'] = "a_different_id"
print(df)
This will print a DataFrame like this:
Dev_ID Speed Duration Trip_ID
0 1001 0.054013 0 1001
1 1001 0.449237 720 1001
2 1001 0.000000 60 1001
3 1001 0.000000 60 1001
4 1001 0.446766 120 1001
5 1001 1.831082 60 1001
6 1001 0.000000 60 1001
7 1001 0.000000 660 a_different_id
8 1001 0.000000 60 a_different_id
9 1001 0.000000 1200 a_different_id
10 1001 0.000000 60 a_different_id
11 1001 0.000000 3240 a_different_id
12 1001 10.060403 120 1001
13 1001 11.642113 0 1001
14 1001 0.000000 60 1001
15 1002 0.000000 120 1002
16 1002 1.234634 1380 1002
17 1002 1.831082 60 1002
18 1002 0.000000 60 1002
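Note that comparing each speed with the previous row flags a row only when its speed equals the one before it, so the first row of a zero run is not flagged (row 6 above keeps Trip_ID 1001), and a repeated non-zero speed would be flagged too. A minimal sketch of an alternative that labels the zero runs directly, using the raw Duration values from the question; the names `is_zero`, `new_run`, `run_id` and `zero_runs` are illustrative only:

```python
import pandas as pd

df = pd.DataFrame(
    [[1001, 0.054012973, 0], [1001, 0.44923679, 12], [1001, 0, 1], [1001, 0, 1],
     [1001, 0.44676617, 2], [1001, 1.8310822, 1], [1001, 0, 1], [1001, 0, 11],
     [1001, 0, 1], [1001, 0, 20], [1001, 0, 1], [1001, 0, 54], [1001, 10.0604029, 2],
     [1001, 11.642113, 0], [1001, 0, 1], [1002, 0, 2], [1002, 1.23463449, 23],
     [1002, 1.8310822, 1], [1002, 0, 1]],
    columns=['Dev_ID', 'Speed', 'Duration'])

is_zero = df['Speed'].eq(0)
# A new run starts whenever the zero/non-zero state or the device changes
new_run = is_zero.astype(int).diff().ne(0) | df['Dev_ID'].diff().ne(0)
run_id = new_run.cumsum()

# Summarise only the zero runs: device, first/last row label, total duration
zero_rows = df[is_zero].assign(row=lambda d: d.index)
zero_runs = zero_rows.groupby(run_id[is_zero]).agg(
    Dev_ID=('Dev_ID', 'first'),
    first_row=('row', 'first'),
    last_row=('row', 'last'),
    total_duration=('Duration', 'sum'),
)
print(zero_runs)
```

In the raw units the longest zero run (rows 6 to 11) sums to 88, which is consistent with the remark that nothing exceeds 120 unless Duration is read as minutes.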
Based on the solution suggested by @Dataman (many thanks), the code that worked for me is:
from itertools import groupby
zeros_idx = []  # index labels of rows that belong to long stops
for Dev_ID, data in df.groupby("Dev_ID"):
    for k, g in groupby(data.iterrows(), lambda x: x[1]['Speed']):  # group consecutive equal speeds
        l = list(g)
        if l[0][1]['Speed'] == 0:  # check if the consecutive speeds are zeros
            dur = sum(x[1]['Duration'] for x in l)  # calculate how long speed 0 lasts
            if dur > 120:
                zeros_idx.append([x[0] for x in l])  # save indexes where speed = 0 for a long time
df.drop([item for sublist in zeros_idx for item in sublist], axis=0, inplace=True)  # delete long stops
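Neither snippet above assigns the Trip_ID numbering asked for in the question. Here is a minimal vectorized sketch of one way to do it. It assumes Duration is in minutes (as the first answer did), that a long stop is a run of zero speeds within one device whose total duration exceeds 120 s, and that numbering starts at 10 only because the expected output does. The names `long_stop`, `kept`, `new_device` and `after_stop` are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    [[1001, 0.054012973, 0], [1001, 0.44923679, 12], [1001, 0, 1], [1001, 0, 1],
     [1001, 0.44676617, 2], [1001, 1.8310822, 1], [1001, 0, 1], [1001, 0, 11],
     [1001, 0, 1], [1001, 0, 20], [1001, 0, 1], [1001, 0, 54], [1001, 10.0604029, 2],
     [1001, 11.642113, 0], [1001, 0, 1], [1002, 0, 2], [1002, 1.23463449, 23],
     [1002, 1.8310822, 1], [1002, 0, 1]],
    columns=['Dev_ID', 'Speed', 'Duration'])
df['Duration'] = df['Duration'] * 60  # assumption: minutes -> seconds

is_zero = df['Speed'].eq(0)
# Label runs: a new run starts when the zero state or the device changes
run_id = (is_zero.astype(int).diff().ne(0) | df['Dev_ID'].diff().ne(0)).cumsum()
run_duration = df.groupby(run_id)['Duration'].transform('sum')
long_stop = is_zero & (run_duration > 120)

# Drop the long stops, then start a new trip at every device change
# and at the first kept row following a dropped stop
kept = df.loc[~long_stop].copy()
new_device = kept['Dev_ID'].ne(kept['Dev_ID'].shift())
after_stop = long_stop.shift(fill_value=False).loc[kept.index]
kept['Trip_ID'] = (new_device | after_stop).cumsum() + 9  # first trip gets ID 10
kept = kept.reset_index(drop=True)
print(kept)
```

On the sample data this drops rows 6 to 11 and yields Trip_ID 10, 11 and 12, matching the expected output in the question.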