Python: Identifying consecutive zeros in a column, delete their row and start new numbering

I have the following df and I want to split the data into trips.

In: df = pd.DataFrame([[1001,0.054012973,0],[1001,0.44923679,12],[1001,0,1],[1001,0,1],[1001,0.44676617,2],
[1001,1.8310822,1],[1001,0,1],[1001,0,11],[1001,0,1],[1001,0,20],[1001,0,1],[1001,0,54],[1001,10.0604029,2],
[1001,11.642113,0],[1001,0,1],[1002,0,2],[1002,1.23463449,23],[1002,1.8310822,1],[1002,0,1]],
columns=['Dev_ID','Speed','Duration'])

out:    Dev_ID  Speed   Duration
    0   1001    0.054013    0
    1   1001    0.449237    12
    2   1001    0.000000    1
    3   1001    0.000000    1
    4   1001    0.446766    2
    5   1001    1.831082    1
    6   1001    0.000000    1
    7   1001    0.000000    11
    8   1001    0.000000    1
    9   1001    0.000000    20
    10  1001    0.000000    1
    11  1001    0.000000    54
    12  1001    10.060403   2
    13  1001    11.642113   0
    14  1001    0.000000    1
    15  1002    0.000000    2
    16  1002    1.234634    23
    17  1002    1.831082    1
    18  1002    0.000000    1

The criterion for splitting is a speed value of 0 that lasts longer than 120 sec. So for each Dev_ID I have to check whether there are consecutive zeros that last more than 120 sec. If the condition is true, I want to delete these rows (where the zeros last more than 120 sec) and start a new ID in the Trip_ID column. The result should look like this:

    Dev_ID  Speed   Duration    Trip_ID
0   1001    0.054013    0   10
1   1001    0.449237    12  10
2   1001    0.000000    1   10
3   1001    0.000000    1   10
4   1001    0.446766    2   10
5   1001    1.831082    1   10
6   1001    10.060403   2   11
7   1001    11.642113   0   11
8   1001    0.000000    1   11
9   1002    0.000000    2   12
10  1002    1.234634    23  12
11  1002    1.831082    1   12
12  1002    0.000000    1   12

I'm not totally sure I understood the condition, but I wrote some generic code that hopefully comes close and that you can adapt.

The key ideas are: use shift() on the Speed column to get the difference between consecutive rows, use np.where to get the indexes where that speed difference is 0, split those indexes into contiguous groups with get_contigous_index, and then, for every contiguous group, change 'Trip_ID' if the sum of the durations is > 120.
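The grouping trick behind the contiguous-index step can be looked at in isolation. The following is only a minimal sketch; the name contiguous_ranges and the sample list are made up for illustration. It relies on the fact that, within a run of consecutive integers, the position minus the value stays constant.

```python
from itertools import groupby

def contiguous_ranges(indexes):
    """Return (first, last) pairs for each run of consecutive integers."""
    ranges = []
    # Within a run of consecutive integers, position - value is constant,
    # so groupby puts every run into its own group.
    for _, run in groupby(enumerate(indexes), key=lambda pair: pair[0] - pair[1]):
        values = [value for _, value in run]
        ranges.append((values[0], values[-1]))
    return ranges

example = contiguous_ranges([2, 3, 7, 8, 9, 11])
print(example)
```

The input has to be sorted in ascending order for this to work, which is the case for the positions returned by np.where.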

I assumed your duration is in minutes; otherwise none of the intervals would be longer than 120.

import pandas as pd
import numpy as np
from itertools import groupby
from operator import itemgetter

df = pd.DataFrame([[1001,0.054012973,0],[1001,0.44923679,12],[1001,0,1],[1001,0,1],[1001,0.44676617,2],
[1001,1.8310822,1],[1001,0,1],[1001,0,11],[1001,0,1],[1001,0,20],[1001,0,1],[1001,0,54],[1001,10.0604029,2],
[1001,11.642113,0],[1001,0,1],[1002,0,2],[1002,1.23463449,23],[1002,1.8310822,1],[1002,0,1]],
columns=['Dev_ID','Speed','Duration'])
df['Duration'] = df['Duration']*60  # assumed minutes -> seconds
# object dtype, so that a string label can be written into the column later
df['Trip_ID'] = df['Dev_ID'].astype(object)

def get_contigous_index(indexes):
    # split a sorted sequence of integers into (first, last) pairs of consecutive runs
    ranges = []
    for k,g in groupby(enumerate(indexes),lambda x:x[0]-x[1]):
        group = list(map(int,map(itemgetter(1),g)))
        ranges.append((group[0],group[-1]))
    return ranges

for Dev_ID, data in df.groupby("Dev_ID"):
    # difference to the previous row; it is 0 only from the second row
    # of a run of equal speeds onwards, so the first row of a run is not flagged
    speed_diff = data['Speed'] - data['Speed'].shift(1)
    # np.where returns positions, so map them back to index labels for .loc
    diff_0 = data.index[np.where(speed_diff == 0)[0]]

    for fst_idx, lst_idx in get_contigous_index(diff_0):
        range_ = list(range(fst_idx,lst_idx+1))
        subgroup = data.loc[range_]
        if not subgroup.empty:
            if subgroup['Duration'].sum() > 120:
                df.loc[range_,'Trip_ID'] = "a_different_id"
print(df)

This will print a dataframe like this:

    Dev_ID      Speed  Duration         Trip_ID
0     1001   0.054013         0            1001
1     1001   0.449237       720            1001
2     1001   0.000000        60            1001
3     1001   0.000000        60            1001
4     1001   0.446766       120            1001
5     1001   1.831082        60            1001
6     1001   0.000000        60            1001
7     1001   0.000000       660  a_different_id
8     1001   0.000000        60  a_different_id
9     1001   0.000000      1200  a_different_id
10    1001   0.000000        60  a_different_id
11    1001   0.000000      3240  a_different_id
12    1001  10.060403       120            1001
13    1001  11.642113         0            1001
14    1001   0.000000        60            1001
15    1002   0.000000       120            1002
16    1002   1.234634      1380            1002
17    1002   1.831082        60            1002
18    1002   0.000000        60            1002
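Note that row 6 above is not flagged, because the speed difference only becomes 0 from the second row of a run. If whole runs of zero speed should be flagged, one possible alternative (a different technique from the code above, shown only as a sketch on made-up toy data) is to label the runs with a cumulative sum and add up the durations per run:

```python
import pandas as pd

# Toy data, made up for illustration; Duration is taken to be in seconds.
df = pd.DataFrame({
    "Dev_ID":   [1, 1, 1, 1, 1, 2, 2],
    "Speed":    [0.5, 0.0, 0.0, 0.0, 1.2, 0.0, 0.7],
    "Duration": [10, 60, 50, 30, 5, 200, 5],
})

is_zero = df["Speed"].eq(0)
# A new run starts whenever the zero/non-zero state or the device changes.
state_changed = is_zero.astype(int).diff().ne(0)
device_changed = df["Dev_ID"].ne(df["Dev_ID"].shift())
run_id = (state_changed | device_changed).cumsum()

# Total duration of each run, broadcast back to the rows of that run.
run_duration = df.groupby(run_id)["Duration"].transform("sum")
long_stop = is_zero & run_duration.gt(120)
print(df[long_stop])
```

Here long_stop is a boolean mask, so df[~long_stop] would give the frame without the long stops.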

Based on the suggested solution from @Dataman (many thanks), the code that worked for me is:

from itertools import groupby

zeros_idx = []  # index labels of the rows that belong to long stops
for Dev_ID, data in df.groupby("Dev_ID"):
    for k, g in groupby(data.iterrows(), lambda x: x[1]['Speed']):  # group consecutive equal speeds
        l = list(g)
        if l[0][1]['Speed'] == 0:  # check if the consecutive speeds are zeros
            dur = sum(x[1]['Duration'] for x in l)  # calculate how long speed 0 lasts
            if dur > 120:
                zeros_idx.append([x[0] for x in l])  # save indexes where speed = 0 for a long time
df.drop([item for sublist in zeros_idx for item in sublist], axis=0, inplace=True)  # delete long stops
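The code above deletes the long stops but does not yet fill the Trip_ID column. A possible way to number the trips afterwards is sketched below. It assumes the frame still carries its original, increasing integer index, so that removed rows show up as gaps in the labels; the frame kept and the starting value 10 (taken from the expected output in the question) are illustrative.

```python
import pandas as pd

# Hypothetical frame after the long stops were dropped: labels 6-11 are gone.
kept = pd.DataFrame(
    {"Dev_ID": [1001] * 9 + [1002] * 4},
    index=[0, 1, 2, 3, 4, 5, 12, 13, 14, 15, 16, 17, 18],
)

labels = kept.index.to_series()
# A new trip starts where the device changes or where rows were removed,
# i.e. where the original labels jump by more than 1.
device_changed = kept["Dev_ID"].ne(kept["Dev_ID"].shift())
gap_before = labels.diff().gt(1)

first_trip_id = 10  # assumption: matches the expected output in the question
kept["Trip_ID"] = (device_changed | gap_before).cumsum() + first_trip_id - 1
kept = kept.reset_index(drop=True)
print(kept)
```

Resetting the index at the end gives the consecutive row numbers shown in the expected result.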
