Dropping dataframe rows in time series dataframe using pandas

I have the following sequence of data as a pandas dataframe:

id,start,end,duration
303,2012-06-25 17:59:43,2012-06-25 18:01:29,105
404,2012-06-25 18:01:29,2012-06-25 18:01:55,25
303,2012-06-25 18:01:56,2012-06-25 18:02:06,10
303,2012-06-25 18:02:23,2012-06-25 18:02:44,21
404,2012-06-25 18:02:45,2012-06-25 18:02:51,6
303,2012-06-25 18:02:54,2012-06-25 18:03:17,23
404,2012-06-25 18:03:24,2012-06-25 18:03:41,17
303,2012-06-25 18:03:43,2012-06-25 18:05:51,128
101,2012-06-25 18:05:58,2012-06-25 18:24:22,1104
404,2012-06-25 18:24:24,2012-06-25 18:25:25,61
101,2012-06-25 18:25:25,2012-06-25 18:25:46,21
404,2012-06-25 18:25:49,2012-06-25 18:26:00,11
101,2012-06-25 18:26:01,2012-06-25 18:26:04,3
404,2012-06-25 18:26:05,2012-06-25 18:28:49,164
202,2012-06-25 18:28:52,2012-06-25 18:28:57,5
404,2012-06-25 18:29:00,2012-06-25 18:29:24,24

It should always be the case that id 404 appears again after each row with a different id.

For example, suppose the ids above are motion sensors in a house, e.g. 404: hallway, 202: bedroom, 303: kitchen, 201: study room, with the hallway in the middle. Moving from the bedroom to the kitchen to the study room and back to the bedroom should then trigger 202, 404, 303, 404, 201, 404, 202 in that order, because one always passes through the hallway (404) to reach any room. My output has cases that violate this sequence, and I want to drop such rows.

For example, in the snippet above, the following pairs of rows violate this:

303,2012-06-25 18:01:56,2012-06-25 18:02:06,10
303,2012-06-25 18:02:23,2012-06-25 18:02:44,21

303,2012-06-25 18:03:43,2012-06-25 18:05:51,128
101,2012-06-25 18:05:58,2012-06-25 18:24:22,1104

and therefore the rows below should be dropped (but of course I have a much larger dataset):

303,2012-06-25 18:02:23,2012-06-25 18:02:44,21
101,2012-06-25 18:05:58,2012-06-25 18:24:22,1104

I have tried shift and drop, but the result still has some inconsistencies.

df['id_ns'] = df['id'].shift(-1)
df['id_ps'] = df['id'].shift(1)

if (df['id'] != 404):
    df.drop(df[(df.id_ns != 404) & (df.id_ps != 404)].index, axis=0, inplace=True)

How best can I approach this?

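Use Series.ne and Series.shift, with the optional parameter fill_value, to create a boolean mask, then use this mask to filter out the rows. The mask is True for every non-404 row whose previous row is also non-404; fill_value=404 stands in for the missing predecessor of the first row, so the first row is never flagged: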

mask = df['id'].ne(404) & df['id'].shift(fill_value=404).ne(404)
df = df[~mask]
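
To see what the mask selects, here is a minimal, self-contained sketch on a handful of ids. The frame name `sample`, the helper variable `prev_id`, and the shortened id list are made up for illustration; only the `id` column matters to the mask.

```python
import pandas as pd

# Illustrative ids only (not the full dataset from the question).
sample = pd.DataFrame({"id": [303, 404, 303, 303, 404, 303, 101, 404]})

# Id of the previous row. The first row has no predecessor, so
# fill_value=404 makes it look as if the hallway came before it.
prev_id = sample["id"].shift(fill_value=404)

# True where a non-404 row directly follows another non-404 row.
mask = sample["id"].ne(404) & prev_id.ne(404)

# Keep everything the mask did not flag.
cleaned = sample[~mask]
print(cleaned["id"].tolist())  # [303, 404, 303, 404, 303, 404]
```

In a run of several consecutive non-404 rows, every row after the first is flagged, so a single pass is enough and only the first row of the run survives.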

Result on the question's data:

print(df)
     id                start                  end  duration
0   303  2012-06-25 17:59:43  2012-06-25 18:01:29       105
1   404  2012-06-25 18:01:29  2012-06-25 18:01:55        25
2   303  2012-06-25 18:01:56  2012-06-25 18:02:06        10
4   404  2012-06-25 18:02:45  2012-06-25 18:02:51         6
5   303  2012-06-25 18:02:54  2012-06-25 18:03:17        23
6   404  2012-06-25 18:03:24  2012-06-25 18:03:41        17
7   303  2012-06-25 18:03:43  2012-06-25 18:05:51       128
9   404  2012-06-25 18:24:24  2012-06-25 18:25:25        61
10  101  2012-06-25 18:25:25  2012-06-25 18:25:46        21
11  404  2012-06-25 18:25:49  2012-06-25 18:26:00        11
12  101  2012-06-25 18:26:01  2012-06-25 18:26:04         3
13  404  2012-06-25 18:26:05  2012-06-25 18:28:49       164
14  202  2012-06-25 18:28:52  2012-06-25 18:28:57         5
15  404  2012-06-25 18:29:00  2012-06-25 18:29:24        24
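
On a much larger dataset it may be worth checking that nothing slipped through. The sketch below is one way to do that, assuming the same rule as the mask above; `count_violations` and its `hub` parameter are hypothetical names, not part of pandas. The ids used are the first ten from the question.

```python
import pandas as pd

def count_violations(ids: pd.Series, hub: int = 404) -> int:
    """Count non-hub rows whose previous row is also a non-hub row."""
    prev = ids.shift(fill_value=hub)
    return int((ids.ne(hub) & prev.ne(hub)).sum())

# First ten ids from the question's data.
before = pd.Series([303, 404, 303, 303, 404, 303, 404, 303, 101, 404])

# Same filter as in the answer, applied to the Series directly.
after = before[~(before.ne(404) & before.shift(fill_value=404).ne(404))]

print(count_violations(before))  # 2
print(count_violations(after))   # 0
```

This only checks the case the mask handles (a non-404 row following another non-404 row); it does not flag two 404 rows in a row.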
