
pandas filter rows from dataframe with consecutive difference < n

I have a pandas DataFrame like this:

id         time
1             1
2             3
3             4
4             5
5             8
6             8

and I want to drop rows that are less than 2 seconds apart. I started by computing the time difference between consecutive rows and adding it as a column:

df['time_since_last_detect'] = df.time.diff().fillna(0)

resulting in:

id         time       time_since_last_detect
1             1                            0
2             3                            2
3             4                            1
4             5                            1
5             8                            3
6             8                            0

and then filtering the rows using df[df.time_since_last_detect > 1], which results in:

id         time       time_since_last_detect
2             3                            2
5             8                            3

The problem with this, however, is that the difference is not recomputed against the new previous row once a row is dropped. For example, after removing the first and third rows, the difference between the second and fourth rows is 2, yet the fourth row is still removed by this filter, which I don't want. What is the best way to solve this problem? This is the desired result I'm trying to achieve:

id         time       time_since_last_detect
2             3                            2
4             5                            1
5             8                            3
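
To make the issue concrete, here is a minimal sketch using the data from the question. It reproduces the one-pass filter and then checks the gap between rows 2 and 4 by hand; the variable names `one_pass` and `gap` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6], "time": [1, 3, 4, 5, 8, 8]})

# One-pass filter: every row is compared with its ORIGINAL predecessor,
# even if that predecessor is itself dropped by the filter.
one_pass = df[df["time"].diff().fillna(0) > 1]
print(one_pass["id"].tolist())  # [2, 5]

# Once row 3 is gone, row 4 should be compared with row 2 instead.
time_row_2 = df.loc[df["id"] == 2, "time"].item()
time_row_4 = df.loc[df["id"] == 4, "time"].item()
gap = time_row_4 - time_row_2
print(gap)  # 2, so row 4 should survive
```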

This is not a perfect solution, but the following works for your case. It would need to be modified to make a generic function.

import pandas as pd

d = {'id': [1, 2, 3, 4, 5, 6], 'time': [1, 3, 4, 5, 8, 8]}
df = pd.DataFrame(data=d)

# Time elapsed since the previous row (0 for the first row)
df['time_since_last_detect'] = df.time.diff().fillna(0)
timeperiod = 2

# Rolling sum of the differences over the last `timeperiod` rows (2 in this case); change as needed
df['time_since_last_sum'] = df['time_since_last_detect'].rolling(min_periods=1, window=timeperiod).sum().fillna(0)

# Keep a row if its own difference is at least 2, OR if the rolling sum over the last 2 rows is exactly 2
df_final = df.loc[(df['time_since_last_detect'] >= 2) | (df['time_since_last_sum'] == 2)]

Output:

   id  time  time_since_last_detect  time_since_last_sum
   2     3                     2.0                  2.0
   4     5                     1.0                  2.0
   5     8                     3.0                  4.0
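
One possible way to generalize this is a sequential pass that compares each row with the most recently accepted row, rather than with its original neighbour. The sketch below is not the answer's method; the names `filter_min_gap`, `min_gap`, `time_col` and `keep_first` are hypothetical. It assumes the frame is already sorted by time, and it assumes the first row serves only as the starting reference (dropped by default, to match the desired result in the question).

```python
import pandas as pd


def filter_min_gap(df, min_gap, time_col="time", keep_first=False):
    """Keep rows whose time is at least `min_gap` after the last accepted row.

    Assumption: `df` is sorted in ascending order of `time_col`.
    The first row always becomes the initial reference point; it is only
    returned in the result when `keep_first` is True.
    """
    keep = []
    last = None  # time of the current reference row
    for t in df[time_col].tolist():
        if last is None:
            keep.append(keep_first)
            last = t
        elif t - last >= min_gap:
            keep.append(True)
            last = t  # this row becomes the new reference
        else:
            keep.append(False)  # too close to the reference; drop it
    mask = pd.Series(keep, index=df.index, dtype=bool)
    return df.loc[mask]


df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6], "time": [1, 3, 4, 5, 8, 8]})
result = filter_min_gap(df, min_gap=2)
print(result["id"].tolist())  # [2, 4, 5]
```

Passing keep_first=True would additionally retain the first row (ids 1, 2, 4, 5 for this data). The loop is plain Python because each decision depends on the previous one, which makes the logic awkward to express with a single vectorized diff.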


 