pandas: filter rows from a dataframe where the consecutive difference is < n

Question

I have a pandas dataframe like this:

id         time
1             1
2             3
3             4
4             5
5             8
6             8

and I want to drop rows that are less than 2 seconds apart. I started by computing the time difference between consecutive rows and adding it as a column:

df['time_since_last_detect'] = df.time.diff().fillna(0)

resulting in:

id         time       time_since_last_detect
1             1                            0
2             3                            2
3             4                            1
4             5                            1
5             8                            3
6             8                            0

and then filtering the rows using df[df.time_since_last_detect > 1], which results in:

id         time       time_since_last_detect
2             3                            2
5             8                            3

The problem with this, however, is that it does not recompute the difference from the new previous row once a row is dropped. For example, after removing the first and third rows, the difference between the second and the fourth is 2, but this filter removes the fourth row anyway, which I don't want. What is the best way to solve this problem? This is the result I'm trying to achieve:

id         time       time_since_last_detect
2             3                            2
4             5                            1
5             8                            3

Answer 1

Not a perfect solution, but you can do the following in your case. It would need to be modified to make a generic function.

import pandas as pd

d = {'id': [1, 2, 3, 4, 5, 6], 'time': [1, 3, 4, 5, 8, 8]}
df = pd.DataFrame(data=d)

# Gap between each row and the row directly before it (0 for the first row)
df['time_since_last_detect'] = df.time.diff().fillna(0)
timeperiod = 2

# Sum of the gaps over a rolling window of `timeperiod` rows (2 here); change as needed
df['time_since_last_sum'] = df['time_since_last_detect'].rolling(min_periods=1, window=timeperiod).sum().fillna(0)

# Keep a row if its own gap is at least 2, OR if the gaps over the last 2 rows sum to exactly 2
df_final = df.loc[(df['time_since_last_detect'] >= 2) | (df['time_since_last_sum'] == 2)]

Output:

   id  time  time_since_last_detect  time_since_last_sum
   2     3                     2.0                  2.0
   4     5                     1.0                  2.0
   5     8                     3.0                  4.0
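
One way to make this generic is a plain loop that compares each row against the last row that was kept, rather than against the row directly before it. This is only a sketch: filter_min_gap and its keep_first flag are hypothetical names, not pandas API, and the handling of the first row is an assumption. The desired result in the question omits row 1 yet still keeps row 2, so by default the sketch uses the first row as the reference point without keeping it.

```python
import pandas as pd


def filter_min_gap(df, min_gap, time_col="time", keep_first=False):
    """Keep rows whose time is at least `min_gap` after the last kept row.

    Hypothetical helper for illustration; assumes `df` is sorted by `time_col`.
    """
    keep = []
    last_kept = None
    for t in df[time_col]:
        if last_kept is None:
            # Assumption: the first row always becomes the reference point,
            # but it is only kept in the result when keep_first=True.
            keep.append(keep_first)
            last_kept = t
        elif t - last_kept >= min_gap:
            # Far enough from the last kept row: keep it and move the reference.
            keep.append(True)
            last_kept = t
        else:
            # Too close to the last kept row: drop it, reference stays put.
            keep.append(False)
    return df.loc[keep]


df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6], "time": [1, 3, 4, 5, 8, 8]})
result = filter_min_gap(df, min_gap=2)
print(result)
```

With the sample data this keeps ids 2, 4 and 5. Passing keep_first=True would also keep id 1. The loop runs in Python rather than being vectorized, because whether a row is kept depends on which earlier rows were kept.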

Answer 1: 2, accepted, 2020-04-22 15:12:49
