
Drop consecutive duplicates in Pandas dataframe if repeated more than n times

Building off the question/solution here, I'm trying to set a parameter that will only remove consecutive duplicates if the same value occurs 5 (or more) times consecutively.

I'm able to apply the solution in the linked post, which uses .shift() to check whether the previous value (or a value a specified number of rows away, by adjusting the periods parameter) equals the current one. But how could I adjust this to check several consecutive values at once?
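For reference, the shift-based approach from the linked post looks roughly like this (a minimal sketch; it drops every consecutive duplicate regardless of run length, which is why it can't enforce a 5-repeat threshold on its own):

```python
import pandas as pd

# A small series with runs of repeated values
y = pd.Series([2, 2, 3, 3, 3, 3, 3, 4])

# Keep a row only if it differs from the previous row:
# this removes *all* consecutive duplicates, with no notion of run length
deduped = y[y != y.shift()]
print(deduped.tolist())  # [2, 3, 4]
```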

Suppose a dataframe that looks like this:

x    y

1    2
2    2
3    3
4    3
5    3
6    3
7    3
8    4
9    4
10   4
11   4
12   2
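For anyone following along, the frame above can be reconstructed like so:

```python
import pandas as pd

# Rebuild the example dataframe from the question
df = pd.DataFrame({
    'x': range(1, 13),
    'y': [2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 2],
})
print(df)
```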

I'm trying to achieve this:

x    y

1    2
2    2
3    3
8    4
9    4
10   4
11   4
12   2

We lose rows 4, 5, 6, 7 because we found five consecutive 3's in the y column, but keep rows 1 and 2 because we only find two consecutive 2's. Similarly, we keep rows 8, 9, 10, 11 because we only find four consecutive 4's in the y column.

Let's try cumsum on the differences to find the consecutive blocks, then groupby().transform('size') to get the size of each block:

thresh = 5
s = df['y'].diff().ne(0).cumsum()

small_size = s.groupby(s).transform('size') < thresh
first_rows = ~s.duplicated()

df[small_size | first_rows]

Output:

     x  y
0    1  2
1    2  2
2    3  3
7    8  4
8    9  4
9   10  4
10  11  4
11  12  2
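Putting the pieces together, this answer can be run end-to-end as follows (the frame is reconstructed from the question; the comments on the intermediates are mine):

```python
import pandas as pd

df = pd.DataFrame({
    'x': range(1, 13),
    'y': [2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 2],
})

thresh = 5

# Label each run of consecutive equal values: the label increments
# whenever y changes, so every block of repeats shares one label
s = df['y'].diff().ne(0).cumsum()

# Keep rows whose block is shorter than the threshold...
small_size = s.groupby(s).transform('size') < thresh
# ...plus the first row of every block, so long blocks still keep one row
first_rows = ~s.duplicated()

out = df[small_size | first_rows]
print(out['x'].tolist())  # [1, 2, 3, 8, 9, 10, 11, 12]
```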

Not straightforward; I would go with @Quang Hoang's answer.

Create a column that gives the number of times a value is duplicated. In this case I used np.where() and df.duplicated() and assigned any count > 4 to be NaN:

df['g'] = np.where(df.groupby('y').transform(lambda x: x.duplicated(keep='last').count()) > 4, np.nan, 1)

I then create two dataframes: one where I drop all the NaNs, and one with only NaNs. In the one with NaNs, I drop everything apart from the last index using .last_valid_index(). I then append them and sort by index using .sort_index(). I use .iloc[:, :2] to slice off the new column I created above:

df.dropna().append(df.loc[df[df.g.isna()].last_valid_index()]).sort_index().iloc[:,:2]

     x    y
0    1.0  2.0
1    2.0  2.0
6    7.0  3.0
7    8.0  4.0
8    9.0  4.0
9   10.0  4.0
10  11.0  4.0
11  12.0  2.0
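One caveat: DataFrame.append was removed in pandas 2.0, so on current versions the same idea needs pd.concat. Here is a sketch of the equivalent logic (note that the lambda in the answer just computes each y-group's size, so transform('size') is an equivalent shortcut; selecting the last row with a list of labels keeps the integer dtypes):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'x': range(1, 13),
    'y': [2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 2],
})

# As in the answer above: mark y-values repeated more than 4 times with NaN
df['g'] = np.where(df.groupby('y')['x'].transform('size') > 4, np.nan, 1)

# Rows without NaN, plus the last row of the NaN block
kept = df.dropna()
last_of_run = df.loc[[df[df['g'].isna()].last_valid_index()]]

# pandas >= 2.0: concat instead of the removed DataFrame.append,
# then drop the helper column g with iloc
out = pd.concat([kept, last_of_run]).sort_index().iloc[:, :2]
print(out)
```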
