How to drop duplicates in pandas but keep more than the first

Question

Let's say I have a pandas DataFrame:

import pandas as pd

df = pd.DataFrame({'a': [1,2,2,2,2,1,1,1,2,2]})
>> df
   a
0  1
1  2
2  2
3  2
4  2
5  1
6  1
7  1
8  2
9  2

I want to drop duplicates when a value repeats more than a threshold n times, keeping only the first n of them. Let's say that n=3. Then my target DataFrame keeps every row except row 4, the fourth consecutive 2.

EDIT: Each set of consecutive repetitions is considered separately. In this example, rows 8 and 9 should be kept.

Answer 1

Use boolean indexing with groupby.cumcount. Note that grouping on 'a' counts every occurrence of a value in the whole column, not only consecutive ones:

N = 3
df[df.groupby('a').cumcount().lt(N)]

Output:

   a
0  1
1  2
2  2
3  2
5  1
6  1

To keep the last N instead:

df[df.groupby('a').cumcount(ascending=False).lt(N)]
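To make the difference between the two masks concrete, here is a small self-contained sketch on the question's data. Both variants count occurrences across the whole column rather than per consecutive run; the names first_n and last_n are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 2, 2, 2, 1, 1, 1, 2, 2]})
N = 3

# Rank of each row within its value group, counted from the top (0, 1, 2, ...)
first_n = df[df.groupby('a').cumcount().lt(N)]

# Same rank, but counted from the bottom of each value group
last_n = df[df.groupby('a').cumcount(ascending=False).lt(N)]

print(first_n.index.tolist())  # [0, 1, 2, 3, 5, 6]
print(last_n.index.tolist())   # [4, 5, 6, 7, 8, 9]
```

Rows 8 and 9 are dropped by the first mask because the value 2 has already appeared more than N times, which is why the consecutive-run variant is needed for the question as edited.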

Apply on consecutive repetitions

df[df.groupby(df['a'].ne(df['a'].shift()).cumsum()).cumcount().lt(N)]

Output:

   a
0  1
1  2
2  2
3  2
5  1
6  1
7  1  # this is the third row of its local (consecutive) group
8  2
9  2
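The one-liner packs three steps together. As a rough sketch, it can be unpacked into intermediate Series to see how each consecutive run gets its own id; the names starts, run_id and pos are introduced here for illustration only.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 2, 2, 2, 1, 1, 1, 2, 2]})

# True wherever the value differs from the previous row (start of a new run)
starts = df['a'].ne(df['a'].shift())

# Running total of run starts gives every consecutive run its own id
run_id = starts.cumsum()

# Position of each row inside its run: 0, 1, 2, ...
pos = df.groupby(run_id).cumcount()

# Show the intermediate columns side by side
print(df.assign(starts=starts, run_id=run_id, pos=pos))
```

Filtering on pos < N then keeps the first N rows of every run, so row 4 (pos 3 in the second run) is the only one removed for N = 3.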

Advantages of boolean indexing

You can use it for many other operations, such as setting values or masking:

group = df['a'].ne(df['a'].shift()).cumsum()
m = df.groupby(group).cumcount().lt(N)

df.where(m)
     a
0  1.0
1  2.0
2  2.0
3  2.0
4  NaN
5  1.0
6  1.0
7  1.0
8  2.0
9  2.0

df.loc[~m] = -1

   a
0  1
1  2
2  2
3  2
4 -1
5  1
6  1
7  1
8  2
9  2
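If the filter is needed in several places, the mask logic could be wrapped in a small helper. This is only a sketch: the function name keep_first_n_consecutive and its parameters are hypothetical and not part of pandas.

```python
import pandas as pd

def keep_first_n_consecutive(df: pd.DataFrame, col: str, n: int) -> pd.DataFrame:
    """Keep at most n rows from each run of consecutive equal values in col."""
    # A new run starts wherever the value differs from the previous row
    run_id = df[col].ne(df[col].shift()).cumsum()
    # Keep rows whose position inside their run is below n
    mask = df.groupby(run_id).cumcount().lt(n)
    return df[mask]

df = pd.DataFrame({'a': [1, 2, 2, 2, 2, 1, 1, 1, 2, 2]})
result = keep_first_n_consecutive(df, 'a', 3)
print(result.index.tolist())  # [0, 1, 2, 3, 5, 6, 7, 8, 9]
```

One caveat of this sketch: NaN never compares equal to NaN, so consecutive missing values would each be treated as their own run.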

Answer 2

You can create a unique value for each consecutive group, then use groupby and head:


import numpy as np

# Each change of value starts a new group; the cumulative sum labels the groups
group_value = np.cumsum(df.a.shift() != df.a)
df.groupby(group_value).head(3)

# result:

   a
0  1
1  2
2  2
3  2
5  1
6  1
7  1
8  2
9  2
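As a quick sanity check, the head approach and the cumcount mask from Answer 1 should select the same rows. The following sketch compares them on the question's data; via_head and via_mask are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 2, 2, 2, 1, 1, 1, 2, 2]})
N = 3

# Label each consecutive run with its own integer id
group_value = np.cumsum(df.a.shift() != df.a)

# Answer 2: first N rows of every run
via_head = df.groupby(group_value).head(N)

# Answer 1: boolean mask on the position inside every run
via_mask = df[df.groupby(group_value).cumcount().lt(N)]

print(via_head.equals(via_mask))
```

head is shorter when you only need to filter, while the mask is the better fit when you also want to set or blank out the excess rows. GroupBy.tail(N) would keep the last N rows of each run instead.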

Answer 1
2 2022-08-27 04:54:15

Answer 2
2 2022-08-27 05:07:45
