How to drop duplicates in pandas but keep more than the first
Let's say I have a pandas DataFrame:
import pandas as pd
df = pd.DataFrame({'a': [1,2,2,2,2,1,1,1,2,2]})
>> df
a
0 1
1 2
2 2
3 2
4 2
5 1
6 1
7 1
8 2
9 2
I want to drop duplicates once they exceed a certain threshold n, so that at most n of them remain. Let's say that n=3. Then my target DataFrame is:
>> df
a
0 1
1 2
2 2
3 2
5 1
6 1
7 1
8 2
9 2
EDIT: Each set of consecutive repetitions is considered separately. In this example, rows 8 and 9 should be kept.
Use boolean indexing with groupby.cumcount:
N = 3
df[df.groupby('a').cumcount().lt(N)]
Output:
a
0 1
1 2
2 2
3 2
5 1
6 1
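To see why the mask works, it can help to print the counter next to the data. This is a minimal sketch using the DataFrame from the question; the names counter, mask and kept_index are only illustrative. Note that grouping on 'a' alone counts every occurrence of a value across the whole column, not each consecutive run.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 2, 2, 2, 1, 1, 1, 2, 2]})
N = 3

# cumcount numbers the rows inside each group of equal values, starting at 0
counter = df.groupby('a').cumcount()

# keep only rows whose position inside their group is below N
mask = counter.lt(N)

# show the data, the per-group counter and the keep/drop decision side by side
print(df.assign(counter=counter, keep=mask))

kept_index = df[mask].index.tolist()
```

With this data, rows 8 and 9 are the fifth and sixth occurrence of 2, so they are dropped along with rows 4 and 7.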
For the last N:
df[df.groupby('a').cumcount(ascending=False).lt(N)]
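To count each run of consecutive values separately (so that rows 8 and 9 are kept), group on a run identifier instead:
df[df.groupby(df['a'].ne(df['a'].shift()).cumsum()).cumcount().lt(N)]
Output: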
a
0 1
1 2
2 2
3 2
5 1
6 1
7 1 # this is #3 of the local group
8 2
9 2
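If you need this in more than one place, the one-liner can be wrapped in a small helper. This is just a sketch: keep_first_n_per_run is a hypothetical name, not a pandas function, and it assumes the rows are already in the order in which runs should be detected.

```python
import pandas as pd

def keep_first_n_per_run(df, column, n):
    """Keep at most the first n rows of every run of consecutive equal values."""
    # A new run starts wherever the value differs from the previous row;
    # the cumulative sum of those "change" flags gives each run its own id
    run_id = df[column].ne(df[column].shift()).cumsum()
    # Number the rows inside each run from 0 and keep those below n
    return df[df.groupby(run_id).cumcount().lt(n)]

df = pd.DataFrame({'a': [1, 2, 2, 2, 2, 1, 1, 1, 2, 2]})
result = keep_first_n_per_run(df, 'a', 3)
print(result)
```

Only row 4 is removed here, because it is the fourth element of the first run of 2s; the second run of 2s (rows 8 and 9) starts counting from zero again.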
You can use it for many other operations, such as setting values or masking:
group = df['a'].ne(df['a'].shift()).cumsum()
m = df.groupby(group).cumcount().lt(N)
df.where(m)
a
0 1.0
1 2.0
2 2.0
3 2.0
4 NaN
5 1.0
6 1.0
7 1.0
8 2.0
9 2.0
df.loc[~m] = -1
a
0 1
1 2
2 2
3 2
4 -1
5 1
6 1
7 1
8 2
9 2
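Note that df.loc[~m] = -1 modifies df in place. If you would rather leave the original untouched, a possible variant is to pass the replacement value to where on the column. The sketch below uses the question's data; masked and filled are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 2, 2, 2, 1, 1, 1, 2, 2]})
N = 3

# id of each run of consecutive equal values
group = df['a'].ne(df['a'].shift()).cumsum()
# True for the first N rows of every run
m = df.groupby(group).cumcount().lt(N)

# where() without a replacement puts NaN in the surplus rows (dtype becomes float)
masked = df['a'].where(m)

# where() with a replacement returns a new Series and leaves df unchanged
filled = df['a'].where(m, -1)

print(filled)
```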
You can create a unique value for each consecutive group, then use groupby and head:
import numpy as np
group_value = np.cumsum(df.a.shift() != df.a)
df.groupby(group_value).head(3)
# result:
a
0 1
1 2
2 2
3 2
5 1
6 1
7 1
8 2
9 2
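As a quick sanity check, head on the run groups should give the same rows as the cumcount mask, and tail is the counterpart for keeping the last N rows of each run. This is a small sketch on the question's data; run_id, via_head, via_cumcount and last_n are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 2, 2, 2, 1, 1, 1, 2, 2]})
N = 3

# id of each run of consecutive equal values
run_id = df['a'].ne(df['a'].shift()).cumsum()

# first N rows of each run, two ways
via_head = df.groupby(run_id).head(N)
via_cumcount = df[df.groupby(run_id).cumcount().lt(N)]
same = via_head.equals(via_cumcount)

# last N rows of each run
last_n = df.groupby(run_id).tail(N)

print(same)
print(last_n)
```

Both head and tail keep the original index and row order, so the result can be used directly in place of the boolean-mask version.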