How to drop duplicates in pandas but keep more than the first

Let's say I have a pandas DataFrame:

import pandas as pd

df = pd.DataFrame({'a': [1,2,2,2,2,1,1,1,2,2]})
>> df
   a
0  1
1  2
2  2
3  2
4  2
5  1
6  1
7  1
8  2
9  2

I want to drop duplicates when there are more than a certain threshold n of them, so that only the first n remain. Let's say that n=3. Then my target DataFrame is:

>> df
   a
0  1
1  2
2  2
3  2
5  1
6  1
7  1
8  2
9  2

EDIT: Each set of consecutive repetitions is considered separately. In this example, rows 8 and 9 should be kept.

Use boolean indexing with groupby.cumcount:

N = 3
df[df.groupby('a').cumcount().lt(N)]

Output:

   a
0  1
1  2
2  2
3  2
5  1
6  1
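
To see why this keeps the rows it does, here is a minimal sketch that prints the intermediate cumcount values for the DataFrame from the question. The names pos and kept are only illustrative. Note that this variant counts over the whole column rather than per consecutive run, so rows 7, 8 and 9 are dropped as well.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 2, 2, 2, 1, 1, 1, 2, 2]})
N = 3

# Position of each row within its value group, counted over the whole column
# (not per consecutive run).
pos = df.groupby('a').cumcount()
print(pos.tolist())          # [0, 0, 1, 2, 3, 1, 2, 3, 4, 5]

# Keep only rows whose position within the group is below N.
kept = df[pos.lt(N)]
print(kept.index.tolist())   # [0, 1, 2, 3, 5, 6]
```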

To keep the last N instead:

df[df.groupby('a').cumcount(ascending=False).lt(N)]
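
As a rough illustration on the question's DataFrame, counting from the end of each value group keeps the last N occurrences of each value. The names pos_from_end and last_n are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 2, 2, 2, 1, 1, 1, 2, 2]})
N = 3

# ascending=False numbers the rows of each value group from the end,
# so the last occurrence gets 0, the one before it 1, and so on.
pos_from_end = df.groupby('a').cumcount(ascending=False)
print(pos_from_end.tolist())   # [3, 5, 4, 3, 2, 2, 1, 0, 1, 0]

last_n = df[pos_from_end.lt(N)]
print(last_n.index.tolist())   # [4, 5, 6, 7, 8, 9]
```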

Applying it to consecutive repetitions

df[df.groupby(df['a'].ne(df['a'].shift()).cumsum()).cumcount().lt(3)]

Output:

   a
0  1
1  2
2  2
3  2
5  1
6  1
7  1  # this is #3 of the local group
8  2
9  2
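
The one-liner above packs several steps together. A sketch that splits it up, using the question's DataFrame, may make the grouper easier to follow. The intermediate names starts, run_id and pos_in_run are made up for this illustration.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 2, 2, 2, 1, 1, 1, 2, 2]})

# True wherever the value differs from the previous row, i.e. a new run starts.
starts = df['a'].ne(df['a'].shift())

# The running total of run starts gives every consecutive run its own id.
run_id = starts.cumsum()
print(run_id.tolist())        # [1, 2, 2, 2, 2, 3, 3, 3, 4, 4]

# Position of each row inside its own run.
pos_in_run = df.groupby(run_id).cumcount()
print(pos_in_run.tolist())    # [0, 0, 1, 2, 3, 0, 1, 2, 0, 1]

result = df[pos_in_run.lt(3)]
print(result.index.tolist())  # [0, 1, 2, 3, 5, 6, 7, 8, 9]
```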

Advantages of boolean indexing

You can use the mask for many other operations, such as masking or setting values:

group = df['a'].ne(df['a'].shift()).cumsum()
m = df.groupby(group).cumcount().lt(N)

df.where(m)
     a
0  1.0
1  2.0
2  2.0
3  2.0
4  NaN
5  1.0
6  1.0
7  1.0
8  2.0
9  2.0

df.loc[~m] = -1

   a
0  1
1  2
2  2
3  2
4 -1
5  1
6  1
7  1
8  2
9  2

You can create a unique value for each consecutive group, then use groupby and head:


import numpy as np

group_value = np.cumsum(df.a.shift() != df.a)
df.groupby(group_value).head(3)

# result:

   a
0  1
1  2
2  2
3  2
5  1
6  1
7  1
8  2
9  2
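
If this is needed in more than one place, the head approach could be wrapped in a small function. The sketch below is only an illustration: keep_first_n_per_run is a hypothetical helper name, not a pandas function, and it uses the pandas cumsum method instead of np.cumsum. It also checks that head and the cumcount mask select the same rows on the question's data.

```python
import pandas as pd

def keep_first_n_per_run(df, column, n):
    """Hypothetical helper: keep at most n rows of each consecutive run in `column`."""
    # A new run starts wherever the value differs from the previous row.
    run_id = df[column].ne(df[column].shift()).cumsum()
    # head(n) keeps the first n rows of every run and preserves the original order.
    return df.groupby(run_id).head(n)

df = pd.DataFrame({'a': [1, 2, 2, 2, 2, 1, 1, 1, 2, 2]})

via_head = keep_first_n_per_run(df, 'a', 3)
via_mask = df[df.groupby(df['a'].ne(df['a'].shift()).cumsum()).cumcount().lt(3)]

print(via_head.index.tolist())    # [0, 1, 2, 3, 5, 6, 7, 8, 9]
print(via_head.equals(via_mask))  # True
```

One caveat with this run detection: NaN never compares equal to NaN, so each missing value would start its own run.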
