How to drop duplicates in pandas but keep more than the first

Let's say I have a pandas DataFrame:

import pandas as pd

df = pd.DataFrame({'a': [1,2,2,2,2,1,1,1,2,2]})
>> df
   a
0  1
1  2
2  2
3  2
4  2
5  1
6  1
7  1
8  2
9  2

I want to drop duplicates once a value repeats more than a certain threshold n, keeping only the first n rows of each repetition. Let's say that n=3. Then my target DataFrame is

>> df
   a
0  1
1  2
2  2
3  2
5  1
6  1
7  1
8  2
9  2

EDIT: Each set of consecutive repetitions is considered separately. In this example, rows 8 and 9 should be kept.

Use boolean indexing with groupby.cumcount:

N = 3
df[df.groupby('a').cumcount().lt(N)]

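Output (this counts every occurrence of a value, consecutive or not, so row 7, the fourth 1, and rows 4, 8 and 9, the fourth to sixth 2, are dropped):

   a
0  1
1  2
2  2
3  2
5  1
6  1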

For the last N:

df[df.groupby('a').cumcount(ascending=False).lt(N)]
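
To make the "last N" direction concrete, here is a minimal sketch on the question's df. The variable names from_end and last_n are only illustrative. Note that this still groups by value across the whole column, not by consecutive run.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 2, 2, 2, 1, 1, 1, 2, 2]})
N = 3

# Position of each row counted from the end of its value group
# (0 = last occurrence of that value in the whole column).
from_end = df.groupby('a').cumcount(ascending=False)

# Keep only the last N rows of each distinct value.
last_n = df[from_end.lt(N)]
print(last_n.index.tolist())  # [4, 5, 6, 7, 8, 9]
```

Row 0 is dropped because it is the fourth 1 counted from the end, and rows 1 to 3 are dropped because three later 2s follow them.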

Applying this to consecutive repetitions

df[df.groupby(df['a'].ne(df['a'].shift()).cumsum()).cumcount().lt(N)]

Output:

   a
0  1
1  2
2  2
3  2
5  1
6  1
7  1  # this is the third row of its local group
8  2
9  2
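
The one-liner above packs three steps together. As a rough sketch, they can be split into intermediate Series to see what each step contributes. The names starts, run_id, pos and steps are illustrative, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 2, 2, 2, 1, 1, 1, 2, 2]})
N = 3

# True wherever the value differs from the previous row, i.e. a new run starts.
# The first row is compared against NaN, so it always starts a run.
starts = df['a'].ne(df['a'].shift())

# A running total of run starts gives every run its own integer label.
run_id = starts.cumsum()

# 0-based position of each row inside its run.
pos = df.groupby(run_id).cumcount()

# Side-by-side view of the intermediate steps.
steps = pd.DataFrame({'a': df['a'], 'starts': starts, 'run_id': run_id, 'pos': pos})
print(steps)

# Keep rows whose position inside the run is below N.
result = df[pos.lt(N)]
print(result.index.tolist())  # [0, 1, 2, 3, 5, 6, 7, 8, 9]
```

Here run_id comes out as 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, so the two trailing 2s form their own run and are kept.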

Advantages of boolean indexing

You can use it for many other operations, such as setting values or masking:

group = df['a'].ne(df['a'].shift()).cumsum()
m = df.groupby(group).cumcount().lt(N)

df.where(m)

     a
0  1.0
1  2.0
2  2.0
3  2.0
4  NaN
5  1.0
6  1.0
7  1.0
8  2.0
9  2.0

df.loc[~m] = -1

   a
0  1
1  2
2  2
3  2
4 -1
5  1
6  1
7  1
8  2
9  2
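
The assignment with df.loc modifies df in place. If the original frame should stay intact, one possible alternative (a sketch, not from the original answer) is Series.mask, which returns a new Series. The name flagged is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 2, 2, 2, 1, 1, 1, 2, 2]})
N = 3

group = df['a'].ne(df['a'].shift()).cumsum()
m = df.groupby(group).cumcount().lt(N)

# Series.mask replaces values where the condition is True and returns
# a new Series, so df itself is left untouched.
flagged = df['a'].mask(~m, -1)
print(flagged.tolist())  # [1, 2, 2, 2, -1, 1, 1, 1, 2, 2]
```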

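You can create a unique value for each consecutive group, then use groupby and head:

import numpy as np

# Label each run of equal consecutive values with its own integer
group_value = np.cumsum(df.a.shift() != df.a)
# Keep at most the first 3 rows of every run
df.groupby(group_value).head(3)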

# result:

   a
0  1
1  2
2  2
3  2
5  1
6  1
7  1
8  2
9  2
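
If this is needed in more than one place, the groupby and head approach can be wrapped in a small function. The sketch below uses a hypothetical helper name, keep_first_n_per_run, which is not part of pandas, and builds the run label with pandas only so that no NumPy import is required.

```python
import pandas as pd


def keep_first_n_per_run(df, column, n):
    """Keep at most the first n rows of every run of equal consecutive values.

    Hypothetical helper for illustration; not a pandas API.
    """
    # A new run starts wherever the value differs from the previous row.
    run_id = df[column].ne(df[column].shift()).cumsum()
    # head(n) keeps the first n rows of each run and preserves the original order.
    return df.groupby(run_id).head(n)


df = pd.DataFrame({'a': [1, 2, 2, 2, 2, 1, 1, 1, 2, 2]})
out = keep_first_n_per_run(df, 'a', 3)
print(out.index.tolist())  # [0, 1, 2, 3, 5, 6, 7, 8, 9]
```

Swapping head for tail would keep the last n rows of each run instead. One caveat: NaN never compares equal to NaN, so consecutive missing values would each be treated as a separate run.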
