简体   繁体   中英

Remove all groups with more than N observations

If a value occurs more than two times in a column I want to drop every row that it occurs in.

The input df would look like:

Name   Num
  X     1
  X     2
  Y     3
  Y     4
  X     5

The output df would look like:

Name   Num
  Y     3
  Y     4

I know it is possible to remove duplicates, but that only works if I want to remove the first or last duplicate that is found, not the nth duplicate.

df = df.drop_duplicates(subset = ['Name'], drop='third')

This code is completely wrong but it helps explain what I was trying to do.

Use GroupBy.cumcount for counter and filter all values less like 2 :

df1 = df[df.groupby('Name').cumcount() < 3]
print (df1)
  Name  Num
0    X    1
1    X    2
2    Y    3
3    Y    4

Detail :

print (df.groupby('Name').cumcount())
0    0
1    1
2    0
3    1
4    2
dtype: int64

EDIT

Filter by GroupBy.transform and GroupBy.size :

df1 = df[df.groupby('Name')['Num'].transform('size') < 3]
print (df1)
  Name  Num
2    Y    3
3    Y    4

Using head

df.groupby('Name').head(2)
Out[375]: 
  Name  Num
0    X    1
1    X    2
2    Y    3
3    Y    4

s=df.groupby('Name').size()<=2
df.loc[df.Name.isin(s[s].index)]
Out[380]: 
  Name  Num
2    Y    3
3    Y    4

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM