Remove rows when a column value occurs in the data frame fewer than a certain number of times, using pandas/python?

I have a data frame like this:

df
col1    col2
A         1
B         1
C         2
D         3
D         2
B         1
D         5

I can see that the col1 values B and D occur more than once in the data frame.

I want to keep only the rows whose col1 value occurs more than once; the final data frame will look like:

col1     col2
 B         1
 D         3
 D         2
 B         1
 D         5

How can I do this in the most efficient way using pandas/python?
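For reference, a minimal sketch to reconstruct the example data frame used below (the values are copied from the question):

import pandas as pd

# Rebuild the example data frame from the question
df = pd.DataFrame({
    'col1': ['A', 'B', 'C', 'D', 'D', 'B', 'D'],
    'col2': [1, 1, 2, 3, 2, 1, 5],
})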

You can use duplicated with keep=False, which will return True for all duplicate values in col1, and then simply use boolean indexing on the dataframe:

df[df.col1.duplicated(keep=False)]

   col1  col2
1    B     1
3    D     3
4    D     2
5    B     1
6    D     5

To keep only the values where col1 occurs more than thr times, use:

thr = 2
df[df.col1.duplicated(keep=False).groupby(df.col1).transform('sum').gt(thr)]

   col1  col2
3    D     3
4    D     2
6    D     5
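To see why this works: the inner expression counts, for every row, how many duplicated rows share that row's col1 value, and only rows whose count exceeds thr survive. A small sketch of the intermediate result (the expected values in the comment assume the example data above):

counts = df.col1.duplicated(keep=False).groupby(df.col1).transform('sum')
print(counts.tolist())
# Expected: [0, 2, 0, 3, 3, 2, 3] -> only D (count 3) is greater than thr = 2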

Use DataFrame.duplicated with column col1 specified to search for duplicates and keep=False to return True for all duplicated rows, then filter by boolean indexing:

df = df[df.duplicated('col1', keep=False)]
print (df)
  col1  col2
1    B     1
3    D     3
4    D     2
5    B     1
6    D     5

If you need to specify a threshold, use transform with size and filter by boolean indexing in the same way as the first solution:

df = df[df.groupby('col1')['col1'].transform('size') > 1]
print (df)
  col1  col2
1    B     1
3    D     3
4    D     2
5    B     1
6    D     5

Alternative solution with value_counts and map:

df = df[df['col1'].map(df['col1'].value_counts()) > 1]
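The idea is that value_counts returns the frequency of each col1 value and map writes that frequency back onto every row, giving the same per-row counts as the transform solution. A small sketch of the intermediates (assuming the original example data):

counts = df['col1'].value_counts()   # D -> 3, B -> 2, A -> 1, C -> 1
per_row = df['col1'].map(counts)     # each row labelled with its value's count
df = df[per_row > 1]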

If performance is not important, use DataFrameGroupBy.filter:

df = df.groupby('col1').filter(lambda x: len(x) > 1)
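On the example data this keeps the same B and D rows as the other solutions. Note that filter evaluates the lambda once per group at the Python level, which is why it is usually slower than the vectorized transform/map approaches above.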
