I have a data frame like this:
df
col1 col2
A 1
B 1
C 2
D 3
D 2
B 1
D 5
The values B and D each occur more than once in col1.
I want to keep only the rows whose col1 value occurs more than once, so the final data frame should look like this:
col1 col2
B 1
D 3
D 2
B 1
D 5
What is the most efficient way to do this with pandas/Python?
You can use duplicated with keep=False, which returns True for every value in col1 that appears more than once, and then simply use boolean indexing on the dataframe:
df[df.col1.duplicated(keep=False)]
col1 col2
1 B 1
3 D 3
4 D 2
5 B 1
6 D 5
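For anyone who wants to reproduce this, here is a minimal self-contained sketch. It rebuilds the data frame from the question (the construction with pd.DataFrame is my assumption about how the data was created) and prints the boolean mask before using it for indexing.

```python
import pandas as pd

# Rebuild the example data frame from the question (assumed construction).
df = pd.DataFrame({
    "col1": ["A", "B", "C", "D", "D", "B", "D"],
    "col2": [1, 1, 2, 3, 2, 1, 5],
})

# keep=False marks every occurrence of a repeated value as True,
# not only the second and later ones.
mask = df["col1"].duplicated(keep=False)
result = df[mask]

print(mask.tolist())
print(result)
```

With the default keep='first', the first B and the first D would be marked False and dropped, which is why keep=False is needed here.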
To keep only the rows whose col1 value occurs more than thr times, use:
thr = 2
df[df.col1.duplicated(keep=False).groupby(df.col1).transform('sum').gt(thr)]
col1 col2
3 D 3
4 D 2
6 D 5
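The one-liner above is fairly dense, so here is a sketch that splits it into steps. The intermediate names (is_dup, dup_count) are only illustrative, and the data frame is rebuilt from the question.

```python
import pandas as pd

# Example data frame from the question (assumed construction).
df = pd.DataFrame({
    "col1": ["A", "B", "C", "D", "D", "B", "D"],
    "col2": [1, 1, 2, 3, 2, 1, 5],
})

thr = 2

# Step 1: True for every row whose col1 value is repeated.
is_dup = df["col1"].duplicated(keep=False)

# Step 2: sum the True flags per col1 group and broadcast the sum
# back to every row, giving the number of repeated occurrences.
dup_count = is_dup.groupby(df["col1"]).transform("sum")

# Step 3: keep rows whose group count is strictly greater than thr.
result = df[dup_count.gt(thr)]

print(dup_count.tolist())
print(result)
```

Values that occur only once get a count of 0 here rather than 1, because their duplicated flag is False. That does not matter for any thr of 1 or more.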
Use DataFrame.duplicated, specifying column col1 to search for dupes and keep=False to return True for all dupe rows, then filter by boolean indexing:
df = df[df.duplicated('col1', keep=False)]
print (df)
col1 col2
1 B 1
3 D 3
4 D 2
5 B 1
6 D 5
If you need to specify a threshold, use transform with size and filter in the same way as in the first solution:
df = df[df.groupby('col1')['col1'].transform('size') > 1]
print (df)
col1 col2
1 B 1
3 D 3
4 D 2
5 B 1
6 D 5
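As a sketch of how the threshold could be made explicit, the snippet below pulls the 1 out into a variable. The name thr and the value 2 are my own choices for illustration, not part of the solution above.

```python
import pandas as pd

# Example data frame from the question (assumed construction).
df = pd.DataFrame({
    "col1": ["A", "B", "C", "D", "D", "B", "D"],
    "col2": [1, 1, 2, 3, 2, 1, 5],
})

thr = 2  # illustrative threshold: keep values occurring more than thr times

# transform('size') returns, for every row, the size of its col1 group.
group_sizes = df.groupby("col1")["col1"].transform("size")
result = df[group_sizes > thr]

print(group_sizes.tolist())
print(result)
```

With thr = 2 only the D rows remain, since D is the only value that occurs three times.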
An alternative solution with value_counts and map:
df = df[df['col1'].map(df['col1'].value_counts()) > 1]
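To show what map is doing here, the following sketch prints the intermediate per-row counts. The variable names are illustrative and the data frame is rebuilt from the question.

```python
import pandas as pd

# Example data frame from the question (assumed construction).
df = pd.DataFrame({
    "col1": ["A", "B", "C", "D", "D", "B", "D"],
    "col2": [1, 1, 2, 3, 2, 1, 5],
})

# value_counts returns a Series indexed by the distinct col1 values.
counts = df["col1"].value_counts()

# map looks up each row's col1 value in that Series, so every row
# receives the number of times its value occurs.
per_row = df["col1"].map(counts)
result = df[per_row > 1]

print(per_row.tolist())
print(result)
```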
If performance is not important, use DataFrameGroupBy.filter:
df = df.groupby('col1').filter(lambda x: len(x) > 1)
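As a quick sanity check, here is a sketch that runs all four approaches on the example data and compares the results. It assigns each result to a separate name instead of overwriting df, so every approach starts from the same input.

```python
import pandas as pd

# Example data frame from the question (assumed construction).
df = pd.DataFrame({
    "col1": ["A", "B", "C", "D", "D", "B", "D"],
    "col2": [1, 1, 2, 3, 2, 1, 5],
})

by_duplicated = df[df.duplicated("col1", keep=False)]
by_transform = df[df.groupby("col1")["col1"].transform("size") > 1]
by_value_counts = df[df["col1"].map(df["col1"].value_counts()) > 1]
by_filter = df.groupby("col1").filter(lambda g: len(g) > 1)

# DataFrame.equals compares values, index, and dtypes.
candidates = {
    "transform": by_transform,
    "value_counts": by_value_counts,
    "filter": by_filter,
}
for name, candidate in candidates.items():
    print(name, candidate.equals(by_duplicated))
```

The filter version calls a Python function once per group, which is the likely reason it is the slow option when col1 has many distinct values.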