Remove rows when the occurrence of a column value in the data frame is less than a certain number, using pandas/python?
I have a data frame like this:
df
col1  col2
   A     1
   B     1
   C     2
   D     3
   D     2
   B     1
   D     5
I can see that the col1 values B and D occur more than once in the data frame.
I want to keep only the rows whose col1 value occurs more than once, so the final data frame will look like:
col1  col2
   B     1
   D     3
   D     2
   B     1
   D     5
How can I do this in the most efficient way using pandas/python?
You can use duplicated with keep=False, which returns True for all duplicate values in col1, and then simply use boolean indexing on the DataFrame:
df[df.col1.duplicated(keep=False)]
  col1  col2
1    B     1
3    D     3
4    D     2
5    B     1
6    D     5
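For reference, a minimal self-contained sketch of this answer; the DataFrame literal below is an assumption reconstructed from the question's example:

import pandas as pd

# rebuild the example frame from the question (assumed dtypes: object, int)
df = pd.DataFrame({'col1': ['A', 'B', 'C', 'D', 'D', 'B', 'D'],
                   'col2': [1, 1, 2, 3, 2, 1, 5]})

# keep=False marks every member of a duplicated group as True
mask = df.col1.duplicated(keep=False)
print(df[mask])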
Update
To keep values where col1 occurs more than thr times, use:
thr = 2
df[df.col1.duplicated(keep=False).groupby(df.col1).transform('sum').gt(thr)]
  col1  col2
3    D     3
4    D     2
6    D     5
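If it helps to see how the chained expression works, here is a sketch that unrolls it step by step; the intermediate names dupes and counts are mine:

thr = 2
dupes = df.col1.duplicated(keep=False)             # True for the B and D rows
counts = dupes.groupby(df.col1).transform('sum')   # 0 for A/C, 2 for B, 3 for D
print(df[counts.gt(thr)])                          # only the D rows exceed the threshold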
Use DataFrame.duplicated with the column col1 specified and keep=False, so it returns True for all duplicate rows, then filter by boolean indexing:
df = df[df.duplicated('col1', keep=False)]
print (df)
  col1  col2
1    B     1
3    D     3
4    D     2
5    B     1
6    D     5
If you need to specify a threshold, use transform with size and filter the same way as in the first solution:
df = df[df.groupby('col1')['col1'].transform('size') > 1]
print (df)
  col1  col2
1    B     1
3    D     3
4    D     2
5    B     1
6    D     5
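The same pattern also supports an arbitrary threshold; a sketch assuming thr = 2 as in the first answer's update and df as the original frame:

thr = 2
# transform('size') broadcasts each group's row count back to every row of that group
sizes = df.groupby('col1')['col1'].transform('size')
print(df[sizes > thr])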
Alternative solution with value_counts and map:
df = df[df['col1'].map(df['col1'].value_counts()) > 1]
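In case the mapping step is unclear, a short sketch of the intermediate objects (counts and per_row are my names, df the original frame):

counts = df['col1'].value_counts()   # D: 3, B: 2, A: 1, C: 1
per_row = df['col1'].map(counts)     # each row's count of its own col1 value
print(df[per_row > 1])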
If performance is not important, use DataFrameGroupBy.filter:
df = df.groupby('col1').filter(lambda x: len(x) > 1)
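This approach also works with an explicit threshold; a sketch assuming thr = 2 and the original df:

thr = 2
print(df.groupby('col1').filter(lambda x: len(x) > thr))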