Python: filter pandas dataframe to keep specified number of rows based on a column
Filter out DataFrame rows that have insufficient number of observations based on a column defining a category in pandas
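I have a DataFrame in which one column partitions the dataset into a set of categories. I would like to remove the categories that have only a small number of observations.
Example: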
import pandas as pd
df = pd.DataFrame({'c': ['c1', 'c2', 'c1', 'c3', 'c4', 'c5', 'c2'], 'v': [5, 2, 7, 1, 2, 8, 3]})
c v
0 c1 5
1 c2 2
2 c1 7
3 c3 1
4 c4 2
5 c5 8
6 c2 3
For column c and n = 2, remove every row whose value in column c occurs in fewer than n rows, resulting in:
c v
0 c1 5
1 c2 2
2 c1 7
3 c2 3
Use pd.Series.value_counts to count the rows per category, then filter with Boolean indexing via pd.Series.isin:
counts = df['c'].value_counts() # create series of counts
idx = counts[counts < 2].index # filter for indices with < 2 counts
res = df[~df['c'].isin(idx)] # filter dataframe
print(res)
c v
0 c1 5
1 c2 2
2 c1 7
6 c2 3
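The same idea can be written with the threshold as a variable instead of a hard-coded 2. The following is only a sketch: the names n, row_counts and res_map are illustrative and not part of the answer above. It maps each row's category to that category's count and keeps the rows that meet the threshold.

```python
import pandas as pd

df = pd.DataFrame({'c': ['c1', 'c2', 'c1', 'c3', 'c4', 'c5', 'c2'],
                   'v': [5, 2, 7, 1, 2, 8, 3]})
n = 2  # assumed threshold: minimum number of rows a category needs to be kept

# Look up, for every row, how many rows share its category
row_counts = df['c'].map(df['c'].value_counts())

# Keep only the rows whose category has at least n rows
res_map = df[row_counts >= n]
print(res_map)
```

This avoids building the intermediate index of rare categories, at the cost of computing one count per row.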
This can also be done with groupby, as follows:
mask = df.groupby('c').count().reset_index()
mask = mask.loc[mask['v'] < 2]
res = df[~df.c.isin(mask.c.values)]
print(res)
Output:
c v
0 c1 5
1 c2 2
2 c1 7
6 c2 3
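Two other groupby variants may be worth considering; this is a sketch rather than part of the answer above, and the names n, keep, res_transform, res_filter and res_reset are illustrative. transform('size') broadcasts each group's row count back onto the original rows, and filter keeps or drops whole groups. Both preserve the original index (0, 1, 2, 6), so reset_index(drop=True) is used at the end to get the 0 to 3 index shown in the question's expected result.

```python
import pandas as pd

df = pd.DataFrame({'c': ['c1', 'c2', 'c1', 'c3', 'c4', 'c5', 'c2'],
                   'v': [5, 2, 7, 1, 2, 8, 3]})
n = 2  # assumed threshold: minimum number of rows a category needs to be kept

# Variant 1: per-row group size, aligned with df's index
keep = df.groupby('c')['c'].transform('size') >= n
res_transform = df[keep]

# Variant 2: keep whole groups for which the function returns True
res_filter = df.groupby('c').filter(lambda g: len(g) >= n)

# Renumber the rows if a fresh 0..k-1 index is wanted
res_reset = res_transform.reset_index(drop=True)
print(res_reset)
```

On large frames the transform variant is usually the faster of the two, since filter calls a Python function once per group.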