Filter out DataFrame rows whose category, as defined by a column, has too few observations in pandas

Question

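I have a DataFrame in which one column partitions the dataset into a set of categories. I want to drop the categories that have only a small number of observations.

Example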

import pandas as pd

df = pd.DataFrame({'c': ['c1', 'c2', 'c1', 'c3', 'c4', 'c5', 'c2'], 'v': [5, 2, 7, 1, 2, 8, 3]})

    c  v
0  c1  5
1  c2  2
2  c1  7
3  c3  1
4  c4  2
5  c5  8
6  c2  3

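For column c and n = 2, remove every row whose value of c occurs in fewer than n rows. Here that leaves only the rows for c1 and c2 (indices 0, 1, 2 and 6).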

Answer 1

Use pd.Series.value_counts, followed by Boolean indexing with pd.Series.isin:

counts = df['c'].value_counts()  # create series of counts
idx = counts[counts < 2].index   # filter for indices with < 2 counts

res = df[~df['c'].isin(idx)]     # filter dataframe

print(res)

    c  v
0  c1  5
1  c2  2
2  c1  7
6  c2  3
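
The same idea can be wrapped in a small helper that takes the threshold as a parameter. This is only a minimal sketch: the name filter_rare and its arguments are hypothetical and not part of pandas, and it selects the categories to keep rather than the ones to drop.

```python
import pandas as pd

def filter_rare(df, col, n):
    """Keep only rows whose value in `col` occurs at least `n` times.

    Hypothetical helper for illustration; not a pandas API.
    """
    counts = df[col].value_counts()      # occurrences of each category
    keep = counts[counts >= n].index     # categories that meet the threshold
    return df[df[col].isin(keep)]        # Boolean indexing on membership

df = pd.DataFrame({'c': ['c1', 'c2', 'c1', 'c3', 'c4', 'c5', 'c2'],
                   'v': [5, 2, 7, 1, 2, 8, 3]})
res = filter_rare(df, 'c', 2)
print(res)
```

One difference from the snippet above: value_counts skips NaN by default, so this version would also drop rows where the column is NaN, whereas filtering with ~isin(idx) would keep them.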

Answer 2

This can also be done with groupby, as follows:

mask = df.groupby('c').count().reset_index()
mask = mask.loc[mask['v'] < 2]
res = df[~df.c.isin(mask.c.values)]
print(res)

Output:

    c  v
0  c1  5
1  c2  2
2  c1  7
6  c2  3
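
A shorter groupby variant, offered as a sketch rather than as the answerer's method: transform('size') broadcasts each group's row count back onto the original rows, so a single comparison yields the row mask directly. The names n and group_sizes are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'c': ['c1', 'c2', 'c1', 'c3', 'c4', 'c5', 'c2'],
                   'v': [5, 2, 7, 1, 2, 8, 3]})
n = 2  # minimum number of rows a category needs in order to be kept

# For every row, the number of rows that share its value of 'c'
group_sizes = df.groupby('c')['c'].transform('size')

res = df[group_sizes >= n]  # keep rows from sufficiently large groups
print(res)
```

This avoids building an intermediate list of categories, at the cost of computing a size for every row.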

Answer 1: score 2, accepted, 2018-09-11 09:40:00

Answer 2: score 1, 2018-09-11 10:12:46
