Pandas groupby filter: dropping some groups

Question

I have a groupby object:

grouped = df.groupby('name')
for k, group in grouped:
    print(group)

There are 3 groups: bar, foo and foobar.

  name  time  
2  bar     5  
3  bar     6  


  name  time  
0  foo     5  
1  foo     2  

  name      time  
4  foobar     20  
5  foobar     1

I need to filter these groups and drop every group that has no time greater than 5. In my example the group foo should be dropped. I am trying to do it with the filter() function:

grouped.filter(lambda x: (x.max()['time']>5))

but x is apparently not just the group in DataFrame format.

Answer 1

Assuming your final line of code really should have >5 rather than >20, you would do something similar to:

grouped.filter(lambda x: (x.time > 5).any())

As you correctly spotted, x is actually a DataFrame containing all rows where the name column matches the key you have in k in your for-loop.

So to filter on whether a group has any time larger than 5 in the time column, you test it with (x.time > 5).any() as above.
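For reference, here is a minimal self-contained sketch of that call. The DataFrame construction is an assumption rebuilt from the groups printed in the question, and x["time"] is used in place of x.time only to avoid relying on attribute-style column access.

```python
import pandas as pd

# Sample data rebuilt from the printed groups in the question (assumed construction).
df = pd.DataFrame({
    "name": ["foo", "foo", "bar", "bar", "foobar", "foobar"],
    "time": [5, 2, 5, 6, 20, 1],
})

grouped = df.groupby("name")

# filter() calls the function once per group, passing that group's rows as a
# DataFrame, and keeps the rows of every group for which it returns True.
kept = grouped.filter(lambda x: (x["time"] > 5).any())
print(kept)
```

The rows come back in their original order, so the foo rows (index 0 and 1) are simply missing from the result.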

Answer 2

I'm not used to Python, NumPy or pandas yet. But I was investigating a solution to a similar problem, so let me report my answers using this question as an example.

import pandas as pd

df = pd.DataFrame()
df['name'] = ['foo', 'foo', 'bar', 'bar', 'foobar', 'foobar']
df['time'] = [5, 2, 5, 6, 20, 1]

grouped = df.groupby('name')
for k, group in grouped:
    print(group)

My Answer 1:

indexes_should_drop = grouped.filter(lambda x: (x['time'].max() <= 5)).index
result1 = df.drop(index=indexes_should_drop)

My Answer 2:

filter_time_max = grouped['time'].max() > 5
groups_should_keep = filter_time_max.loc[filter_time_max].index
result2 = df.loc[df['name'].isin(groups_should_keep)]

My Answer 3:

filter_time_max = grouped['time'].max() <= 5
groups_should_drop = filter_time_max.loc[filter_time_max].index
result3 = df.drop(df[df['name'].isin(groups_should_drop)].index)

Results

    name    time
2   bar     5
3   bar     6
4   foobar  20
5   foobar  1
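As a possible fourth variant, not part of the answers above, the same rows can be selected with transform, which swaps the group-level mask for a row-aligned one. This is only a sketch; the names group_max and result4 are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["foo", "foo", "bar", "bar", "foobar", "foobar"],
    "time": [5, 2, 5, 6, 20, 1],
})

# transform("max") broadcasts each group's maximum back onto that group's rows,
# so group_max has the same index as df (one value per row, not per group).
group_max = df.groupby("name")["time"].transform("max")

# Because the mask is row-aligned, plain boolean indexing keeps the wanted groups.
result4 = df[group_max > 5]
print(result4)
```

Since the comparison already lines up with df row by row, no isin() or drop() step is needed.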

Points

My Answer 1 doesn't use group names to drop groups. If you need the group names, you can get them by writing: df.loc[indexes_should_drop].name.unique().

grouped['time'].max() <= 5 and grouped.apply(lambda x: (x['time'].max() <= 5)).index returned the same results.

The index of filter_time_max consists of group names, so it cannot be passed to drop as row labels as it is. In My Answer 3 it looks like this:

name
bar       False
foo        True
foobar    False
Name: time, dtype: bool
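One way to bridge that gap, sketched here with the hypothetical names row_mask and result, is to look up each row's group name in the group-level Series, which gives one boolean per row of df.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["foo", "foo", "bar", "bar", "foobar", "foobar"],
    "time": [5, 2, 5, 6, 20, 1],
})

# Group-level mask: True for groups whose maximum time is not above 5.
filter_time_max = df.groupby("name")["time"].max() <= 5

# Map every row's name through the group-level Series to get a row-aligned mask.
# astype(bool) is a guard in case the mapped result comes back as object dtype.
row_mask = df["name"].map(filter_time_max).astype(bool)

# Keep the rows whose group is NOT flagged for dropping.
result = df[~row_mask]
print(result)
```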
