pandas groupby filter, drop some groups
I have a groupby object:
grouped = df.groupby('name')
for k, group in grouped:
    print(group)
There are 3 groups: bar, foo and foobar.
  name  time
2  bar     5
3  bar     6
  name  time
0  foo     5
1  foo     2
     name  time
4  foobar    20
5  foobar     1
I need to filter these groups and drop every group that has no time greater than 5. In my example, the group foo should be dropped. I am trying to do it with the filter() function:
grouped.filter(lambda x: (x.max()['time']>5))
but x is obviously not just the group in DataFrame format.
Assuming your final line of code really should have a >5 rather than >20, you would do something similar to:
grouped.filter(lambda x: (x.time > 5).any())
As you correctly spotted, x is actually a DataFrame containing all rows where the name column matches the key you have in k in your for-loop.
So to filter on whether a group has any time larger than 5 in the time column, you use (x.time > 5).any() as the test, as above.
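To make this concrete, here is a minimal self-contained sketch. It rebuilds the question's DataFrame (values copied from the printed groups; the variable name kept is only for illustration) and applies the filter. It uses x["time"] instead of x.time, which is equivalent here.

```python
import pandas as pd

# Same rows as the groups printed in the question.
df = pd.DataFrame({
    "name": ["foo", "foo", "bar", "bar", "foobar", "foobar"],
    "time": [5, 2, 5, 6, 20, 1],
})

grouped = df.groupby("name")

# filter() calls the function once per group; x is that group's DataFrame.
# A group is kept only when the function returns True for it.
kept = grouped.filter(lambda x: (x["time"] > 5).any())

print(kept)
print(sorted(kept["name"].unique()))
```

The rows of foo are gone because neither of its times (5 and 2) is strictly greater than 5; the remaining rows keep their original index labels.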
I'm not used to Python, NumPy or pandas yet, but I was investigating a solution to a similar problem, so let me report my answers using this question as an example.
import pandas as pd

df = pd.DataFrame()
df['name'] = ['foo', 'foo', 'bar', 'bar', 'foobar', 'foobar']
df['time'] = [5, 2, 5, 6, 20, 1]
grouped = df.groupby('name')
for k, group in grouped:
    print(group)

# Answer1: collect the row indexes of the groups to drop, then drop those rows
indexes_should_drop = grouped.filter(lambda x: (x['time'].max() <= 5)).index
result1 = df.drop(index=indexes_should_drop)

# Answer2: find the names of the groups to keep, then select their rows
filter_time_max = grouped['time'].max() > 5
groups_should_keep = filter_time_max.loc[filter_time_max].index
result2 = df.loc[df['name'].isin(groups_should_keep)]

# Answer3: find the names of the groups to drop, then drop their rows
filter_time_max = grouped['time'].max() <= 5
groups_should_drop = filter_time_max.loc[filter_time_max].index
result3 = df.drop(df[df['name'].isin(groups_should_drop)].index)
result1, result2 and result3 all contain:
     name  time
2     bar     5
3     bar     6
4  foobar    20
5  foobar     1
My Answer1 doesn't use group names to drop groups. If you need the group names, you can get them by writing: df.loc[indexes_should_drop].name.unique()
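As a quick check of that expression, here is a small sketch that rebuilds the same data and recovers the dropped group names from the row indexes. The name dropped_names is only illustrative, and df.loc[rows, "name"] is used as an equivalent spelling of df.loc[rows].name.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["foo", "foo", "bar", "bar", "foobar", "foobar"],
    "time": [5, 2, 5, 6, 20, 1],
})
grouped = df.groupby("name")

# Row indexes belonging to groups whose maximum time is not above 5.
indexes_should_drop = grouped.filter(lambda x: x["time"].max() <= 5).index

# Look those rows up again and read the distinct group names off them.
dropped_names = df.loc[indexes_should_drop, "name"].unique()

print(list(indexes_should_drop))
print(list(dropped_names))
```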
grouped['time'].max() <= 5 and grouped.apply(lambda x: (x['time'].max() <= 5)).index returned the same results.
The index of filter_time_max holds the group names, so it cannot be used as-is as an index or label for dropping rows:
name
bar       False
foo        True
foobar    False
Name: time, dtype: bool
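Since that boolean Series is indexed by group name rather than by row, one way to bridge the gap is to look each row's name up in it. The sketch below is not one of the three answers above; it is an alternative under the same data, using Series.map for the lookup and groupby transform as a shortcut. The names row_mask, result_map, group_max and result_transform are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["foo", "foo", "bar", "bar", "foobar", "foobar"],
    "time": [5, 2, 5, 6, 20, 1],
})

# One boolean per group, indexed by group name (True means "drop").
filter_time_max = df.groupby("name")["time"].max() <= 5

# Look up each row's group name to get one boolean per row.
# astype(bool) is a defensive cast so that ~ behaves as logical negation.
row_mask = df["name"].map(filter_time_max).astype(bool)
result_map = df[~row_mask]

# Shortcut: transform broadcasts the group maximum onto every row,
# so the comparison is already aligned with df's own index.
group_max = df.groupby("name")["time"].transform("max")
result_transform = df[group_max > 5]

print(result_map)
print(result_transform)
```

Both variants avoid building an intermediate list of group names, at the cost of being a little less explicit about which groups were removed.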