Python Pandas groupby: filter according to condition on values

Consider a dataframe like the following.

import pandas as pd

# Initialize dataframe
df1 = pd.DataFrame(columns=['bar', 'foo'])
df1['bar'] = ['001', '001', '001', '001', '002', '002', '003', '003', '003']
df1['foo'] = [-1, 0, 2, 3, -8, 1, 0, 1, 2]
>>> print(df1)
   bar  foo
0  001   -1
1  001    0
2  001    2
3  001    3
4  002   -8
5  002    1
6  003    0
7  003    1
8  003    2

# Lower and upper bound for desired range
lower_bound = -5
upper_bound = 5

I would like to use groupby in Pandas to return a dataframe that filters out all rows of a bar that meets a condition. In particular, I would like to filter out the rows of a bar if any value of foo for that bar is not between lower_bound and upper_bound.

In the above example, rows with bar = 002 should be filtered out since not all of the rows with bar = 002 contain a value of foo between -5 and 5 (namely, row index 4 contains foo = -8 ). The desired output for this example is the following.

# Desired output
   bar  foo
0  001   -1
1  001    0
2  001    2
3  001    3
6  003    0
7  003    1
8  003    2

I have tried the following approach.

# Attempted solution
grouped = df1.groupby('bar')['foo']
grouped.filter(lambda x: x < lower_bound or x > upper_bound)

However, this raises TypeError: the filter must return a boolean result. Furthermore, this approach might return a groupby object, whereas I want the result to be a dataframe.

With pandas you generally cannot use and and or on a Series; use the vectorized & and | instead. For your case, call all() inside the filter so that the element-wise comparison is reduced to a single boolean per group. This keeps each bar whose foo values all lie between lower_bound and upper_bound:

df1.groupby('bar').filter(lambda x: ((x.foo >= lower_bound) & (x.foo <= upper_bound)).all())

#   bar foo
#0  001 -1
#1  001  0
#2  001  2
#3  001  3
#6  003  0
#7  003  1
#8  003  2
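As a variation on the same idea, the two comparisons can be written with Series.between, which is inclusive on both ends by default. This is only a sketch that rebuilds the example data from the question; the name result is introduced here for illustration.

```python
import pandas as pd

# Example data from the question
df1 = pd.DataFrame({
    'bar': ['001', '001', '001', '001', '002', '002', '003', '003', '003'],
    'foo': [-1, 0, 2, 3, -8, 1, 0, 1, 2],
})
lower_bound = -5
upper_bound = 5

# between() checks lower_bound <= foo <= upper_bound element-wise;
# all() collapses that to one boolean per group, which is what filter expects
result = df1.groupby('bar').filter(
    lambda g: g['foo'].between(lower_bound, upper_bound).all()
)
print(result)
```

Note that filter is called on the DataFrame groupby here, not on the foo column alone, so the result is a dataframe with the original index labels.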

Psidom's answer (the groupby().filter() approach above) works fine, but it can be slow on large datasets because the lambda is called once per group. Mine is somewhat of a workaround, but it is fast.

df1['conditions_apply'] = (df1.foo >= lower_bound) & (df1.foo <= upper_bound)
selection = df1.groupby('bar')['conditions_apply'].min()  # any False will return False
selection = selection[selection].index.tolist()           # get all bars with Trues
df1 = df1[df1.bar.isin(selection)]                        # make selection
df1 = df1.drop(columns=['conditions_apply'])              # drop the helper column (avoids inplace on a filtered slice)
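If adding and dropping a helper column is undesirable, a similar vectorized result can probably be had with groupby().transform('all'), which broadcasts the per-group result back onto every row. The sketch below rebuilds the example data from the question; the names in_range, keep and result are made up for illustration, and no timing was measured here.

```python
import pandas as pd

# Example data from the question
df1 = pd.DataFrame({
    'bar': ['001', '001', '001', '001', '002', '002', '003', '003', '003'],
    'foo': [-1, 0, 2, 3, -8, 1, 0, 1, 2],
})
lower_bound = -5
upper_bound = 5

# Row-level check: is foo within [lower_bound, upper_bound]?
in_range = df1['foo'].between(lower_bound, upper_bound)

# Group-level check broadcast back to rows: True only if every row of that bar is in range
keep = in_range.groupby(df1['bar']).transform('all')

# Boolean-mask the original frame; df1 itself is left unchanged
result = df1[keep]
print(result)
```

Because the mask is built from separate Series, df1 keeps its original columns and rows, and the filtered rows end up in result.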
