
Drop rows from original Pandas dataframe if some GroupBy condition is met

I'm trying to construct a fast Pandas approach for dropping certain rows from a DataFrame when some condition is met. Specifically, I want to drop the first occurrence of each new value of some variable in the dataframe if another value in that row is equal to 0. This is perhaps easiest explained by example:

import numpy as np
import pandas as pd

foo = np.array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3])
bar = np.array([1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1])
df = pd.DataFrame({'foo': foo, 'bar': bar})

# So df is:
    foo  bar
0     1    1
1     1    0
2     1    1
3     1    0
4     1    1
5     1    0
6     1    1
7     1    0
8     1    1
9     1    0
10    1    1
11    2    0
12    2    1
13    2    0
14    2    1
15    3    1
16    3    1
17    3    0
18    3    1

I want to look at the first row where the 'foo' column takes a new value, and drop it from the dataframe if the 'bar' value in that row is 0.

I can find when this condition is met using groupby:

df.groupby('foo').first()

# Result:
     bar
foo
1      1
2      0
3      1

So I see that I need to drop the first row where foo = 2 (i.e. just drop the row with index 11 in my original dataframe). I cannot work out, however, how to use this groupby result as a mask for my original dataframe, since the shapes/sizes are different.

I found a related question on groupby modifications (Drop pandas dataframe rows based on groupby() condition), but in that example they drop ALL rows when the condition is met, whereas I only want to drop the first row.

Is this possible please?

Use Series.shift:

df.loc[~(df['foo'].ne(df['foo'].shift()) & df['bar'].eq(0))]
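
To see why this works, here is the same mask built step by step (a sketch on the example data; the intermediate names are just for illustration):

first_of_group = df['foo'].ne(df['foo'].shift())  # True where foo differs from the row above: indices 0, 11, 15
bar_is_zero = df['bar'].eq(0)                     # True wherever bar == 0
drop_mask = first_of_group & bar_is_zero          # True only at index 11
result = df.loc[~drop_mask]                       # keep every other row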

or

df.loc[df.duplicated(subset = 'foo') | df['bar'].ne(0)]
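
Here df.duplicated(subset='foo') is False exactly at the first occurrence of each foo value, so keeping rows that are repeats or have bar != 0 drops only the first occurrences where bar == 0. The same logic spelled out (illustrative names):

is_repeat = df.duplicated(subset='foo')  # False at the first row of each foo value
keep = is_repeat | df['bar'].ne(0)       # a first occurrence survives only if bar != 0
result = df.loc[keep]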

Timings show both are clearly faster than the drop_duplicates approach from the other answer:

%%timeit
df.loc[~(df['foo'].ne(df['foo'].shift()) & df['bar'].eq(0))]
#970 µs ± 51.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%%timeit
df.loc[df.duplicated(subset = 'foo') | df['bar'].ne(0)]
#1.34 ms ± 34 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%%timeit
df.loc[~df.index.isin(df.drop_duplicates(subset='foo').loc[lambda x: x.bar==0].index)]
#2.16 ms ± 109 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

If foo is grouped in consecutive runs, as in your example:

%%timeit
df.loc[~(df['foo'].diff().ne(0)&df['bar'].eq(0))]
#908 µs ± 15.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

or

%%timeit
df.loc[df['foo'].duplicated().add(df['bar']).ne(0)]
#787 µs ± 15.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
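
Both shortcuts lean on the shape of the example data. df['foo'].diff().ne(0) is True at index 0 (NaN != 0) and wherever foo changes value, which identifies first occurrences only when equal foo values sit in consecutive runs. The duplicated().add(...) variant works because duplicated() is False (i.e. 0) only at a first occurrence and bar is non-negative here, so the sum is 0 exactly when a first occurrence also has bar == 0.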

You can first find the first occurrence of each new foo, check whether bar is 0, and then use that as a mask to filter the original df.

df.loc[~df.index.isin(df.drop_duplicates(subset='foo').loc[lambda x: x.bar==0].index)]
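
Unpacked into steps (a sketch with illustrative names, same logic as the one-liner):

firsts = df.drop_duplicates(subset='foo')       # first row of each foo value: indices 0, 11, 15
bad_idx = firsts.loc[firsts['bar'] == 0].index  # of those, where bar == 0 -> index 11
result = df.loc[~df.index.isin(bad_idx)]        # drop those rows from the original frame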

Or to use groupby:

(
    df.groupby('foo').apply(lambda x: x.iloc[int(x.bar.iloc[0] == 0):])
    .reset_index(level=0, drop=True)
)
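
Inside the apply, int(x.bar.iloc[0] == 0) evaluates to 1 when a group's first row has bar == 0 and to 0 otherwise, so x.iloc[1:] skips the offending row while x.iloc[0:] returns the group unchanged; reset_index(level=0, drop=True) then removes the extra foo level that groupby().apply() adds to the index.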

The first approach (2.71 ms) is faster than the groupby method (3.93 ms) on your example.
