简体   繁体   中英

Drop rows at beginning of group with a specific value in pandas groupby

I have a pandas (multi-index) dataframe as follows:

             date       Volume
Account ID                     
10001   2  02-03-2017   0
        3  02-03-2017   0
        3  09-03-2017   0
        3  16-03-2017   50
        3  21-03-2017   65
        3  28-03-2017   0
        3  04-04-2017   0
        3  11-04-2017   60
10002   5  02-03-2017   14.5
        6  09-03-2017   14.5
        3  09-03-2017   0
        3  16-03-2017   0
        3  21-03-2017   20
        3  28-03-2017   33
10003   8  21-03-2017   14.5
        9  28-03-2017   15.0

Now I want to delete all rows at the beginning of a series (dates of an account-product combination) with volume 0. So I want to keep the rows with volume 0 in case they are in the middle or at the end of a series.

So in the above example, I'd want the following output:

             date       Volume
Account ID                     
10001   3  16-03-2017   50
        3  21-03-2017   65
        3  28-03-2017   0
        3  04-04-2017   0
        3  11-04-2017   60
10002   5  02-03-2017   14.5
        6  09-03-2017   14.5
        3  21-03-2017   20
        3  28-03-2017   33
10003   8  21-03-2017   14.5
        9  28-03-2017   15.0

Currently, I've been removing complete series with a filter, eg

df = data.groupby(level = acc_prod).filter(lambda x: len(x) > 26)

And I've seen examples of removing only the first row; Python: Pandas - Delete the first row by group . Yet I do not know how to only delete rows of zero at the beginning of an account-product series.

Would be great if someone could help me out on this!

You can use boolean indexing with mask created by groupby with cumsum and find values which are not 0 :

print (df.groupby(level=['Account','ID'])['Volume'].cumsum())
Account  ID
10001    2       0.0
         3       0.0
         3       0.0
         3      50.0
         3     115.0
         3     115.0
         3     115.0
         3     175.0
10002    5      14.5
         6      14.5
         3       0.0
         3       0.0
         3      20.0
         3      53.0
10003    8      14.5
         9      15.0
Name: Volume, dtype: float64
mask = df.groupby(level=['Account','ID'])['Volume'].cumsum() != 0
#!= is same as ne function
#mask = df.groupby(level=['Account','ID'])['Volume'].cumsum().ne(0)
print (mask)
Account  ID
10001    2     False
         3     False
         3     False
         3      True
         3      True
         3      True
         3      True
         3      True
10002    5      True
         6      True
         3     False
         3     False
         3      True
         3      True
10003    8      True
         9      True
Name: Volume, dtype: bool
print (df[mask])
                  date  Volume
Account ID                    
10001   3   16-03-2017    50.0
        3   21-03-2017    65.0
        3   28-03-2017     0.0
        3   04-04-2017     0.0
        3   11-04-2017    60.0
10002   5   02-03-2017    14.5
        6   09-03-2017    14.5
        3   21-03-2017    20.0
        3   28-03-2017    33.0
10003   8   21-03-2017    14.5
        9   28-03-2017    15.0

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM