
Removing rows from a dataframe based on condition or value

Is there a way I can remove data from a df that has been grouped and sorted based on column values?

    id               time_stamp          df  rank
   002         2019-02-23 20:01:13.362  mdf   0
   002         2019-02-23 20:02:06.939  tof   1
   004         2019-03-01 02:30:33.332  mdf   0
   004         2019-03-01 02:34:21.134  tof   1

The data has been grouped by the id column and sorted by ascending timestamp. I want to remove every id whose rank 0 row does not have mdf in the df column: not just that row, but all other rows that are a part of that id as well.

For example, if 004 was not mdf for rank 0, I want to remove all 004 rows, if that makes sense.

Thanks for looking!

You could use boolean masking:

mask = df['df'].ne('mdf') & df['rank'].eq(0)
excl_id = df.loc[mask, 'id'].unique()

df[~df['id'].isin(excl_id)]
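
To see the masking approach end to end, here is a minimal sketch. The sample frame is an assumption made for illustration: it copies the question's layout but swaps the df values for id 004 so that its rank 0 row is not mdf. The name result is also introduced here and is not from the answer.

```python
import pandas as pd

# Hypothetical sample: same columns as the question, but id 004 is altered
# so that its rank 0 row is 'tof' instead of 'mdf'.
df = pd.DataFrame({
    "id": ["002", "002", "004", "004"],
    "time_stamp": [
        "2019-02-23 20:01:13.362",
        "2019-02-23 20:02:06.939",
        "2019-03-01 02:30:33.332",
        "2019-03-01 02:34:21.134",
    ],
    "df": ["mdf", "tof", "tof", "mdf"],
    "rank": [0, 1, 0, 1],
})

# Rows that are rank 0 but not 'mdf' mark an id for exclusion.
mask = df["df"].ne("mdf") & df["rank"].eq(0)
excl_id = df.loc[mask, "id"].unique()

# Keep only rows whose id was not marked.
result = df[~df["id"].isin(excl_id)]
print(result)
```

Note that this keeps an id that has no rank 0 row at all, since nothing marks it for exclusion. If such ids should also be dropped, a group-wise check is needed instead.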

Here is my solution:

import io
import pandas as pd

data = """
id,time_stamp,df,rank
002,2019-02-23 20:01:13.362,mdf,0
002,2019-02-23 20:02:06.939,tof,1
004,2019-03-01 02:30:33.332,mdf,0
004,2019-03-01 02:34:21.134,tof,1
005,2019-03-01 02:35:21.134,mdf,1
005,2019-03-01 02:35:24.134,tof,1
"""
# pd.compat.StringIO was removed in newer pandas versions; io.StringIO is the standard replacement
df = pd.read_csv(io.StringIO(data), sep=',')
print(df)

def process(x):   # keep a group only if it has an 'mdf' row with rank 0, so id 005 is dropped
    f = x[(x['df'] == 'mdf') & (x['rank'] == 0)]
    return not f.empty

df = df.groupby('id').filter(process).reset_index(drop=True)
print(df)

Output:

   id               time_stamp   df  rank
0   2  2019-02-23 20:01:13.362  mdf     0
1   2  2019-02-23 20:02:06.939  tof     1
2   4  2019-03-01 02:30:33.332  mdf     0
3   4  2019-03-01 02:34:21.134  tof     1
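
The ids print as 2 and 4 because read_csv parses the id column as integers, which drops the leading zeros. As a possible alternative to groupby().filter(), the sketch below uses a group-wise transform("any") to build a boolean mask, and reads id as a string to keep the zeros. This is a different technique from the one in the answer above, and the names is_mdf_first, keep, and result are illustrative only.

```python
import io
import pandas as pd

data = """id,time_stamp,df,rank
002,2019-02-23 20:01:13.362,mdf,0
002,2019-02-23 20:02:06.939,tof,1
004,2019-03-01 02:30:33.332,mdf,0
004,2019-03-01 02:34:21.134,tof,1
005,2019-03-01 02:35:21.134,mdf,1
005,2019-03-01 02:35:24.134,tof,1
"""
# dtype={"id": str} keeps ids such as "002" from being parsed as the integer 2.
df = pd.read_csv(io.StringIO(data), dtype={"id": str})

# True on rows that are both 'mdf' and rank 0.
is_mdf_first = df["df"].eq("mdf") & df["rank"].eq(0)

# Broadcast "does this id have at least one such row?" back to every row of the id.
keep = is_mdf_first.groupby(df["id"]).transform("any")

result = df[keep].reset_index(drop=True)
print(result)
```

Here id 005 is dropped because none of its rows is mdf at rank 0, which matches the behavior of the filter-based version.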


 