Is there a way I can remove data from a df that has been grouped and sorted based on column values?
id time_stamp df rank
002 2019-02-23 20:01:13.362 mdf 0
002 2019-02-23 20:02:06.939 tof 1
004 2019-03-01 02:30:33.332 mdf 0
004 2019-03-01 02:34:21.134 tof 1
The data has been grouped by the id column and sorted by ascending timestamp. I want to remove every id that does not have mdf as the df value at rank 0: not just that row, but all other rows that are part of that id as well.
For example, if 004 was not mdf for rank 0, I want to remove all 004 rows, if that makes sense.
Thanks for looking!
You could use boolean masking:
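# Flag rank-0 rows whose 'df' value is not 'mdf'
mask = df['df'].ne('mdf') & df['rank'].eq(0)
# Collect the ids those rows belong to
excl_id = df.loc[mask, 'id'].unique()
# Keep only the rows whose id was not flagged
df[~df['id'].isin(excl_id)]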
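To make the masking approach concrete, here is a minimal, self-contained sketch. The sample frame is made up for illustration: id 006 is a hypothetical group whose rank-0 row is tof, and the ids are assumed to be stored as strings.

```python
import pandas as pd

# Hypothetical sample data: id "006" has "tof" at rank 0, so it should be dropped.
df = pd.DataFrame({
    "id": ["002", "002", "004", "004", "006", "006"],
    "df": ["mdf", "tof", "mdf", "tof", "tof", "mdf"],
    "rank": [0, 1, 0, 1, 0, 1],
})

# Rows that sit at rank 0 but are not "mdf" mark the ids to exclude.
mask = df["df"].ne("mdf") & df["rank"].eq(0)
excl_id = df.loc[mask, "id"].unique()

# Keep every row whose id is not in the excluded set.
result = df[~df["id"].isin(excl_id)]
print(result)
```

One caveat with this mask: an id that has no rank-0 row at all is kept, because nothing flags it for exclusion.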
Here is my solution:
import io
import pandas as pd

data = """
id,time_stamp,df,rank
002,2019-02-23 20:01:13.362,mdf,0
002,2019-02-23 20:02:06.939,tof,1
004,2019-03-01 02:30:33.332,mdf,0
004,2019-03-01 02:34:21.134,tof,1
005,2019-03-01 02:35:21.134,mdf,1
005,2019-03-01 02:35:24.134,tof,1
"""
# pd.compat.StringIO was removed from pandas; io.StringIO does the same job
df = pd.read_csv(io.StringIO(data), sep=',')
print(df)

def process(x):
    # Keep a group only if it has an 'mdf' row at rank 0 (id 005 has none, so it is deleted)
    f = x[(x['df'] == 'mdf') & (x['rank'] == 0)]
    return not f.empty

df = df.groupby('id').filter(process).reset_index(drop=True)
print(df)
Output:
   id               time_stamp   df  rank
0   2  2019-02-23 20:01:13.362  mdf     0
1   2  2019-02-23 20:02:06.939  tof     1
2   4  2019-03-01 02:30:33.332  mdf     0
3   4  2019-03-01 02:34:21.134  tof     1
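The same keep-or-drop decision can also be made without a Python-level callback. The sketch below is an alternative, not part of the answer above: it broadcasts a per-group "has an mdf row at rank 0" flag back to every row with `transform("any")`. It also assumes you want the leading zeros in id preserved, so it passes `dtype={"id": str}` to `read_csv`.

```python
import io
import pandas as pd

data = """id,time_stamp,df,rank
002,2019-02-23 20:01:13.362,mdf,0
002,2019-02-23 20:02:06.939,tof,1
004,2019-03-01 02:30:33.332,mdf,0
004,2019-03-01 02:34:21.134,tof,1
005,2019-03-01 02:35:21.134,mdf,1
005,2019-03-01 02:35:24.134,tof,1
"""

# Read id as a string so "002" is not parsed as the integer 2.
df = pd.read_csv(io.StringIO(data), dtype={"id": str})

# True on rows that are both 'mdf' and rank 0.
is_first_mdf = df["df"].eq("mdf") & df["rank"].eq(0)

# Broadcast to every row of the group: does this id have such a row?
keep = is_first_mdf.groupby(df["id"]).transform("any")

result = df[keep].reset_index(drop=True)
print(result)
```

On large frames this usually runs faster than `groupby(...).filter(...)` with a custom function, since the per-group work stays inside pandas.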