
How to filter repeat rows based on certain criteria

I have a dataframe that looks like this, but with a larger number of rows:

id         status       year
1           yes          2013
1           no           2013
1           yes          2014
3           no           2012
4           yes          2014
6           no           2014

I'd like to filter the dataframe so that if the id and year columns are the same between two rows but the status column differs, only the row with the 'yes' status remains. If there's a 'no' for an id and year combination that doesn't have a 'yes' associated with it, I'd still like to keep that row. This means I can't simply filter the status column down to the rows with 'yes'.

The resulting dataframe should look like this. The second row of the first dataframe is taken out because id 1 and year 2013 already have a 'yes' associated with them. The rows with ids 3 and 6 remain because there is no 'yes' associated with those id and year combinations:

id         status       year
1           yes          2013
1           yes          2014
3           no           2012
4           yes          2014
6           no           2014

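You can compute two conditions:

  1. One using groupby, transform and nunique, which is true for every row whose (id, year) group contains only one distinct status, and
  2. The other checking whether the status of the row is 'yes'

Combine the two masks with a logical OR (|) and use the result to filter df: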

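# True where the (id, year) group has a single distinct status
m1 = df.groupby(['id', 'year']).status.transform('nunique').eq(1)
# True where the row itself is a 'yes'
m2 = df.status.eq('yes')
df[m1 | m2]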

   id status  year
0   1    yes  2013
2   1    yes  2014
3   3     no  2012
4   4    yes  2014
5   6     no  2014
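
For anyone who wants to run the mask approach end to end, here is a minimal self-contained sketch. The sample frame is rebuilt by hand from the table in the question, and the intermediate name n_status is introduced only for illustration; it is not part of the original answer.

```python
import pandas as pd

# Sample frame rebuilt from the table in the question.
df = pd.DataFrame({
    "id": [1, 1, 1, 3, 4, 6],
    "status": ["yes", "no", "yes", "no", "yes", "no"],
    "year": [2013, 2013, 2014, 2012, 2014, 2014],
})

# Number of distinct status values in each (id, year) group,
# broadcast back onto every row of that group.
n_status = df.groupby(["id", "year"])["status"].transform("nunique")

m1 = n_status.eq(1)          # group has only one kind of status
m2 = df["status"].eq("yes")  # row itself is a 'yes'

# Keep a row if either condition holds.
result = df[m1 | m2]
print(result)
```

Only the (1, 2013) group has two distinct statuses, so it is the only group where m1 is False, and there m2 keeps the 'yes' row while the 'no' row is dropped.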

sort_values + drop_duplicates

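This is a good opportunity to use Categorical Data. You can sort by status and then remove duplicates by id and year. With 'yes' ordered before 'no', drop_duplicates keeps the 'yes' row of each group, since it keeps the first occurrence by default: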

import pandas as pd

# Order the categories so that 'yes' sorts before 'no'
df['status'] = pd.Categorical(df['status'], ordered=True, categories=['yes', 'no'])

res = df.sort_values('status').drop_duplicates(['id', 'year']).sort_index()

print(res)

   id status  year
0   1    yes  2013
2   1    yes  2014
3   3     no  2012
4   4    yes  2014
5   6     no  2014

Depending on your use case, the final sort by index may be unnecessary.
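
If you would rather not convert the status column in place, the same idea can be sketched with a temporary sort key. The helper name prefer_yes and the column name _order are made up for this example, and the sample frame is again rebuilt from the question.

```python
import pandas as pd

# Sample frame rebuilt from the table in the question.
df = pd.DataFrame({
    "id": [1, 1, 1, 3, 4, 6],
    "status": ["yes", "no", "yes", "no", "yes", "no"],
    "year": [2013, 2013, 2014, 2012, 2014, 2014],
})


def prefer_yes(frame):
    """Keep one row per (id, year), preferring 'yes' over 'no'."""
    # Ordered categorical used only as a sort key: 'yes' sorts before 'no'.
    order = pd.Categorical(frame["status"], categories=["yes", "no"], ordered=True)
    return (
        frame.assign(_order=order)
        .sort_values("_order", kind="mergesort")  # stable sort keeps original order within ties
        .drop_duplicates(["id", "year"])          # keeps the first row of each group
        .drop(columns="_order")
        .sort_index()
    )


res = prefer_yes(df)
print(res)
```

One difference from the mask approach is worth keeping in mind: drop_duplicates leaves exactly one row per id and year combination, so fully repeated rows (for example two identical 'yes' rows) are collapsed as well, whereas the mask approach keeps them.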
