How to filter repeat rows based on certain criteria

Question

I have a dataframe that looks like this, but with a larger number of rows:

id         status       year
1           yes          2013
1           no           2013
1           yes          2014
3           no           2012
4           yes          2014
6           no           2014

I'd like to filter the dataframe so that if two rows have the same id and year but a different status, only the row with the 'yes' status remains. If an id and year combination has a 'no' with no 'yes' associated with it, I'd still like to keep that row. This means I can't just filter the status column down to rows with 'yes'.

The resulting dataframe should look like this. The second row of the first dataframe is taken out because id 1 and year 2013 already has a 'yes' associated with it. The rows with ids 3 and 6 remain because there is no 'yes' associated with those id and year combinations:

id         status       year
1           yes          2013
1           yes          2014
3           no           2012
4           yes          2014
6           no           2014
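
For reproducibility, the sample frame above can be built roughly as follows. This is only a sketch: the column dtypes are assumed (integers for id and year, plain strings for status), and the name `only_yes` is introduced here for illustration. It also shows why filtering on status alone is not enough.

```python
import pandas as pd

# Sample data from the question (dtypes are an assumption)
df = pd.DataFrame({
    "id": [1, 1, 1, 3, 4, 6],
    "status": ["yes", "no", "yes", "no", "yes", "no"],
    "year": [2013, 2013, 2014, 2012, 2014, 2014],
})

# Naive approach: keep only the 'yes' rows.
# This also drops ids 3 and 6, whose (id, year) pairs have no 'yes' row at all.
only_yes = df[df["status"] == "yes"]
print(only_yes)
```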

Answer 1

You can compute two conditions:

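One using `groupby`, `transform` and `nunique`, flagging the id and year groups that contain a single distinct status, and
The other flagging the rows whose status is 'yes'

OR the two masks, and use the result to filter `df`: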

m1 = df.groupby(['id','year']).status.transform('nunique').eq(1) 
m2 = df.status.eq('yes')
df[m1 | m2]

   id status  year
0   1    yes  2013
2   1    yes  2014
3   3     no  2012
4   4    yes  2014
5   6     no  2014
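
To see how the two masks combine, here is a self-contained sketch that lays them out next to the data. The frame is rebuilt from the question, and the helper column names (`single_status`, `is_yes`, `keep`) are made up for illustration only.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 1, 3, 4, 6],
    "status": ["yes", "no", "yes", "no", "yes", "no"],
    "year": [2013, 2013, 2014, 2012, 2014, 2014],
})

# m1: True where the (id, year) group contains only one distinct status
m1 = df.groupby(["id", "year"])["status"].transform("nunique").eq(1)
# m2: True where the row's status is 'yes'
m2 = df["status"].eq("yes")

# Show both masks and their OR alongside the original rows
steps = df.assign(single_status=m1, is_yes=m2, keep=m1 | m2)
print(steps)

result = df[m1 | m2]
```

Only the row at index 1 fails both masks, so it is the only one dropped. Note that this approach keeps every row of a group that has a single distinct status, so exact repeats of the same id, year and status are not collapsed.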

Answer 2

`sort_values` + `drop_duplicates`

This is a good opportunity to use categorical data. With 'yes' ordered before 'no', you can sort by status and then remove duplicates by id and year, which keeps the 'yes' row whenever one exists:

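import pandas as pd

# Order the categories so that 'yes' sorts before 'no'
df['status'] = pd.Categorical(df['status'], ordered=True, categories=['yes', 'no'])

# drop_duplicates keeps the first row per (id, year), which is a 'yes' if one exists
res = df.sort_values('status').drop_duplicates(['id', 'year']).sort_index()

print(res)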

   id status  year
0   1    yes  2013
2   1    yes  2014
3   3     no  2012
4   4    yes  2014
5   6     no  2014

Depending on your use case, the final sort by index may be unnecessary.
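
As a rough illustration of what the final sort does, the sketch below runs the same steps on the sample data with and without `sort_index`. The frame is rebuilt from the question, the name `unsorted` is introduced here, and `kind='mergesort'` is an addition (not part of the answer above) used only so that the intermediate row order is deterministic.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 1, 3, 4, 6],
    "status": ["yes", "no", "yes", "no", "yes", "no"],
    "year": [2013, 2013, 2014, 2012, 2014, 2014],
})

# Make 'yes' sort before 'no'
df["status"] = pd.Categorical(df["status"], ordered=True, categories=["yes", "no"])

# Stable sort by status, then keep the first row per (id, year)
unsorted = df.sort_values("status", kind="mergesort").drop_duplicates(["id", "year"])
print(unsorted.index.tolist())  # all 'yes' rows come first, then the 'no' rows

# Restore the original row order
res = unsorted.sort_index()
print(res.index.tolist())
```

Both frames hold the same rows; only the order differs, so `sort_index` matters only if the original ordering is needed downstream.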
