I have a dataframe that looks like this, but with a larger number of rows:
id status year
1 yes 2013
1 no 2013
1 yes 2014
3 no 2012
4 yes 2014
6 no 2014
I'd like to filter the dataframe so that if the id and year columns are the same between two rows but the status column is different, only the row with the 'yes' status remains. If there's a 'no' for an id and year combination that doesn't have a 'yes' associated with it, I'd still like to keep that row. This means I can't just filter the status column down to the rows with 'yes'.
The resulting dataframe should look like this, where the second row of the first dataframe is removed because id 1 and year 2013 has a 'yes' associated with it. However, the rows with ids 3 and 6 remain because there is no 'yes' associated with those id and year combinations:
id status year
1 yes 2013
1 yes 2014
3 no 2012
4 yes 2014
6 no 2014
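You can compute two conditions: whether each (id, year) group contains only one distinct status, using groupby, transform and nunique, and whether the row's status is 'yes'. Then OR the two masks and filter df: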
m1 = df.groupby(['id','year']).status.transform('nunique').eq(1)
m2 = df.status.eq('yes')
df[m1 | m2]
id status year
0 1 yes 2013
2 1 yes 2014
3 3 no 2012
4 4 yes 2014
5 6 no 2014
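For anyone who wants to run this end to end, here is a minimal self-contained sketch. The frame is rebuilt from the sample in the question, and the names single_status, is_yes and result are my own labels rather than anything from the answer above.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "id": [1, 1, 1, 3, 4, 6],
    "status": ["yes", "no", "yes", "no", "yes", "no"],
    "year": [2013, 2013, 2014, 2012, 2014, 2014],
})

# True where the (id, year) group holds only one distinct status value.
single_status = df.groupby(["id", "year"])["status"].transform("nunique").eq(1)
# True where the row itself is a 'yes'.
is_yes = df["status"].eq("yes")

# Keep a row if its group is unambiguous, or if it is the 'yes' row.
result = df[single_status | is_yes]
print(result)
```

Note that this only drops 'no' rows that compete with a 'yes'. If an id and year combination has several rows with the same status, all of them are kept.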
sort_values + drop_duplicates
This is a good opportunity to use Categorical Data. You can sort by status and then remove duplicates by id and year:
df['status'] = pd.Categorical(df['status'], ordered=True, categories=['yes', 'no'])
res = df.sort_values('status').drop_duplicates(['id', 'year']).sort_index()
print(res)
id status year
0 1 yes 2013
2 1 yes 2014
3 3 no 2012
4 4 yes 2014
5 6 no 2014
Depending on your use case, the final sort by index may be unnecessary.
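If you would rather not overwrite the status column in place, a variation along these lines should work. This is a sketch under my own assumptions: the sample frame is rebuilt from the question, and status_order is a name I introduced for the ordered dtype.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "id": [1, 1, 1, 3, 4, 6],
    "status": ["yes", "no", "yes", "no", "yes", "no"],
    "year": [2013, 2013, 2014, 2012, 2014, 2014],
})

# Ordered categorical dtype in which 'yes' sorts before 'no'.
status_order = pd.CategoricalDtype(categories=["yes", "no"], ordered=True)

res = (
    df.assign(status=df["status"].astype(status_order))  # leaves df untouched
      .sort_values("status")            # 'yes' rows come first
      .drop_duplicates(["id", "year"])  # keep the first row per (id, year)
      .sort_index()                     # restore the original row order
)
print(res)
```

Unlike the mask approach, this keeps exactly one row per id and year combination, so repeated rows with the same status are collapsed as well. The status column in res is categorical; the original df is unchanged.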