简体   繁体   中英

Remove duplicates and keep row that certain column is Yes in a pandas dataframe

I have a dataframe with duplicated values on column "ID", like this one:

ID  Name    Street  Birth       Job     Primary?
1   Fake1   Street1 2000-01-01  Job1    Yes
2   Fake2   Street2 2000-01-02  Job2    No
3   Fake3   Street3 2000-01-03  Job3    Yes
1   Fake1   Street1 2000-01-01  Job4    No
2   Fake2   Street2 2000-01-02  Job5    Yes
4   Fake4   Street4 2000-01-03  Job6    Yes
1   Fake1   Street1 2000-01-01  Job7    No

I need a way to remove duplicates (by "ID") but keep the ones that the column Primary is "Yes" (all unique values have "Yes" in that column and duplicated values have one record as "Yes" and all others as "No") resulting in this dataframe:

ID  Name    Street  Birth       Job     Primary?
1   Fake1   Street1 2000-01-01  Job1    Yes
3   Fake3   Street3 2000-01-03  Job3    Yes
2   Fake2   Street2 2000-01-02  Job5    Yes
4   Fake4   Street4 2000-01-03  Job6    Yes

What is the best way to do it?

Thanks!

Use DataFrame.sort_values - Yes rows are in end of DataFrame, so possible use DataFrame.drop_duplicates with keep='last' - this solution should return Primary?=No if exist some ID without Primary?=Yes values:

df = df.sort_values('Primary?').drop_duplicates('ID', keep='last')
print (df)
   ID   Name   Street       Birth   Job Primary?
0   1  Fake1  Street1  2000-01-01  Job1      Yes
2   3  Fake3  Street3  2000-01-03  Job3      Yes
4   2  Fake2  Street2  2000-01-02  Job5      Yes
5   4  Fake4  Street4  2000-01-03  Job6      Yes

Using groupby.idxmax on a boolean Series derived from the "Primary?" column:

out = df.loc[df['Primary?'].eq('Yes').groupby(df['ID']).idxmax()]

output:

   ID   Name   Street       Birth   Job Primary?
0   1  Fake1  Street1  2000-01-01  Job1      Yes
4   2  Fake2  Street2  2000-01-02  Job5      Yes
2   3  Fake3  Street3  2000-01-03  Job3      Yes
5   4  Fake4  Street4  2000-01-03  Job6      Yes
  1. Filter the Primary column from dataframe for 'yes':
 df = df[df['Primary?']=='yes']
  1. Then, drop duplicates from the filtered dataframe.

df = df.drop_duplicates(subset= ['ID'])

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM