
Python: drop duplicates in a custom priority order (not `first` or `last`)

ID  values
111 reason1
111 reason2
111 reason3
222 reason2
222 reason4
222 reason5

df.drop_duplicates(["ID"], keep='???', inplace=True)

The approach I know is drop_duplicates, but its keep argument only offers 'first' and 'last'. I want to keep rows according to a specific priority: if an ID has a row with reason2, keep that row; otherwise fall back to reason3, and so on. In other words, there is a fixed order of preference, such as reason2, reason3, reason4, etc.

Based on the comments, one possible implementation (following @brittenb's idea) is:

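# Lower number = higher priority; change the numbers to match the order you need.
priority_dict = {
    'reason1': 1,
    'reason2': 2,
    'reason3': 3,
    'reason4': 4,
    'reason5': 5,
}
df['priority'] = df['values'].map(priority_dict)
# Sort so the highest-priority row comes first within each ID, then keep that row.
df = df.sort_values(by=['ID', 'priority'])
df.drop_duplicates(['ID'], keep='first')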

Output:

    ID   values  priority
0  111  reason1         1
3  222  reason2         2
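
For a self-contained check, the same mapping idea can be sketched with the order from the question (reason2 first, then reason3, and so on). Placing reason1 last is an assumption, since the question does not say where it belongs, and the names `priority_order` and `result` are introduced here only for illustration.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "ID": [111, 111, 111, 222, 222, 222],
    "values": ["reason1", "reason2", "reason3", "reason2", "reason4", "reason5"],
})

# Assumed preference order: earlier in the list wins (reason1 last is a guess).
priority_order = ["reason2", "reason3", "reason4", "reason5", "reason1"]
priority_dict = {reason: rank for rank, reason in enumerate(priority_order)}

result = (
    df.assign(priority=df["values"].map(priority_dict))  # temporary helper column
      .sort_values(["ID", "priority"])                   # best-ranked row first per ID
      .drop_duplicates("ID", keep="first")               # keep that best-ranked row
      .drop(columns="priority")                          # remove the helper column
)
print(result)
```

Values missing from the mapping become NaN in the helper column, and sort_values places NaN last by default, so such rows are kept only when an ID has no ranked reason at all.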

Alternatively, convert the column to an ordered 'category' dtype whose categories are listed in the desired priority order, then sort:

# Ordered categorical: categories are listed from highest to lowest priority.
df['values'] = df['values'].astype('category')\
                           .cat.reorder_categories(['reason2',
                                                    'reason3',
                                                    'reason1',
                                                    'reason4',
                                                    'reason5'],
                                                   ordered=True)

df.sort_values('values').drop_duplicates('ID', keep='first')

Output:

    ID   values
1  111  reason2
3  222  reason2
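
If changing the dtype of the column is undesirable, a variant worth trying is the `key` argument of sort_values (available in pandas 1.1 and later), which ranks the values only for the duration of the sort. This is a sketch of a different technique from the categorical answer above, not a restatement of it. The sample data is hypothetical and extends the question's table so that some IDs lack reason2; the names `order`, `rank`, and `kept` are also made up for illustration.

```python
import pandas as pd

# Hypothetical sample: ID 222 has no reason2, ID 333 has neither reason2 nor reason4.
df = pd.DataFrame({
    "ID": [111, 111, 111, 222, 222, 333, 333],
    "values": ["reason1", "reason2", "reason3", "reason5", "reason4",
               "reason1", "reason3"],
})

# Same priority order as the categorical example: earlier means preferred.
order = ["reason2", "reason3", "reason1", "reason4", "reason5"]
rank = {reason: position for position, reason in enumerate(order)}

kept = (
    # key maps each value to its rank just for sorting; mergesort is stable.
    df.sort_values("values", key=lambda s: s.map(rank), kind="mergesort")
      .drop_duplicates("ID", keep="first")  # first row per ID is the best-ranked one
      .sort_index()                         # restore the original row order
)
print(kept)
```

Each ID falls back to the best reason it actually has, and the original `values` column is left as it was.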


 