简体   繁体   中英

create new df column based on grouping and conditions in PYTHON

my dataframe looks like the bottom and i want to create a new column called "remove" that would be true/false based on the following conditions.

i only want to keep one row for each "url" and want to determine which to keep based on this order:

  • keep the one with a higher score
  • if score is same then go to track "gre" and "tca" are favored over "oke"
  • if they are equal at the track then go to "size" and take the bigger one

I would like the new column to be true or false based on if it should be removed or not.

thanks in advance!!

name age track score size url url_count
0 timmy 10 gre 0.982 41 target.com 2
1 chester 12 tca 0.251 12 michaels.com 3
2 trixie 10 oke 0.0879 15 macys.com 2
3 aj 9 gre 0.985 20 michaels.com 3
4 cosmo 150 tca 0.982 328 target.com 2
5 wanda 145 gre 0.879 550 macys.com 2
6 jorgen 500 oke 0.432 51 michaels.com 3

Use sort_values with your sort criteria and a custom_sort as the key re-order your rows with the first being the highest ranked. Then we use duplicated on only the url column to flag all but the first instance of a url as to be removed:

def custom_sort(x):
    if x.name == 'track':
        return x.map({'gre': 0, 'tca': 0, 'oke': 1})
    
    else:
        return x

df['remove'] = df.sort_values(['url', 'score', 'track', 'size'], key = custom_sort, ascending = False).duplicated(subset = 'url')

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2025 STACKOOM.COM