
Sort column in pandas dataframe after rarity of values within groups

I have a pandas dataframe of scraped websites with a website identifier, a text and a label for each website. A small number of websites have two labels, but since I first want to train a single-label classifier, I would like to create a version of the data with only one label per website (I'm aware that this is slightly problematic). The labels in my dataset are unbalanced: some occur very often and some are very rare. When I remove duplicate website IDs, I would like to drop the very common labels first and keep the rare ones. This is what my dataset with several labels looks like:

ID   Label   Text
1    a       some text
1    b       other text
1    a       data
2    a       words
2    c       more words
3    a       text
3    b       short text

My idea was to sort the label column within every website identifier by rarity of the label. For that I would first do value_counts(ascending = True) on the label column, to get a list of all labels sorted by rarity.

to_sort = ['c', 'b', 'a']

I then would like to use that list to sort within every website ID by rarity. I'm not sure how to do that, though. The result should look like this:

ID   Label   Text
1    b       other text
1    a       some text
1    a       data
2    c       more words
2    a       words
3    b       short text
3    a       text

I would then use df.drop_duplicates(subset='ID', keep='first') to keep the rarest label for each ID. How can I do the sorting?

Use an ordered categorical, so that sort_values follows the order of your list:

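import pandas as pd

# Rarest label first
to_sort = list('cba')

# An ordered categorical sorts by category order, not alphabetically
df['Label'] = pd.Categorical(df['Label'], ordered=True, categories=to_sort)

df = df.sort_values(['ID', 'Label'])
print(df)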
   ID Label        Text
1   1     b  other text
0   1     a   some text
2   1     a        data
4   2     c  more words
3   2     a       words
6   3     b  short text
5   3     a        text
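
To finish the task from the question, the sorted frame can then be passed to drop_duplicates. The following is only a minimal sketch on the sample data from the question; the name `single` is an assumption chosen for illustration, and `to_sort` is hard-coded as above.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2, 3, 3],
    "Label": ["a", "b", "a", "a", "c", "a", "b"],
    "Text": ["some text", "other text", "data", "words",
             "more words", "text", "short text"],
})

# Rarest label first, as in the answer above.
to_sort = list("cba")
df["Label"] = pd.Categorical(df["Label"], categories=to_sort, ordered=True)

# Sort by ID, then by rarity within each ID, and keep the first
# (rarest) row per ID: (1, b), (2, c), (3, b).
single = (
    df.sort_values(["ID", "Label"])
      .drop_duplicates(subset="ID", keep="first")
      .reset_index(drop=True)
)
print(single)
```

If an ID has several rows with the same rarest label, only the first of them survives, so the Text that is kept depends on the original row order.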

You can achieve your goal by making the Label column an ordered Categorical and then sorting by ID and Label. Let's see it in practice.

import pandas as pd
df = pd.DataFrame( {'ID': [1,1,1,2,2,3,3], "Label": ["a", "b", "a", "a", "c", "a", "b"],
                   'Text': ["some text", "other text","data", "words", "more words", "text", "short text"]} )
df
   ID Label        Text
0   1     a   some text
1   1     b  other text
2   1     a        data
3   2     a       words
4   2     c  more words
5   3     a        text
6   3     b  short text

Define the order of your labels, rarest first:

to_sort = df.Label.value_counts(ascending = True).index
to_sort
Index(['c', 'b', 'a'], dtype='object')

Then make the Label column an ordered Categorical like this:

df.Label = pd.Categorical(df.Label,categories = to_sort, ordered = True)
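
Why this works: an ordered categorical is compared by the position of each value in `categories`, not alphabetically. A small sketch to illustrate, using a throwaway series `s` that is not part of the original data:

```python
import pandas as pd

# Categories listed rarest first: 'c' gets position 0, 'b' 1, 'a' 2.
s = pd.Series(pd.Categorical(["a", "b", "c"],
                             categories=["c", "b", "a"],
                             ordered=True))

codes = s.cat.codes.tolist()          # positions within `categories`
in_order = s.sort_values().tolist()   # sorted by those positions

print(codes)     # [2, 1, 0]
print(in_order)  # ['c', 'b', 'a']
```

Sorting on such a column therefore puts the rarest label first, even though 'a' would come first alphabetically.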

Finally, sort by ID and Label:

df.sort_values(["ID", "Label"]).reset_index(drop = True)

   ID Label        Text
0   1     b  other text
1   1     a   some text
2   1     a        data
3   2     c  more words
4   2     a       words
5   3     b  short text
6   3     a        text
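
If you would rather not change the dtype of the Label column, a possible alternative (not what the answers above do) is to map each label to its frequency and pick the row with the lowest frequency per ID. This is a sketch under the assumption that the data is the sample from the question; `counts`, `rarest_idx` and `single` are names made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2, 3, 3],
    "Label": ["a", "b", "a", "a", "c", "a", "b"],
    "Text": ["some text", "other text", "data", "words",
             "more words", "text", "short text"],
})

# Frequency of each row's label in the whole dataset.
counts = df["Label"].map(df["Label"].value_counts())

# For every ID, the index of the row whose label is rarest
# (the first such row if several share the lowest frequency).
rarest_idx = counts.groupby(df["ID"]).idxmin()

# One row per ID; the Label column is left as plain strings.
single = df.loc[rarest_idx].reset_index(drop=True)
print(single)
```

This goes straight to one row per ID and skips the intermediate sorted frame, so it only fits if you do not need the sorted view itself.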
