Pandas DataFrame sort by categorical column but by specific class ordering

Question

I would like to select the top entries in a Pandas DataFrame based on the values of a specific column, using df_selected = df_targets.head(N).

Each entry has a target value (in order of importance):

Likely Supporter, GOTV, Persuasion, Persuasion+GOTV

Unfortunately, if I do

df_targets = df_targets.sort_values("target")

the ordering will be alphabetical (GOTV, Likely Supporter, ...).

I was hoping for a keyword like list_ordering, as in:

my_list = ["Likely Supporter", "GOTV", "Persuasion", "Persuasion+GOTV"]
df_targets = df_targets.sort_values("target", list_ordering=my_list)

To work around this, I create a dictionary that prefixes each label with its rank:

from collections import OrderedDict

dict_targets = OrderedDict()
dict_targets["Likely Supporter"] = "0 Likely Supporter"
dict_targets["GOTV"] = "1 GOTV"
dict_targets["Persuasion"] = "2 Persuasion"
dict_targets["Persuasion+GOTV"] = "3 Persuasion+GOTV"

This works, but it seems like a non-Pythonic approach.
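For reference, the rank-mapping idea can be sketched without renaming the labels. This is only an illustration under assumptions: the sample rows are made up, and the key argument of sort_values requires pandas 1.1 or later.

```python
import pandas as pd

my_list = ["Likely Supporter", "GOTV", "Persuasion", "Persuasion+GOTV"]

# Map each label to its position in the preferred ordering.
rank = {label: position for position, label in enumerate(my_list)}

# Hypothetical sample data, not taken from the original post.
df_targets = pd.DataFrame({
    "target": ["GOTV", "Persuasion", "Likely Supporter", "Persuasion+GOTV"],
})

# Sort by the mapped rank; the labels themselves stay unchanged.
df_targets = df_targets.sort_values("target", key=lambda s: s.map(rank))
ordered_labels = df_targets["target"].tolist()
```

The column keeps its original dtype here, so the custom order has to be passed again on every sort.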

Suggestions would be much appreciated!

Answer 1

I think you need a Categorical with the parameter ordered=True; sorting with sort_values then works nicely.

See the documentation for Categorical:

Ordered Categoricals can be sorted according to the custom order of the categories and can have a min and max value.

import pandas as pd

df = pd.DataFrame({'a': ['GOTV', 'Persuasion', 'Likely Supporter', 
                         'GOTV', 'Persuasion', 'Persuasion+GOTV']})

df.a = pd.Categorical(df.a, 
                      categories=["Likely Supporter","GOTV","Persuasion","Persuasion+GOTV"],
                      ordered=True)

print(df)
                  a
0              GOTV
1        Persuasion
2  Likely Supporter
3              GOTV
4        Persuasion
5   Persuasion+GOTV

print(df.a)
0                GOTV
1          Persuasion
2    Likely Supporter
3                GOTV
4          Persuasion
5     Persuasion+GOTV
Name: a, dtype: category
Categories (4, object): [Likely Supporter < GOTV < Persuasion < Persuasion+GOTV]

df.sort_values('a', inplace=True)
print(df)
                  a
2  Likely Supporter
0              GOTV
3              GOTV
1        Persuasion
4        Persuasion
5   Persuasion+GOTV
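To tie this back to the question, here is a minimal sketch of selecting the top N rows after sorting, plus the min and max mentioned in the documentation quote. The "name" column and its values are hypothetical, and kind="mergesort" is an optional choice that keeps the original row order within each target value.

```python
import pandas as pd

# Hypothetical sample data; the column name "target" follows the question.
df_targets = pd.DataFrame({
    "name": ["a", "b", "c", "d", "e", "f"],
    "target": ["GOTV", "Persuasion", "Likely Supporter",
               "GOTV", "Persuasion", "Persuasion+GOTV"],
})

my_list = ["Likely Supporter", "GOTV", "Persuasion", "Persuasion+GOTV"]
df_targets["target"] = pd.Categorical(df_targets["target"],
                                      categories=my_list, ordered=True)

# A stable sort keeps ties in their original order.
df_sorted = df_targets.sort_values("target", kind="mergesort")

# Take the N most important rows (N = 3 in this sketch).
df_selected = df_sorted.head(3)

# An ordered Categorical also supports min and max.
lowest = df_targets["target"].min()
highest = df_targets["target"].max()
```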

Answer 2

The astype("category", categories=..., ordered=True) call shown in my earlier answer is now deprecated.

Instead, it is best to use pandas.Categorical.

So:

list_ordering = ["Likely Supporter", "GOTV", "Persuasion", "Persuasion+GOTV"]
df["target"] = pd.Categorical(df["target"], categories=list_ordering, ordered=True)
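One thing to watch with this approach: values that are not listed in categories become NaN. The sketch below illustrates that, assuming a made-up label "Unknown" that is not part of the ordering.

```python
import pandas as pd

list_ordering = ["Likely Supporter", "GOTV", "Persuasion", "Persuasion+GOTV"]

# "Unknown" is a hypothetical label that is absent from list_ordering.
df = pd.DataFrame({"target": ["GOTV", "Unknown", "Likely Supporter", "Persuasion"]})
df["target"] = pd.Categorical(df["target"], categories=list_ordering, ordered=True)

# The unlisted label was converted to a missing value.
n_missing = int(df["target"].isna().sum())

# Sorting follows list_ordering; missing values go last by default.
sorted_values = df.sort_values("target")["target"].tolist()
```

Checking isna() right after the conversion is a cheap way to catch misspelled labels.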

Answer 3

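This is probably the most convenient option in one particular situation, described below. Suppose this is your preferred ordering:

my_order = ["Likely Supporter", "GOTV", "Persuasion", "Persuasion+GOTV"]

Then just do:

df['Column_to_update'] = df['Column_to_update'].cat.reorder_categories(my_order, ordered=True)

It is flexible and there is no need to build a new Categorical. However, the column must already have dtype 'category', and my_order must contain exactly the existing categories; otherwise it will not work. Note that the inplace=True argument of reorder_categories has been removed in pandas 2.0, so the result is assigned back to the column.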

Read more here (Pandas documentation)
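A minimal sketch of the dtype requirement, using made-up sample rows: a plain object column is first converted with astype("category"), which infers the categories in lexical order, and only then reordered.

```python
import pandas as pd

my_order = ["Likely Supporter", "GOTV", "Persuasion", "Persuasion+GOTV"]

# Hypothetical sample data containing every label exactly once.
df = pd.DataFrame({
    "Column_to_update": ["GOTV", "Persuasion", "Likely Supporter", "Persuasion+GOTV"],
})

# An object column has no .cat accessor, so convert it first.
col = df["Column_to_update"].astype("category")

# Categories inferred from the data are sorted lexically.
default_order = list(col.cat.categories)

# reorder_categories needs my_order to match the existing categories exactly.
df["Column_to_update"] = col.cat.reorder_categories(my_order, ordered=True)
result = df.sort_values("Column_to_update")["Column_to_update"].tolist()
```

Because the categories are inferred from the values present, this only works when every label in my_order actually occurs in the column.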

Answer 4

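Thanks to jezrael's input and references,

I like this concise solution:

list_ordering = ["Likely Supporter", "GOTV", "Persuasion", "Persuasion+GOTV"]

# The original one-liner, astype("category", categories=..., ordered=True),
# no longer works in recent pandas; passing a CategoricalDtype is the equivalent.
df["target"] = df["target"].astype(pd.CategoricalDtype(categories=list_ordering, ordered=True))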
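A minimal sketch of the astype route with a pd.CategoricalDtype object, which recent pandas versions expect in place of the categories= and ordered= keywords. The sample rows and the name target_dtype are illustrative assumptions.

```python
import pandas as pd

list_ordering = ["Likely Supporter", "GOTV", "Persuasion", "Persuasion+GOTV"]

# Build the dtype once; it can be reused for several columns or frames.
target_dtype = pd.CategoricalDtype(categories=list_ordering, ordered=True)

# Hypothetical sample data.
df = pd.DataFrame({"target": ["Persuasion", "GOTV", "Likely Supporter"]})
df["target"] = df["target"].astype(target_dtype)

is_ordered = df["target"].cat.ordered

# Ordered categoricals support comparisons against a listed label.
after_gotv = df[df["target"] > "GOTV"]["target"].tolist()

sorted_values = df.sort_values("target")["target"].tolist()
```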

solution1
24 2016-08-30 09:15:30

solution2
0 2017-11-22 17:47:56

solution3
0 2021-04-30 10:20:55

solution4
-1 2016-08-30 09:57:46
