Sort a pandas DataFrame in a custom way

Question

I have tried many ways to sort a DataFrame column in a custom order, but could not get it right. Please refer to the code below and let me know what additional syntax is needed to do the job.

import pandas as pd

df = pd.DataFrame({'TC': {0: '1-1.1', 1: '1-1.2', 2: '1-10.1', 3: '1-10.2', 4: '1-2.1', 5: '1-2.1', 6: '1-2.2', 7: '1-20.1', 8: '1-20.2', 9: '1-3.1'}, 'Case': {0: 'A', 1: 'B', 2: 'C', 3: 'D', 4: 'E', 5: 'F', 6: 'G', 7: 'H', 8: 'I', 9: 'J'}})
df.sort_values(["TC"], ascending=[True])
print(df)

This code does not give the desired output. I need the DataFrame sorted so that the numeric parts of TC are compared as numbers (for example, 1-2.1 before 1-10.1).

Answer 1

You can extract the numbers from each ID and combine them into a tuple, then sort that Series and use its index to reindex your original DataFrame.

>>> df.reindex(
        df['TC'].str.extractall(r'(\d+)')
                .unstack().astype(int)
                .agg(tuple, 1).sort_values()
                .index
    )

       TC Case
0   1-1.1    A
1   1-1.2    B
4   1-2.1    E
5   1-2.1    F
6   1-2.2    G
9   1-3.1    J
2  1-10.1    C
3  1-10.2    D
7  1-20.1    H
8  1-20.2    I
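To see why this works, here is a minimal sketch that runs the same chain one step at a time on a smaller frame. The intermediate names (matches, wide, keys, order) are only for illustration and are not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({
    "TC": ["1-1.1", "1-10.1", "1-2.1", "1-3.1"],
    "Case": ["A", "C", "E", "J"],
})

# One row per regex match, indexed by (original row label, match number).
matches = df["TC"].str.extractall(r"(\d+)")

# Move the match number into the columns, so there is one row per ID again,
# and convert the digit strings to integers.
wide = matches.unstack().astype(int)

# Collapse each row into a tuple such as (1, 10, 1).
# Tuples compare element by element, so (1, 2, 1) < (1, 10, 1).
keys = wide.agg(tuple, axis=1)

# Sort the tuples and reuse the resulting index order on the original frame.
order = keys.sort_values().index
result = df.reindex(order)
print(result)
```

The astype(int) step assumes every ID contains the same number of digit groups; an ID with fewer groups would leave NaN after unstack and the conversion would fail.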

You can also use the key argument of sort_values (available since pandas 1.1):

>>> df.sort_values('TC', 
        key=lambda ser:
           ser.str.extractall(r'(\d+)')
              .unstack()
              .astype(int).agg(tuple, 1)
    )
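The key callable receives the whole column as a Series and has to return a Series of the same length. As a rough sketch of that contract, the same idea can be written with the standard re module; natural_key is a hypothetical helper name, not a pandas function.

```python
import re

import pandas as pd


def natural_key(series: pd.Series) -> pd.Series:
    """Map each ID string to a tuple of its integer parts, e.g. '1-10.1' -> (1, 10, 1)."""
    return series.map(lambda s: tuple(int(n) for n in re.findall(r"\d+", s)))


df = pd.DataFrame({
    "TC": ["1-10.1", "1-2.2", "1-2.1", "1-1.1"],
    "Case": ["C", "G", "E", "A"],
})

# sort_values calls natural_key once with the full 'TC' column.
sorted_df = df.sort_values("TC", key=natural_key)
print(sorted_df)
```

Unlike the extractall version, this sketch does not require every ID to have the same number of numeric parts, since tuples of different lengths still compare.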

If an ID always has three numeric parts, you can use Series.str.split on non-numeric characters with expand=True instead of extractall, which removes the need for unstack:

>>> df.sort_values('TC', 
         key=lambda series:
             series.str.split(r'\D+', expand=True)
                   .astype(int).agg(tuple,1)
    )
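Since the split already yields one integer column per part, a variant (a sketch, not something the answer above proposes) is to skip the tuples and sort those columns directly, then select the original rows in that order. The names parts and order are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "TC": ["1-10.1", "1-2.2", "1-2.1", "1-1.2"],
    "Case": ["C", "G", "E", "B"],
})

# Split on runs of non-digits; expand=True gives one column per part (0, 1, 2).
parts = df["TC"].str.split(r"\D+", expand=True).astype(int)

# Sort by the integer columns from left to right, then pull rows from df
# in that order using the sorted index labels.
order = parts.sort_values(by=list(parts.columns)).index
result = df.loc[order]
print(result)
```

This keeps everything in integer columns rather than Python tuples; it assumes the index labels of df are unique so that df.loc[order] selects each row exactly once.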

Timings:

>>> %timeit df.reindex(df['TC'].str.extractall(r'(\d+)').unstack().astype(int).agg(tuple, 1).sort_values().index)
2.95 ms ± 40.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

>>> %timeit df.sort_values('TC', key=lambda ser: ser.str.extractall(r'(\d+)').unstack().astype(int).agg(tuple, 1))
2.91 ms ± 32.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

>>> %timeit df.sort_values('TC', key=lambda series:series.str.split(r'\D+', expand=True).astype(int).agg(tuple,1))
1.6 ms ± 5.88 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
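The timings above are for the 10-row example frame. To check how the approaches compare on your own data outside IPython, something like the following sketch with the standard timeit module could be used. The 1000-row random frame and the function names by_extractall and by_split are assumptions made for illustration, and no particular numbers are implied.

```python
import random
import timeit

import pandas as pd

# Hypothetical test data: 1000 IDs in the same "a-b.c" shape as the question.
random.seed(0)
ids = [f"1-{random.randint(1, 50)}.{random.randint(1, 9)}" for _ in range(1000)]
big = pd.DataFrame({"TC": ids})


def by_extractall(frame):
    key = lambda ser: ser.str.extractall(r"(\d+)").unstack().astype(int).agg(tuple, axis=1)
    return frame.sort_values("TC", key=key)


def by_split(frame):
    key = lambda ser: ser.str.split(r"\D+", expand=True).astype(int).agg(tuple, axis=1)
    return frame.sort_values("TC", key=key)


# Time each approach a few times and report the average per call.
runs = 5
for func in (by_extractall, by_split):
    seconds = timeit.timeit(lambda: func(big), number=runs)
    print(f"{func.__name__}: {seconds / runs * 1e3:.2f} ms per call")
```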

Answer 2

I would do it this way, using a temporary helper column. I think this would be faster.

df["range"] = df["TC"].apply(lambda x: [float(y) for y in x.split("-")])
df = df.sort_values(["range"], ascending=True).drop(["range"], axis="columns")
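One thing to keep in mind with the float-based key: float() reads the part after the dash as a decimal number, so '1-1.10' and '1-1.1' get the same key. Whether that matters depends on the data; if the last part is a counter rather than a decimal (an assumption, the question does not say), an integer tuple avoids the problem. The helper names float_key and int_key below are illustrative only.

```python
import re


def float_key(tc):
    """Key as used above: '1-1.10' -> [1.0, 1.1]."""
    return [float(y) for y in tc.split("-")]


def int_key(tc):
    """Alternative key: '1-1.10' -> (1, 1, 10)."""
    return tuple(int(n) for n in re.findall(r"\d+", tc))


ids = ["1-1.2", "1-1.10", "1-1.1"]

by_float = sorted(ids, key=float_key)
by_int = sorted(ids, key=int_key)

print(by_float)  # '1-1.10' and '1-1.1' tie, and both come before '1-1.2'
print(by_int)    # '1-1.10' comes last, after '1-1.2'
```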

EDIT: Since you asked about the case where the format is 1_1_2 instead of 1-1.2, I would do it this way:

df["range"] = df["TC"].apply(lambda x: tuple(x.split("_")))
df["range"] = df["range"].apply(lambda x: [float(x[0]), float("{}.{}".format(x[1], x[2]))])
df = df.sort_values(["range"], ascending=True).drop(["range"], axis="columns")

Answer 3

I have written a sort() function that should solve your problem.

import pandas as pd

df = pd.DataFrame({'TC': {0: '1-1.1', 1: '1-1.2', 2: '1-10.1', 3: '1-10.2', 4: '1-2.1', 5: '1-2.1', 6: '1-2.2', 7: '1-20.1', 8: '1-20.2', 9: '1-3.1'}, 'Case': {0: 'A', 1: 'B', 2: 'C', 3: 'D', 4: 'E', 5: 'F', 6: 'G', 7: 'H', 8: 'I', 9: 'J'}})

def sort(df):
    # Drop the leading '1-' (assumes every TC starts with it) and parse the rest as a float
    listTC = []
    for i in df['TC']:
        listTC.append(float(i[2:]))
    df1 = pd.DataFrame(list(zip(listTC, list(df['Case']))), columns=['TC', 'Case'])
    df_f = df1.sort_values(by=['TC'])
    # Rebuild the original string form by putting the '1-' prefix back
    listTC_final = []
    for i in df_f['TC']:
        listTC_final.append('1-' + str(i))
    df_Final = pd.DataFrame(list(zip(listTC_final, list(df_f['Case']))), columns=['TC', 'Case'])
    return df_Final

print(sort(df))

If you still have any questions, let me know. Thanks.

3 answers

solution1
5 ACCEPTED 2021-02-07 16:18:13

solution2
1 2021-02-07 17:07:08

solution3
0 2021-02-07 16:53:49
