
Sort a pandas DataFrame in a custom order

I have tried many ways to sort a DataFrame column in a custom order, but I could not get it right. Please look at the code below and tell me what additional syntax I need to do the job.

df = pd.DataFrame({'TC': {0: '1-1.1', 1: '1-1.2', 2: '1-10.1', 3: '1-10.2', 4: '1-2.1', 5: '1-2.1', 6: '1-2.2', 7: '1-20.1', 8: '1-20.2', 9: '1-3.1'}, 'Case': {0: 'A', 1: 'B', 2: 'C', 3: 'D', 4: 'E', 5: 'F', 6: 'G', 7: 'H', 8: 'I', 9: 'J'}})
df.sort_values(["TC"], ascending=[True])
print (df)

This code does not give the desired output. I need the DataFrame sorted as shown below.

[Image of the desired output]

You can extract the numbers, form a tuple from them, then sort that Series and use its index to reindex your original DataFrame.

>>> df.reindex(
        df['TC'].str.extractall(r'(\d+)')
                .unstack().astype(int)
                .agg(tuple, 1).sort_values()
                .index
    )

       TC Case
0   1-1.1    A
1   1-1.2    B
4   1-2.1    E
5   1-2.1    F
6   1-2.2    G
9   1-3.1    J
2  1-10.1    C
3  1-10.2    D
7  1-20.1    H
8  1-20.2    I
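
To see why the tuple key helps, here is a minimal plain-Python sketch (no pandas) comparing string order with integer-tuple order. The helper name numeric_key and the shortened label list are made up for illustration.

```python
import re

def numeric_key(label):
    """Return the digit runs in `label` as a tuple of ints, e.g. '1-10.2' -> (1, 10, 2)."""
    return tuple(int(part) for part in re.findall(r"\d+", label))

labels = ["1-1.1", "1-10.1", "1-2.1", "1-20.2", "1-3.1"]

# Strings compare character by character, so '1-10.1' sorts before '1-2.1'.
string_order = sorted(labels)
print(string_order)

# Integer tuples compare part by part as numbers, so (1, 2, 1) < (1, 10, 1).
numeric_order = sorted(labels, key=numeric_key)
print(numeric_order)
```

The pandas chain above builds the same kind of tuple for every row, just in a vectorised way.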

You can also use the key argument of sort_values (available since pandas 1.1.0):

>>> df.sort_values('TC', 
        key=lambda ser:
           ser.str.extractall(r'(\d+)')
              .unstack()
              .astype(int).agg(tuple, 1)
    )
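
If the key is needed in more than one place, it can be pulled out into a named function. The following is only a sketch: natural_key is a hypothetical helper name, the small DataFrame is made up, and it uses Series.map with re.findall instead of extractall. Note also that sort_values returns a new DataFrame rather than sorting in place, so the result has to be assigned.

```python
import re
import pandas as pd

def natural_key(ser):
    """Map each string in the Series to a tuple of the integers it contains."""
    return ser.map(lambda s: tuple(int(n) for n in re.findall(r"\d+", s)))

df = pd.DataFrame({
    "TC": ["1-1.1", "1-10.1", "1-2.1", "1-20.2", "1-3.1"],
    "Case": list("ABCDE"),
})

# sort_values does not modify df; keep the returned DataFrame.
result = df.sort_values("TC", key=natural_key)
print(result)
```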

If an ID always has three parts, you can use Series.str.split on non-numeric characters with expand=True instead of extractall, which removes the need for unstack:

>>> df.sort_values('TC', 
         key=lambda series:
             series.str.split(r'\D+', expand=True)
                   .astype(int).agg(tuple,1)
    )
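
To make the intermediate steps of the split variant visible, here is a small sketch on a made-up three-element Series. It assumes every ID has the same number of parts. If some rows had fewer, expand=True would pad them with missing values and astype(int) would likely fail.

```python
import pandas as pd

tc = pd.Series(["1-1.1", "1-10.2", "1-2.1"])

# Split on runs of non-digit characters; expand=True gives one column per part.
parts = tc.str.split(r"\D+", expand=True).astype(int)
print(parts)

# Collapse each row into a tuple that compares numerically, part by part.
keys = parts.agg(tuple, axis=1)
print(keys.tolist())
```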

Timings:

>>> %timeit df.reindex(df['TC'].str.extractall(r'(\d+)').unstack().astype(int).agg(tuple, 1).sort_values().index)
2.95 ms ± 40.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

>>> %timeit df.sort_values('TC', key=lambda ser: ser.str.extractall(r'(\d+)').unstack().astype(int).agg(tuple, 1))
2.91 ms ± 32.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

>>> %timeit df.sort_values('TC', key=lambda series:series.str.split(r'\D+', expand=True).astype(int).agg(tuple,1))
1.6 ms ± 5.88 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

I would do it this way. I think it would be faster.

df["range"] = df["TC"].apply(lambda x: [float(y) for y in x.split("-")])
df = df.sort_values(["range"], ascending=True).drop(["range"], axis="columns")
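
One thing to watch with a float-based key: it reads the part after the hyphen as a decimal number, so 1.10 equals 1.1. If the last part is meant as a counter (so that .10 comes after .9), an integer-tuple key behaves differently. The sketch below uses made-up labels and hypothetical helper names to show the difference. Which ordering is correct depends on what the IDs mean.

```python
import re

labels = ["1-1.2", "1-1.10", "1-1.9"]

def float_key(label):
    # Same idea as the lambda above: '1-1.10' -> [1.0, 1.1]
    return [float(part) for part in label.split("-")]

def int_key(label):
    # Every digit run as an integer: '1-1.10' -> (1, 1, 10)
    return tuple(int(n) for n in re.findall(r"\d+", label))

by_float = sorted(labels, key=float_key)
by_int = sorted(labels, key=int_key)
print(by_float)
print(by_int)
```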

EDIT: Since you asked about the case where the format is 1_1_2 instead of 1-1.2, I would do it this way:

df["range"] = df["TC"].apply(lambda x: tuple(x.split("_")))
df["range"] = df["range"].apply(lambda x: [float(x[0]), float("{}.{}".format(x[1], x[2]))])
df = df.sort_values(["range"], ascending=True).drop(["range"], axis="columns")
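
As an alternative sketch, a key that only looks at the digit runs does not care which separators are used, so the same function could serve both the 1-1.2 and the 1_1_2 format. The helper name numeric_key and the sample labels are assumptions for illustration.

```python
import re

def numeric_key(label):
    """Turn every run of digits into an int, ignoring the separators."""
    return tuple(int(n) for n in re.findall(r"\d+", label))

underscore_labels = ["1_10_1", "1_2_1", "1_1_2", "1_20_2"]
underscore_sorted = sorted(underscore_labels, key=numeric_key)
print(underscore_sorted)
```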

I have written a sort() function that should solve your problem.

import pandas as pd

df = pd.DataFrame({'TC': {0: '1-1.1', 1: '1-1.2', 2: '1-10.1', 3: '1-10.2', 4: '1-2.1', 5: '1-2.1', 6: '1-2.2', 7: '1-20.1', 8: '1-20.2', 9: '1-3.1'}, 'Case': {0: 'A', 1: 'B', 2: 'C', 3: 'D', 4: 'E', 5: 'F', 6: 'G', 7: 'H', 8: 'I', 9: 'J'}})

def sort(df):
    listTC = []
    for i in df['TC']:
        listTC.append(float(i[2:]))
    df1 = pd.DataFrame(list(zip(listTC, list(df['Case']))), columns=['TC', 'Case'])
    df_f = df1.sort_values(by=['TC'])
    listTC_final = []
    for i in df_f['TC']:
        listTC_final.append('1-' + str(i))
    df_Final = pd.DataFrame(list(zip(listTC_final, list(df_f['Case']))), columns=['TC', 'Case'])
    return df_Final

print(sort(df))

If you still have any questions, let me know. Thanks.


 