简体   繁体   中英

Ordering a dataframe by each column

I have a dataframe that looks like this:

   ID  Age  Score
0   9    5      3
1   4    6      1
2   9    7      2
3   3    2      1
4  12    1     15
5   2   25      6
6   9    5      4
7   9    5     61
8   4    2     12

I want to sort based on the first column, then the second column, and so on.

So I want my output to be this:

   ID  Age  Score
5   2   25      6
3   3    2      1
8   4    2     12
1   4    6      1
0   9    5      3
6   9    5      4
7   9    5     61
2   9    7      2
4  12    1     15

I know I can do the above with df.sort_values(df.columns.to_list()) , however I'm worried this might be quite slow for much larger dataframes (in terms of columns and rows).

Is there a more optimal solution?

You can use numpy.lexsort to improve performance.

import numpy as np

a = df.to_numpy()
out = pd.DataFrame(a[np.lexsort(np.rot90(a))],
                   index=df.index, columns=df.columns)

Assuming as input a random square DataFrame of side n :

df = pd.DataFrame(np.random.randint(0, 100, size=(n, n)))

here is the comparison for 100 to 100M items (slower runtime is the best):

numpy vs pandas 数据框排序

Same graph with the speed relative to pandas

numpy vs pandas 数据框排序

By still using df.sort_values() you can speed it up a bit by selecting the type of sorting algorithm. By default it's set to quicksort, but there is the alternatives of 'mergesort', 'heapsort' and 'stable'.

Maybe specifying one of these would improve it?

df.sort_values(df.columns.to_list(), kind="mergesort")

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM