
Sorting numpy arrays using lexsort

I am mainly interested in 2D arrays of shape Nx3 but the issue appears in arrays of shapes Nxm where m>1 as well. Specifically, I would like to sort an Nx3 array first based on its first column, then second, and finally third. So, assuming that we have array k given as

array([[0.90625, 0.90625, 0.15625],
       [0.40625, 0.40625, 0.15625],
       [0.40625, 0.90625, 0.65625],
       [0.15625, 0.90625, 0.40625],
       [0.90625, 0.40625, 0.90625],
       [0.40625, 0.65625, 0.15625],
       [0.40625, 0.65625, 0.65625],
       [0.15625, 0.65625, 0.40625],
       [0.65625, 0.15625, 0.90625],
       [0.40625, 0.15625, 0.15625],
       [0.40625, 0.90625, 0.40625],
       [0.65625, 0.40625, 0.40625],
       [0.15625, 0.15625, 0.90625],
       [0.40625, 0.40625, 0.40625],
       [0.65625, 0.90625, 0.40625],
       [0.90625, 0.15625, 0.40625]])

the desired (sorted) array should be

array([[0.15625, 0.15625, 0.90625],
       [0.15625, 0.65625, 0.40625],
       [0.15625, 0.90625, 0.40625],
       [0.40625, 0.15625, 0.15625],
       [0.40625, 0.40625, 0.15625],
       [0.40625, 0.40625, 0.40625],
       [0.40625, 0.65625, 0.15625],
       [0.40625, 0.65625, 0.65625],
       [0.40625, 0.90625, 0.40625],
       [0.40625, 0.90625, 0.65625],
       [0.65625, 0.15625, 0.90625],
       [0.65625, 0.40625, 0.40625],
       [0.65625, 0.90625, 0.40625],
       [0.90625, 0.15625, 0.40625],
       [0.90625, 0.40625, 0.90625],
       [0.90625, 0.90625, 0.15625]])

I thought I could achieve that with np.lexsort, but it is not working as expected, so I am probably missing something. So far, I've been doing the following

In [28]: k[np.lexsort((k[:,2], k[:,1], k[:,0]))]
Out[28]: 
array([[0.15625, 0.65625, 0.40625],
       [0.15625, 0.15625, 0.90625],
       [0.15625, 0.90625, 0.40625],
       [0.40625, 0.65625, 0.65625],
       [0.40625, 0.90625, 0.40625],
       [0.40625, 0.15625, 0.15625],
       [0.40625, 0.40625, 0.40625],
       [0.40625, 0.90625, 0.65625],
       [0.40625, 0.40625, 0.15625],
       [0.40625, 0.65625, 0.15625],
       [0.65625, 0.15625, 0.90625],
       [0.65625, 0.90625, 0.40625],
       [0.65625, 0.40625, 0.40625],
       [0.90625, 0.40625, 0.90625],
       [0.90625, 0.15625, 0.40625],
       [0.90625, 0.90625, 0.15625]])

It seems that the first column is sorted properly but the others are not. A similar question was asked before but I believe the accepted answer (which is essentially what I am doing) does not work.

From what I understood after looking a little bit more into it, I think it has to do with the values of the array being floats.

EDIT

I found the answer to my problem. However, I'll add it as an "edit" rather than posting it as an answer, because I believe this whole situation could have been avoided had I mentioned a fine detail about matrix k in my original post. Matrix k is created from another matrix a, where a is created by reading a matrix of floats with 16 decimals from a file. Now let's look at the workflow that led me to the solution.

In [6]: k=a[[1,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60]]

In [7]: k
Out[7]: 
array([[0.15625, 0.15625, 0.40625],
       [0.15625, 0.40625, 0.15625],
       [0.15625, 0.65625, 0.15625],
       [0.15625, 0.90625, 0.15625],
       [0.40625, 0.15625, 0.15625],
       [0.40625, 0.40625, 0.15625],
       [0.40625, 0.65625, 0.15625],
       [0.40625, 0.90625, 0.15625],
       [0.65625, 0.15625, 0.15625],
       [0.65625, 0.40625, 0.15625],
       [0.65625, 0.65625, 0.15625],
       [0.65625, 0.90625, 0.15625],
       [0.90625, 0.15625, 0.15625],
       [0.90625, 0.40625, 0.15625],
       [0.90625, 0.65625, 0.15625],
       [0.90625, 0.90625, 0.15625]])

In [8]: np.random.shuffle(k)

In [9]: k
Out[9]: 
array([[0.15625, 0.90625, 0.15625],
       [0.90625, 0.40625, 0.15625],
       [0.40625, 0.65625, 0.15625],
       [0.90625, 0.90625, 0.15625],
       [0.15625, 0.40625, 0.15625],
       [0.65625, 0.15625, 0.15625],
       [0.40625, 0.90625, 0.15625],
       [0.65625, 0.65625, 0.15625],
       [0.40625, 0.15625, 0.15625],
       [0.90625, 0.65625, 0.15625],
       [0.65625, 0.40625, 0.15625],
       [0.15625, 0.65625, 0.15625],
       [0.65625, 0.90625, 0.15625],
       [0.15625, 0.15625, 0.40625],
       [0.90625, 0.15625, 0.15625],
       [0.40625, 0.40625, 0.15625]])

In [10]: k[np.lexsort((k[:,2],k[:,1],k[:,0]))]
Out[10]: 
array([[0.15625, 0.40625, 0.15625],
       [0.15625, 0.65625, 0.15625],
       [0.15625, 0.90625, 0.15625],
       [0.15625, 0.15625, 0.40625],
       [0.40625, 0.65625, 0.15625],
       [0.40625, 0.90625, 0.15625],
       [0.40625, 0.15625, 0.15625],
       [0.40625, 0.40625, 0.15625],
       [0.65625, 0.15625, 0.15625],
       [0.65625, 0.40625, 0.15625],
       [0.65625, 0.65625, 0.15625],
       [0.65625, 0.90625, 0.15625],
       [0.90625, 0.15625, 0.15625],
       [0.90625, 0.40625, 0.15625],
       [0.90625, 0.65625, 0.15625],
       [0.90625, 0.90625, 0.15625]])

In [11]: k=np.round(k, 5)

In [12]: k[np.lexsort((k[:,2],k[:,1],k[:,0]))]
Out[12]: 
array([[0.15625, 0.15625, 0.40625],
       [0.15625, 0.40625, 0.15625],
       [0.15625, 0.65625, 0.15625],
       [0.15625, 0.90625, 0.15625],
       [0.40625, 0.15625, 0.15625],
       [0.40625, 0.40625, 0.15625],
       [0.40625, 0.65625, 0.15625],
       [0.40625, 0.90625, 0.15625],
       [0.65625, 0.15625, 0.15625],
       [0.65625, 0.40625, 0.15625],
       [0.65625, 0.65625, 0.15625],
       [0.65625, 0.90625, 0.15625],
       [0.90625, 0.15625, 0.15625],
       [0.90625, 0.40625, 0.15625],
       [0.90625, 0.65625, 0.15625],
       [0.90625, 0.90625, 0.15625]])

In [13]: np.savetxt(sys.stdout, a[[1,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60]], fmt='%.18f')
0.156250000000000000 0.156250000000000000 0.406250000000000000
0.156249999999999972 0.406250000000000000 0.156250000000000028
0.156249999999999972 0.656250000000000000 0.156250000000000028
0.156249999999999972 0.906250000000000000 0.156250000000000028
0.406250000000000000 0.156249999999999972 0.156250000000000028
0.406250000000000000 0.406250000000000000 0.156250000000000028
0.406249999999999944 0.656250000000000000 0.156250000000000028
0.406249999999999944 0.906250000000000000 0.156250000000000028
0.656250000000000000 0.156249999999999972 0.156250000000000028
0.656250000000000000 0.406249999999999944 0.156250000000000028
0.656250000000000000 0.656250000000000000 0.156250000000000028
0.656250000000000000 0.906250000000000000 0.156250000000000056
0.906250000000000000 0.156249999999999972 0.156250000000000028
0.906250000000000000 0.406249999999999944 0.156250000000000028
0.906250000000000000 0.656250000000000000 0.156250000000000056
0.906250000000000000 0.906250000000000000 0.156250000000000056

As the output above shows, it was all a matter of rounding errors. Everything looked fine when printed with a few decimals, but when the file was read and matrix a was created, some values were stored with inaccuracies after the 16th decimal place. These inaccuracies carried over to k when it was defined from a. lexsort was therefore giving the correct result from the beginning, given the actual numbers stored in the matrix. Everything worked as expected once I rounded matrix k.
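The effect can be reproduced in isolation. The following is a minimal sketch, assuming a hypothetical 3x3 array in which one entry sits one representable step below 0.15625 (built with np.nextafter); it is not the original data, only an illustration of why the raw and rounded sort orders differ.

```python
import numpy as np

# Hypothetical data: 0.15625 and the closest double just below it.
exact = 0.15625
below = np.nextafter(exact, 0.0)  # prints as 0.15625 at default precision

k = np.array([
    [exact, 0.40625, 0.15625],
    [below, 0.90625, 0.15625],
    [exact, 0.15625, 0.40625],
])

# lexsort treats the LAST key as the primary one, so this sorts by
# column 0, then column 1, then column 2.
order_raw = np.lexsort((k[:, 2], k[:, 1], k[:, 0]))

# After rounding, all three first-column values compare equal,
# so the second column decides the order.
k_rounded = np.round(k, 5)
order_rounded = np.lexsort((k_rounded[:, 2], k_rounded[:, 1], k_rounded[:, 0]))

print(order_raw)      # row 1 comes first because `below` < `exact`
print(order_rounded)  # ties in column 0 are now broken by column 1
```

In the raw array the row containing the slightly smaller value is placed first regardless of its second column, which is the same pattern seen in Out[10] above.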

Moral of the story: Always check the accuracy of your values.
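One possible way to run such a check, sketched here on a hypothetical column rather than on the original matrix a, is to compare the number of distinct values before and after rounding, and to print the values at full precision. The choice of 5 decimals mirrors the np.round call above and is an assumption about the data.

```python
import numpy as np

# Hypothetical column: three values that all print as 0.15625,
# one of which is actually one representable step smaller.
col = np.array([0.15625, np.nextafter(0.15625, 0.0), 0.15625])

# Count distinct values as stored, and after rounding to 5 decimals.
n_raw = np.unique(col).size
n_rounded = np.unique(np.round(col, 5)).size

# repr() of a Python float shows enough digits to tell the values apart.
full_precision = [repr(float(x)) for x in col]

print(n_raw, n_rounded)
print(full_precision)
```

If the raw count is larger than the rounded count, some values that look identical when printed are in fact different, and a sort will treat them as different.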

I think numpy isn't very convenient for this kind of operation, though I can't deny that some kind of solution exists. I recommend using other packages such as pandas or numpy_indexed (assuming data is your array):

pandas

import numpy as np
import pandas as pd
df = pd.DataFrame(data)
# sort by column 0, then column 1, then column 2
sorted_data = np.array(df.sort_values(by=[0,1,2]))

numpy_indexed

import numpy_indexed as npi
npi.sort(data)
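For completeness, a plain NumPy route also seems workable for a general Nxm array. The sketch below uses a hypothetical helper named sort_rows (not part of NumPy) that passes the transposed, column-reversed array to np.lexsort, so that column 0 becomes the primary key. The optional decimals argument is an assumption added for illustration: it rounds only the sort keys, which avoids the floating-point issue described in the question while leaving the returned values untouched.

```python
import numpy as np

def sort_rows(data, decimals=None):
    """Sort rows by column 0, then column 1, ..., then the last column.

    `sort_rows` is a hypothetical helper; `decimals` optionally rounds
    the sort keys only, the returned rows keep their original values.
    """
    data = np.asarray(data)
    keys = data if decimals is None else np.round(data, decimals)
    # lexsort uses the LAST key as primary, so reverse the column order.
    order = np.lexsort(keys.T[::-1])
    return data[order]

data = np.array([
    [0.40625, 0.90625, 0.15625],
    [0.15625, 0.65625, 0.40625],
    [0.40625, 0.15625, 0.15625],
    [0.15625, 0.15625, 0.90625],
])
print(sort_rows(data))
```

Note that any approach that sorts the raw floats, including the pandas one, will also treat values that differ only in the last digits as different unless they are rounded first.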

Sources

For more general use cases, you might like to check this answer
