简体   繁体   中英

sort very long list of integers quickly in python

I have a vector of a few hundred million elements that are np.uint8. They only range in value from 0 to 255.

I need to sort this list, and I think it should be more quickly handled than quicksort. I think I could find indices for all value "0" and put them at the front, then for all value "1" and put them after the last insert, and progress until I am done. It would be the mutant offspring of unique and sort with some indexing and should work very quickly.

Is there a built-in that does this well, properly, and in something fast like "C" without me having to homebrew it? Could you point me to it?

Let's say that as a "toy" of my actual problem that I wanted to sort the intensity values for each color (rgb) of a 100 megapixel version of the mandrill image where each color had been converted into a single, very long, vector of uint8 values. If I were to time the difference between sorting methods, which would be reasonable to compute, in python?

You might find that numpy.bincount is all you need. It counts the number of occurrences of each integer in the data.

For example, here's some random unsigned 8 bit integers:

In [100]: np.random.seed(8675309)                                                                                                                                                                      

In [101]: r = np.random.gamma(9, scale=8.5, size=100000).astype(np.uint8)                                                                                                                              

In [102]: r.min(), r.max()                                                                                                                                                                             
Out[102]: (11, 242)

Count the integers using bincount :

In [103]: b = np.bincount(r, minlength=256)                                                                                                                                                            

In [104]: b                                                                                                                                                                                            
Out[104]: 
array([   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          1,    1,    3,    1,    5,   13,    9,   17,   24,   31,   27,
         41,   55,   63,   96,  131,  146,  178,  210,  204,  268,  297,
        308,  367,  422,  480,  512,  584,  635,  669,  671,  759,  830,
        885,  934,  955, 1025, 1105, 1146, 1145, 1344, 1271, 1353, 1300,
       1456, 1419, 1451, 1504, 1499, 1561, 1600, 1509, 1678, 1621, 1643,
       1633, 1616, 1574, 1677, 1664, 1682, 1625, 1608, 1581, 1598, 1575,
       1583, 1524, 1493, 1381, 1448, 1399, 1422, 1249, 1322, 1225, 1278,
       1174, 1246, 1128, 1161, 1077,  999, 1033,  980,  981,  897,  917,
        880,  813,  779,  774,  697,  716,  651,  612,  657,  592,  556,
        497,  482,  474,  484,  445,  411,  399,  354,  368,  363,  342,
        313,  301,  293,  263,  241,  249,  244,  196,  215,  182,  189,
        172,  161,  139,  143,  142,  120,  121,  104,  103,  112,   88,
         82,   88,   67,   60,   83,   57,   63,   59,   50,   52,   55,
         40,   34,   34,   43,   35,   33,   28,   24,   26,   20,   18,
         21,   26,   30,   17,   15,   12,   17,   11,    7,    6,   16,
          8,    3,    4,   12,    9,    6,    5,    8,   10,    7,    1,
          4,    8,    5,    3,    2,    1,    0,    1,    1,    0,    1,
          0,    0,    4,    2,    2,    0,    0,    2,    0,    1,    1,
          1,    4,    0,    0,    0,    0,    0,    0,    0,    0,    2,
          1,    0,    0,    0,    0,    0,    1,    0,    0,    0,    0,
          0,    0,    0,    1,    0,    0,    0,    0,    0,    0,    0,
          1,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0])

So 0 occurs 0 times in r , 13 occurs 3 times, etc:

In [105]: b[0], b[13]                                                                                                                                                                                  
Out[105]: (0, 3)

You can do this without sorting using bincount and repeat:

In [11]: a = np.bincount(np.array([71, 9, 1, 2, 3, 4, 4, 4, 8, 9, 9, 71], dtype=np.uint8), minlength=256)

In [12]: np.repeat(np.arange(256, dtype=np.uint8), a)
Out[12]: array([ 1,  2,  3,  4,  4,  4,  8,  9,  9,  9, 71, 71], dtype=uint8)

If you only need the sorted values , then bincount is indeed the way to go. If, however, what you require is more argsort-like, for example, you need to co-sort some other same length array, then you can use this Q&A It compares various methods some of which are much faster than the 'naive' argsort .

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM