
Performance of sorting structured arrays (numpy)

I have an array with several fields that I want to sort with respect to two of them. One of these fields is binary, e.g.:

import numpy as np

size = 100000
data = np.empty(
    shape=2 * size,
    dtype=[('class', int),
           ('value', int)],
)

data['class'][:size] = 0
data['value'][:size] = (np.random.normal(size=size) * 10).astype(int)
data['class'][size:] = 1
data['value'][size:] = (np.random.normal(size=size, loc=0.5) * 10).astype(int)

np.random.shuffle(data)

I need the result to be sorted with respect to value, and for equal values class=0 should come first. Doing it like so (a):

idx = np.argsort(data, order=['value', 'class'])
data_sorted = data[idx]

seems to be an order of magnitude slower than sorting just data['value']. Is there a way to improve the speed, given that there are only two classes?

By experimenting randomly I noticed that an approach like this (b):

idx = np.argsort(data['value'])
data_sorted = data[idx]
idx = np.argsort(data_sorted, order=['value', 'class'], kind='mergesort')
data_sorted = data_sorted[idx]

takes ~20% less time than (a). Changing the field datatypes also seems to have some effect: floats instead of ints seem to be slightly faster.

The simplest way to do this is using the order parameter of np.sort:

np.sort(data, order=['value', 'class'])

However, this takes 121 ms to run on my computer, while sorting data['class'] and data['value'] on their own takes only 2.44 and 5.06 ms respectively. Interestingly, np.sort(data, order='class') takes 135 ms again, suggesting the problem is with sorting structured arrays.
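To reproduce this comparison, a minimal timing sketch along the following lines could be used. It rebuilds data as in the question, but with a seeded np.random.default_rng and a permutation in place of np.random.shuffle (both assumptions, made for repeatability). best_ms is a hypothetical helper, and the numbers will differ by machine and NumPy version.

```python
import timeit

import numpy as np

# Same layout as the question's array; the seed is an assumption for repeatability.
rng = np.random.default_rng(0)
size = 100_000
data = np.empty(2 * size, dtype=[('class', int), ('value', int)])
data['class'][:size] = 0
data['class'][size:] = 1
data['value'][:size] = (rng.normal(size=size) * 10).astype(int)
data['value'][size:] = (rng.normal(size=size, loc=0.5) * 10).astype(int)
data = data[rng.permutation(data.size)]  # shuffle the records


def best_ms(fn, repeat=3):
    """Best wall-clock time of fn() over `repeat` runs, in milliseconds."""
    return min(timeit.repeat(fn, number=1, repeat=repeat)) * 1e3


timings = {
    "structured, order=['value', 'class']":
        best_ms(lambda: np.sort(data, order=['value', 'class'])),
    "structured, order='class'":
        best_ms(lambda: np.sort(data, order='class')),
    "plain data['class']":
        best_ms(lambda: np.sort(data['class'])),
    "plain data['value']":
        best_ms(lambda: np.sort(data['value'])),
}
for label, ms in timings.items():
    print(f"{label:40s} {ms:8.2f} ms")
```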

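So, the approach you've taken of sorting each field using argsort and then indexing the final array seems to be on the right track. However, you need to sort each field individually: first by the secondary key (class), then by the primary key (value) with a stable sort so that the class order survives among equal values: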

idx = np.argsort(data['class'])
data_sorted = data[idx][np.argsort(data['value'][idx], kind='stable')]

This runs in 43.9 ms. You can get a very slight speedup by removing one temporary array from the indexing:

idx = np.argsort(data['class'])
tmp = data[idx]
data_sorted = tmp[np.argsort(tmp['value'], kind='stable')]

This runs in 40.8 ms. Not great, but it is a workaround if performance is critical.
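As a sanity check, the two-pass approach can be wrapped in a function and compared against the order-based sort on a small array. This is only a sketch: sort_by_value_then_class is a hypothetical name, and the random test data is an assumption rather than the question's exact array.

```python
import numpy as np


def sort_by_value_then_class(data):
    """Sort by 'value', ties broken by 'class', using two plain-array argsorts."""
    # Pass 1: order the records by the secondary key.
    tmp = data[np.argsort(data['class'])]
    # Pass 2: a stable sort on the primary key keeps the class order
    # among records that share the same value.
    return tmp[np.argsort(tmp['value'], kind='stable')]


rng = np.random.default_rng(0)
data = np.empty(1000, dtype=[('class', int), ('value', int)])
data['class'] = rng.integers(0, 2, size=data.size)
data['value'] = rng.integers(-30, 30, size=data.size)

expected = np.sort(data, order=['value', 'class'])
result = sort_by_value_then_class(data)
assert np.array_equal(result, expected)
```

The second argsort must be stable (kind='stable' or 'mergesort'); with the default kind, the class order within equal values is not guaranteed.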

This seems to be a known problem; see "sorting numpy structured and record arrays is very slow".

Edit: The source code for the comparisons used in sort can be seen at https://github.com/numpy/numpy/blob/dea85807c258ded3f75528cce2a444468de93bc1/numpy/core/src/multiarray/arraytypes.c.src. The numeric types are much, much simpler. Still, such a large difference in performance is surprising.

In addition to the good (general-purpose) answer of @user2699, in your specific case you can cheat, because the two fields of the structured array are of the same integer type and the values are relatively small (they fit in 32 bits). The cheat consists of the following steps:

  • subtract the minimum value of each field from all items of that field (to make them non-negative) using arr - np.min(arr)
  • convert each field to np.uint64 with arr.astype(np.uint64)
  • pack the two fields into one integer array using: (class_arr << 32) | value_arr (the field in the high bits acts as the primary sort key, so swap the two fields to sort by value first, as the question asks)
  • sort the resulting array using np.sort
  • unpack the array using: class_arr = sorted_arr >> 32 and value_arr = sorted_arr & ((1<<32)-1), then add the subtracted minimums back

This strategy is significantly faster than using two np.argsort calls, which are pretty expensive. This is especially true for bigger arrays, since sorting a big array is even more expensive and np.sort is cheaper than np.argsort. Not to mention that indirect indexing is relatively slow on big arrays because of the unpredictable pseudo-random memory access pattern and the high latency of RAM. The downside of this approach is that it is a bit trickier to implement and it does not apply in all cases.
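The packing steps could be sketched roughly as follows. This is a hedged illustration, not a tuned implementation: packed_sort is a hypothetical name, the array is assumed to be non-empty and to contain only the two fields 'class' and 'value', and both shifted fields are assumed to fit in 32 bits. To get the ordering the question asks for (by value, then class), this sketch puts value in the high bits and class in the low bits.

```python
import numpy as np


def packed_sort(data):
    """Sort by ('value', 'class') by packing both fields into one uint64 key."""
    v_min = data['value'].min()
    c_min = data['class'].min()
    # Shift each field so it is non-negative, then widen to uint64.
    value = (data['value'] - v_min).astype(np.uint64)
    cls = (data['class'] - c_min).astype(np.uint64)
    # Assumption: each shifted field fits in 32 bits.
    if value.max() >= 2**32 or cls.max() >= 2**32:
        raise ValueError("fields do not fit in 32 bits after shifting")
    # Primary key (value) in the high 32 bits, tie-breaker (class) in the low 32.
    packed = (value << np.uint64(32)) | cls
    packed.sort()
    # Unpack and undo the shift.
    out = np.empty_like(data)
    out['value'] = (packed >> np.uint64(32)).astype(data['value'].dtype) + v_min
    out['class'] = (packed & np.uint64(0xFFFFFFFF)).astype(data['class'].dtype) + c_min
    return out
```

Because the result is rebuilt from the packed keys alone, any additional fields in the array would be lost. In that case the argsort-based approach is the safer choice.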
