I have an array with several fields, which I want to sort with respect to two of them. One of these fields is binary, e.g.:
import numpy as np

size = 100000
data = np.empty(
    shape=2 * size,
    dtype=[('class', int),
           ('value', int)],
)
data['class'][:size] = 0
data['value'][:size] = (np.random.normal(size=size) * 10).astype(int)
data['class'][size:] = 1
data['value'][size:] = (np.random.normal(size=size, loc=0.5) * 10).astype(int)
np.random.shuffle(data)
I need the result to be sorted with respect to value, and for equal values class=0 should go first. Doing it like so (a):
idx = np.argsort(data, order=['value', 'class'])
data_sorted = data[idx]
seems to be an order of magnitude slower than sorting just data['value']. Is there a way to improve the speed, given that there are only two classes?
By experimenting randomly I noticed that an approach like this (b):
idx = np.argsort(data['value'])
data_sorted = data[idx]
idx = np.argsort(data_sorted, order=['value', 'class'], kind='mergesort')
data_sorted = data_sorted[idx]
takes ~20% less time than (a). Changing the field datatypes also seems to have some effect: floats instead of ints seem to be slightly faster.
The simplest way to do this is to use the order parameter of np.sort:
np.sort(data, order=['value', 'class'])
However, this takes 121 ms to run on my computer, while sorting data['class'] and data['value'] on their own takes only 2.44 and 5.06 ms respectively. Interestingly, np.sort(data, order='class') also takes 135 ms, suggesting the problem is with sorting structured arrays.
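To reproduce this kind of comparison, a small timing harness along the following lines can be used. It is only a sketch: make_data and time_ms are hypothetical helper names, the data mirrors the setup from the question, and the absolute numbers will differ from machine to machine and between NumPy versions.

```python
import timeit
import numpy as np

def make_data(size=100_000, seed=0):
    """Build a shuffled structured array like the one in the question."""
    rng = np.random.default_rng(seed)
    data = np.empty(2 * size, dtype=[('class', int), ('value', int)])
    data['class'][:size] = 0
    data['value'][:size] = (rng.normal(size=size) * 10).astype(int)
    data['class'][size:] = 1
    data['value'][size:] = (rng.normal(size=size, loc=0.5) * 10).astype(int)
    # Shuffle by indexing with a random permutation.
    return data[rng.permutation(len(data))]

def time_ms(fn, repeat=3):
    """Best-of-`repeat` wall time of one call to fn, in milliseconds."""
    return min(timeit.repeat(fn, number=1, repeat=repeat)) * 1e3

data = make_data()
timings = {
    "structured, order=['value', 'class']":
        time_ms(lambda: np.sort(data, order=['value', 'class'])),
    "structured, order='class'":
        time_ms(lambda: np.sort(data, order='class')),
    "plain data['class']": time_ms(lambda: np.sort(data['class'])),
    "plain data['value']": time_ms(lambda: np.sort(data['value'])),
}
for label, ms in timings.items():
    print(f"{label}: {ms:.2f} ms")
```

Taking the minimum over a few repeats reduces noise from other processes; it does not remove it.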
So the approach you've taken, sorting with argsort and then indexing the array, seems to be on the right track. However, you need to sort each field individually:
idx = np.argsort(data['class'])
data_sorted = data[idx][np.argsort(data['value'][idx], kind='stable')]
This runs in 43.9 ms. You can get a very slight speedup by removing one temporary array from the indexing:
idx = np.argsort(data['class'])
tmp = data[idx]
data_sorted = tmp[np.argsort(tmp['value'], kind='stable')]
This runs in 40.8 ms. Not great, but it is a workaround if performance is critical.
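The two-pass idea works because the second sort is stable: rows with equal value keep the class order established by the first pass. A minimal sketch that wraps it in a function and checks it against the structured sort on a small sample (sort_by_value_then_class and the sample array are illustrative names, not part of the original code):

```python
import numpy as np

def sort_by_value_then_class(data):
    """Sort by 'value', breaking ties by 'class', using two plain argsorts."""
    # Pass 1: order rows by the secondary key.
    idx = np.argsort(data['class'])
    tmp = data[idx]
    # Pass 2: stable sort by the primary key, so rows with equal
    # 'value' keep the 'class' ordering from pass 1.
    return tmp[np.argsort(tmp['value'], kind='stable')]

# Small random sample with many ties in 'value'.
rng = np.random.default_rng(0)
sample = np.empty(1000, dtype=[('class', int), ('value', int)])
sample['class'] = rng.integers(0, 2, size=1000)
sample['value'] = rng.integers(-30, 30, size=1000)

fast = sort_by_value_then_class(sample)
reference = np.sort(sample, order=['value', 'class'])
```

Without kind='stable' on the second pass, the tie order between the two classes would not be guaranteed.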
This seems to be a known problem: sorting numpy structured and record arrays is very slow.
Edit: The source code for the comparisons used in sort can be seen at https://github.com/numpy/numpy/blob/dea85807c258ded3f75528cce2a444468de93bc1/numpy/core/src/multiarray/arraytypes.c.src . The comparisons for the numeric types are much, much simpler. Still, such a large difference in performance is surprising.
In addition to the good (general-purpose) answer of @user2699, in your specific case you can cheat, because the two fields of the structured array are of the same integer type and the values are relatively small (they fit in 32 bits). The cheat consists of the following steps:
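1. Shift each field so that its values are non-negative: arr - np.min(arr)
2. Convert both fields to np.uint64 with astype
3. Pack the two fields into a single 64-bit key: (class_arr << 32) | value_arr. The field placed in the high bits is the primary sort key, so swap the two fields to sort by value first.
4. Sort the packed keys with np.sort
5. Unpack the fields: class_arr = sorted_arr >> 32 and value_arr = sorted_arr & ((1<<32)-1), then add back the minimum subtracted in step 1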
This strategy is significantly faster than using two np.argsort calls, which are pretty expensive. This is especially true for bigger arrays, since sorting a big array is even more expensive and np.sort is cheaper than np.argsort. Not to mention that indirect indexing is relatively slow on big arrays because of the unpredictable, pseudo-random memory access pattern and the high latency of RAM. The downside of this approach is that it is a bit trickier to implement and it does not apply in all cases.
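A minimal sketch of the packing trick, under several assumptions that are not guaranteed in general: the array is non-empty, it has only the two fields 'class' and 'value' (any other fields would be lost, because only the packed keys are sorted), and both fields span a range that fits in 32 bits after subtracting their minimum. The function name sort_packed is hypothetical. Since the question asks for value as the primary key, this sketch places value in the high 32 bits and class in the low 32 bits.

```python
import numpy as np

def sort_packed(data):
    """Sort by 'value' then 'class' via a single sort of packed 64-bit keys."""
    value = data['value'].astype(np.int64)
    cls = data['class'].astype(np.int64)
    v_min, c_min = value.min(), cls.min()

    # Shift to non-negative values, then reinterpret as unsigned 64-bit.
    value_u = (value - v_min).astype(np.uint64)
    class_u = (cls - c_min).astype(np.uint64)

    # Assumption: each shifted field fits in 32 bits.
    if value_u.max() >= 2**32 or class_u.max() >= 2**32:
        raise ValueError("fields do not fit in 32 bits after shifting")

    # Primary key in the high bits, secondary key in the low bits.
    packed = (value_u << np.uint64(32)) | class_u
    packed.sort()

    # Unpack and undo the shift.
    out = np.empty(len(data), dtype=data.dtype)
    out['value'] = (packed >> np.uint64(32)).astype(np.int64) + v_min
    out['class'] = (packed & np.uint64(0xFFFFFFFF)).astype(np.int64) + c_min
    return out

# Small sample with negative values and many ties.
rng = np.random.default_rng(1)
sample = np.empty(1000, dtype=[('class', int), ('value', int)])
sample['class'] = rng.integers(0, 2, size=1000)
sample['value'] = rng.integers(-30, 30, size=1000)

packed_sorted = sort_packed(sample)
reference = np.sort(sample, order=['value', 'class'])
```

Because the rows are fully described by the two keys here, rebuilding the array from the sorted keys gives the same result as reordering the original rows.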