
Performance of sorting structured arrays (numpy)

I have an array with several fields, which I want to sort with respect to two of them. One of these fields is binary, e.g.:

import numpy as np

size = 100000
data = np.empty(
    shape=2 * size,
    dtype=[('class', int),
           ('value', int)],
)

data['class'][:size] = 0
data['value'][:size] = (np.random.normal(size=size) * 10).astype(int)
data['class'][size:] = 1
data['value'][size:] = (np.random.normal(size=size, loc=0.5) * 10).astype(int)

np.random.shuffle(data)

I need the result to be sorted with respect to value, and for equal values class=0 should go first. Doing it like so (a):

idx = np.argsort(data, order=['value', 'class'])
data_sorted = data[idx]

seems to be an order of magnitude slower than sorting just data['value']. Is there a way to improve the speed, given that there are only two classes?

By experimenting randomly I noticed that an approach like this (b):

idx = np.argsort(data['value'])
data_sorted = data[idx]
idx = np.argsort(data_sorted, order=['value', 'class'], kind='mergesort')
data_sorted = data_sorted[idx]

takes ~20% less time than (a). Changing the field datatypes also seems to have some effect: floats instead of ints seem to be slightly faster.

The simplest way to do this is using the order parameter of sort:

np.sort(data, order=['value', 'class'])

However, this takes 121 ms to run on my computer, while sorting data['class'] and data['value'] on their own takes only 2.44 and 5.06 ms respectively. Interestingly, sort(data, order='class') also takes 135 ms, suggesting the problem is with sorting structured arrays.
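To reproduce this kind of comparison, a minimal timing sketch might look like the following. The data layout follows the question; the helper name best_ms, the fixed seed, and the repeat count are assumptions made for illustration, and the absolute numbers will differ between machines and NumPy versions.

```python
import timeit

import numpy as np

# Build data shaped like the question's example (seed chosen arbitrarily).
rng = np.random.default_rng(0)
size = 100_000
data = np.empty(2 * size, dtype=[('class', int), ('value', int)])
data['class'] = np.repeat([0, 1], size)
data['value'] = (rng.normal(size=2 * size) * 10).astype(int)
data = data[rng.permutation(2 * size)]  # shuffle the records


def best_ms(fn, repeat=3):
    """Hypothetical helper: best wall-clock time of fn() in milliseconds."""
    return min(timeit.repeat(fn, number=1, repeat=repeat)) * 1e3


timings = {
    "structured sort on both fields": best_ms(
        lambda: np.sort(data, order=['value', 'class'])),
    "plain sort of data['class']": best_ms(lambda: np.sort(data['class'])),
    "plain sort of data['value']": best_ms(lambda: np.sort(data['value'])),
}

for label, ms in timings.items():
    print(f"{label}: {ms:.2f} ms")
```

Taking the minimum over a few repeats reduces noise from other processes; for careful measurements, IPython's %timeit is usually more convenient.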

So, the approach you've taken of sorting with argsort and then indexing the final array seems to be on the right track. However, you need to sort each field individually: first by the secondary key (class), then with a stable sort by the primary key (value):

idx = np.argsort(data['class'])
data_sorted = data[idx][np.argsort(data['value'][idx], kind='stable')]

This runs in 43.9 ms. You can get a very slight speedup by removing one temporary array from the indexing:

idx = np.argsort(data['class'])
tmp = data[idx]
data_sorted = tmp[np.argsort(tmp['value'], kind='stable')]

This runs in 40.8 ms. Not great, but it is a workaround if performance is critical.
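As a sanity check, the two-pass approach can be compared against the slow structured sort. The sketch below assumes an array with only the two fields from the question, so records that tie on both keys are identical and the outputs can be compared exactly. It also shows np.lexsort, which is not part of the answer above but is NumPy's built-in multi-key indirect sort (the last key passed is the primary one); whether it is faster on a given machine is not claimed here.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
data = np.empty(n, dtype=[('class', int), ('value', int)])
data['class'] = rng.integers(0, 2, size=n)
data['value'] = (rng.normal(size=n) * 10).astype(int)

# Reference result: the slow structured sort.
expected = np.sort(data, order=['value', 'class'])

# Two-pass approach: secondary key first, then a stable sort on the primary key.
tmp = data[np.argsort(data['class'])]
two_pass = tmp[np.argsort(tmp['value'], kind='stable')]

# Alternative (an addition, not from the answer): np.lexsort, last key is primary.
lex = data[np.lexsort((data['class'], data['value']))]

# Compare field by field.
same = all(
    np.array_equal(expected[name], two_pass[name])
    and np.array_equal(expected[name], lex[name])
    for name in data.dtype.names
)
print("all three orderings agree:", same)
```

If the real array has additional fields, records that tie on both keys may come out in a different relative order between the methods, so a field-by-field comparison on the two keys alone would be the appropriate check.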

This seems to be a known problem: sorting numpy structured and record arrays is very slow.

Edit: The source code for the comparisons used in sort can be seen at https://github.com/numpy/numpy/blob/dea85807c258ded3f75528cce2a444468de93bc1/numpy/core/src/multiarray/arraytypes.c.src . The comparisons for the numeric types are much, much simpler. Still, that large a difference in performance is surprising.

In addition to the good (general-purpose) answer of @user2699, in your specific case you can cheat, because the two fields of the structured array are of the same integer type and the values are relatively small (they fit in 32 bits). The cheat consists of the following steps:

  • subtract each field's minimum value from all items of that field (to make them non-negative) using arr - np.min(arr)
  • convert each field to np.uint64 with arr.astype(np.uint64)
  • pack the two fields into one integer array, with the primary sort key (value) in the high bits: (value_arr << 32) | class_arr
  • sort the resulting array using np.sort
  • unpack the array using: value_arr = sorted_arr >> 32 and class_arr = sorted_arr & ((1<<32)-1), then add the subtracted minimums back
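The steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the array has only the two fields class and value, both shifted fields fit in 32 bits, and the names value_u, class_u, packed and out are chosen here for illustration. The key that dominates the ordering (value) is placed in the high 32 bits so that a plain integer sort orders by value first and by class second.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
data = np.empty(n, dtype=[('class', int), ('value', int)])
data['class'] = rng.integers(0, 2, size=n)
data['value'] = (rng.normal(size=n) * 10).astype(int)  # contains negatives

# 1. Shift each field so that all values are non-negative.
v_min = data['value'].min()
c_min = data['class'].min()

# 2. Convert to unsigned 64-bit integers.
value_u = (data['value'] - v_min).astype(np.uint64)
class_u = (data['class'] - c_min).astype(np.uint64)

# 3. Pack: primary key (value) in the high 32 bits, secondary key in the low 32.
packed = (value_u << np.uint64(32)) | class_u

# 4. A single sort of plain integers.
packed.sort()

# 5. Unpack and undo the shift.
mask = np.uint64((1 << 32) - 1)
out = np.empty_like(data)
out['value'] = (packed >> np.uint64(32)).astype(np.int64) + v_min
out['class'] = (packed & mask).astype(np.int64) + c_min

# Check against the slow structured sort.
expected = np.sort(data, order=['value', 'class'])
matches = all(np.array_equal(out[f], expected[f]) for f in data.dtype.names)
print("packed sort matches structured sort:", matches)
```

Because the records are rebuilt from the packed keys, any additional fields in the original array would not be carried along; in that case an index-based approach is needed instead.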

This strategy is significantly faster than using two np.argsort calls, which are pretty expensive. This is especially true for bigger arrays, since sorting big arrays is even more expensive and np.sort is cheaper than np.argsort. Not to mention that indirect indexing is relatively slow on big arrays because of the unpredictable pseudo-random memory access pattern and the high latency of RAM. The downside of this approach is that it is a bit trickier to implement and it does not apply in all cases.
