
Fast merge of sorted subsets of 4K floating-point numbers in L1/L2

What is a fast way to merge sorted subsets of an array of up to 4096 32-bit floating-point numbers on a modern (SSE2+) x86 processor?

Please assume the following:

  • The size of the entire set is at most 4096 items
  • The size of the subsets is open to discussion, but let us assume between 16-256 initially
  • All data used through the merge should preferably fit into L1
  • The L1 data cache size is 32K. 16K has already been used for the data itself, so you have 16K to play with
  • All data is already in L1 (with as high a degree of confidence as possible) - it has just been operated on by a sort
  • All data is 16-byte aligned
  • We want to try to minimize branching (for obvious reasons)

Main criterion of feasibility: faster than an in-L1 LSD radix sort.

I'd be very interested to see if someone knows of a reasonable way to do this given the above parameters! :)

Here's a very naive way to do it. (Please excuse any 4am delirium-induced pseudo-code bugs ;)

//4x sorted subsets
data[4][4] = {
  {3, 4, 5, INF},
  {2, 7, 8, INF},
  {1, 4, 4, INF},
  {5, 8, 9, INF}
}

data_offset[4] = {0, 0, 0, 0}

n = 4*3

for(i=0, i<n, i++):
  sub = 0
  sub = 1 * (data[sub][data_offset[sub]] > data[1][data_offset[1]])
  sub = 2 * (data[sub][data_offset[sub]] > data[2][data_offset[2]])
  sub = 3 * (data[sub][data_offset[sub]] > data[3][data_offset[3]])

  out[i] = data[sub][data_offset[sub]]
  data_offset[sub]++


Edit:
With AVX2 and its gather support, we could compare up to 8 subsets at once.


Edit 2:
Depending on type casting, it might be possible to shave off 3 extra clock cycles per iteration on a Nehalem (mul: 5, shift+sub: 4)

//Assuming 'sub' is uint32_t
sub = ... << ((data[sub][data_offset[sub]] > data[...][data_offset[...]]) - 1)


Edit 3:
It may be possible to exploit out-of-order execution to some degree, especially as K gets larger, by using two or more max values:

max1 = 0
max2 = 1
max1 = 2 * (data[max1][data_offset[max1]] > data[2][data_offset[2]])
max2 = 3 * (data[max2][data_offset[max2]] > data[3][data_offset[3]])
...
max1 = 6 * (data[max1][data_offset[max1]] > data[6][data_offset[6]])
max2 = 7 * (data[max2][data_offset[max2]] > data[7][data_offset[7]])

q = data[max1][data_offset[max1]] < data[max2][data_offset[max2]]

sub = max1*q + max2*(1 - q)


Edit 4:

Depending on compiler intelligence, we can remove the multiplications altogether by using the ternary operator:

sub = (data[sub][data_offset[sub]] > data[x][data_offset[x]]) ? x : sub


Edit 5:

In order to avoid costly floating-point comparisons, we could simply reinterpret_cast<uint32_t*>() the data, as this would result in an integer compare.

Another possibility is to utilize SSE registers, as these are not typed, and explicitly use integer comparison instructions.

This works because the operators < > == yield the same results when a float is interpreted at the binary level.


Edit 6:

If we unroll our loop sufficiently to match the number of values to the number of SSE registers, we could stage the data that is being compared.

At the end of an iteration we would then re-transfer the register which contained the selected maximum/minimum value, and shift it.

Although this requires reworking the indexing slightly, it may prove more efficient than littering the loop with LEAs.

This is more of a research topic, but I did find this paper discussing minimizing branch mispredictions using d-way merge sort.

The most obvious answer that comes to mind is a standard N-way merge using a heap. That'll be O(N log k). The number of subsets is between 16 and 256, so the worst-case behavior (with 256 subsets of 16 items each) would be 8N.

Cache behavior should be ... reasonable, although not perfect. The heap, where most of the action is, will probably remain in the cache throughout. The part of the output array being written to will also most likely be in the cache.

What you have is 16K of data (the array with sorted subsequences), the heap (1K, worst case), and the sorted output array (16K again), and you want it to fit into a 32K cache. Sounds like a problem, but perhaps it isn't. The data that will most likely be swapped out is the front of the output array after the insertion point has moved. Assuming that the sorted subsequences are fairly uniformly distributed, they should be accessed often enough to keep them in the cache.

You can merge int arrays (the expensive part) branch-free.

typedef unsigned uint;
typedef uint* uint_ptr;

void merge(uint *in1_begin, uint *in1_end, uint *in2_begin, uint *in2_end, uint *out){

  uint_ptr in [] = {in1_begin, in2_begin};
  uint_ptr in_end [] = {in1_end, in2_end};

  // the loop branch is cheap because it is easily predictable
  while(in[0] != in_end[0] && in[1] != in_end[1]){
    uint i = (*in[1] - *in[0]) >> 31;  // 1 iff *in[1] < *in[0]
    *out = *in[i];
    ++out;
    ++in[i];
  }

  // copy the remaining elements ...
}

Note that (*in[1] - *in[0]) >> 31 is 1 exactly when the unsigned subtraction wraps around, i.e. when *in[1] < *in[0], so in[i] always points at the smaller head. The reason I wrote it down using the bitshift trick instead of

uint i = *in[1] < *in[0];

is that not all compilers generate branch-free code for the < version.

Unfortunately you are using floats instead of ints, which at first seems like a showstopper, because I do not see how to reliably implement *in[0] < *in[1] branch-free. However, on most modern architectures you can interpret the bit patterns of positive floats (that also are no NaNs, INFs or such strange things) as ints, compare them using <, and still get the correct result. Perhaps you can extend this observation to arbitrary floats.

SIMD sorting algorithms have already been studied in detail. The paper Efficient Implementation of Sorting on Multi-Core SIMD CPU Architecture describes an efficient algorithm for doing what you describe (and much more).

The core idea is that you can reduce merging two arbitrarily long lists to merging blocks of k consecutive values (where k can range from 4 to 16): the first block is z[0] = merge(x[0], y[0]).lo . To obtain the second block, we know that the leftover merge(x[0], y[0]).hi contains nx elements from x and ny elements from y, with nx+ny == k. But z[1] cannot contain elements from both x[1] and y[1], because that would require z[1] to contain more than nx+ny elements: so we just have to find out which of x[1] and y[1] needs to be added. The one with the lower first element will necessarily appear first in z, so this is simply done by comparing their first elements. And we just repeat that until there is no more data to merge.

Pseudo-code, assuming the arrays end with a +inf value:

a := *x++
b := *y++
while not finished:
    lo,hi := merge(a,b)
    *z++ := lo
    a := hi
    if (*x)[0] <= (*y)[0]:
        b := *x++
    else:
        b := *y++

(note how similar this is to the usual scalar implementation of merging)

The conditional jump is of course not necessary in an actual implementation: for example, you could conditionally swap x and y with an xor trick, and then read unconditionally *x++ .

merge itself can be implemented with a bitonic sort. But if k is low, there will be a lot of inter-instruction dependencies resulting in high latency. Depending on the number of arrays you have to merge, you can then choose k high enough so that the latency of merge is masked, or, if possible, interleave several two-way merges. See the paper for more details.


Edit: Below is a diagram for k = 4. All asymptotics assume that k is fixed.

  • The big gray box is merging two arrays of size n = m * k (in the picture, m = 3).

    (diagram: two size-n arrays merged block-by-block through a chain of "merge4" boxes)

    1. We operate on blocks of size k.
    2. The "whole-block merge" box merges the two arrays block-by-block by comparing their first elements. This is a linear-time operation, and it doesn't consume memory because we stream the data to the rest of the block. The performance doesn't really matter because the latency is going to be limited by the latency of the "merge4" blocks.
    3. Each "merge4" box merges two blocks, outputs the lower k elements, and feeds the upper k elements to the next "merge4". Each "merge4" box performs a bounded number of operations, and the number of "merge4" boxes is linear in n.
    4. So the time cost of merging is linear in n. And because "merge4" has a lower latency than performing 8 serial non-SIMD comparisons, there will be a large speedup compared to non-SIMD merging.
  • Finally, to extend our 2-way merge to merge many arrays, we arrange the big gray boxes in classical divide-and-conquer fashion. Each level has complexity linear in the number of elements, so the total complexity is O(n log(n / n0)), with n0 the initial size of the sorted arrays and n the size of the final array.

    (diagram: divide-and-conquer arrangement of the two-way merges)

You could do a simple merge kernel to merge K lists:

float *input[K]; // head pointer into each of the K sorted lists
float *output;   // each list is terminated by a SENTINEL value (e.g. +inf)

while (true) {
  float min = *input[0];
  int min_idx = 0;
  for (int i = 1; i < K; i++) {
    float v = *input[i];
    if (v < min) {
      min = v;     // do with cmov
      min_idx = i; // do with cmov
    }
  }
  if (min == SENTINEL) break;
  *output++ = min;
  input[min_idx]++;
}

There's no heap, so it is pretty simple. The bad part is that it is O(NK), which can be bad if K is large (unlike the heap implementation, which is O(N log K)). So then you just pick a maximum K (4 or 8 might be good, then you can unroll the inner loop), and do larger K by cascading merges (handle K=64 by doing 8-way merges of groups of lists, then an 8-way merge of the results).

