Sorting 64-bit structs using AVX?

I have a 64-bit struct which represents several pieces of data, one of which is a floating-point value:

struct MyStruct{
    uint16_t a;
    uint16_t b;
    float f;
}; 

and I have four of these structs in, let's say, an std::array<MyStruct, 4>.

Is it possible to use AVX to sort the array in terms of the float member MyStruct::f?

Sorry this answer is messy; it didn't all get written at once, and I'm lazy. There is some duplication.

I have 4 separate ideas:

  1. Normal sorting, but moving the struct as a 64-bit unit

  2. Vectorized insertion-sort as a building block for qsort

  3. Sorting networks, with a comparator implementation using cmpps / blendvpd instead of minps / maxps. The extra overhead might kill the speedup, though.

  4. Sorting networks: load some structs, then shuffle/blend to get some registers of just floats and some registers of just payload. Use Timothy Furtak's technique of doing a normal minps / maxps comparator, and then cmpeqps min,orig -> masked xor-swap on the payload. This sorts twice as much data per comparator, but does require matching shuffles on two registers between comparators. Also requires re-interleaving when you're done (but that's easy with unpcklps / unpckhps, if you arrange your comparators so those in-lane unpacks will put the final data in the right order).

This also avoids potential slowdowns that some CPUs may have when doing FP comparisons on bit patterns in the payload that represent denormals, NaNs, or infinities, without resorting to setting the denormals-are-zero bit in the MXCSR.

Furtak's paper suggests doing a scalar cleanup after getting things mostly sorted with vectors, which would reduce the amount of shuffling a lot.

Normal sorting

There's at least a small speedup to be gained when using normal sorting algorithms: move the whole struct around with 64-bit loads/stores, and do a scalar FP compare on the FP element. For this idea to work as well as possible, order your struct with the float value first; then you could movq a whole struct into an xmm reg, and the float value would be in the low 32 bits, ready for ucomiss. Then you (or maybe a smart compiler) could store the struct with a movq.
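As a portable-C sketch of this idea (the KeyFirst layout and cmp_by_f name are hypothetical; the 64-bit memcpy and low-dword key mirror what a movq load plus ucomiss would do):

```c
#include <stdint.h>
#include <string.h>
#include <stdlib.h>

/* Hypothetical float-first layout, so the sort key sits in the low 32 bits
 * of the 64-bit unit (the low dword after a movq load). */
struct KeyFirst { float f; uint16_t a, b; };

/* qsort comparator: grab each struct as one 64-bit unit (like movq),
 * then do a scalar FP compare on the low 32 bits (like ucomiss). */
int cmp_by_f(const void *pa, const void *pb) {
    uint64_t ua, ub;
    memcpy(&ua, pa, sizeof ua);        /* whole-struct 64-bit copy */
    memcpy(&ub, pb, sizeof ub);
    float fa, fb;
    memcpy(&fa, &ua, sizeof fa);       /* key = low 32 bits */
    memcpy(&fb, &ub, sizeof fb);
    return (fa > fb) - (fa < fb);      /* ascending by f */
}
```

Used as `qsort(buf, n, sizeof(struct KeyFirst), cmp_by_f);` — the payload travels with the key automatically because the whole 8-byte unit moves together.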

Looking at the asm output that Kerrek SB linked to, compilers seem to do a rather bad job of efficiently copying structs around:

icc seems to movzx the two uint values separately, rather than scooping up the whole struct in a 64-bit load. Maybe it doesn't pack the struct? gcc 5.1 doesn't seem to have that problem most of the time.

Speeding up insertion-sort

Big sorts usually divide-and-conquer, with insertion sort for small-enough problems. Insertion sort copies array elements over by one, stopping only when we find we've reached the spot where the current element belongs. So we need to compare one element to a sequence of packed elements, stopping if the comparison is true for any of them. Do you smell vectors? I smell vectors.

# RSI points to  struct { float f; uint... payload; } buf[];
# RDI points to the next element to be inserted into the sorted portion
# [ rsi to rdi ) is sorted, the rest isn't.
##### PROOF OF CONCEPT: debug / finish writing before using!  ######

.new_elem:
vbroadcastsd ymm0, [rdi]      # broadcast the whole struct
mov rdx, rdi

.search_loop:
    sub        rdx, 32
    vmovups    ymm1, [rdx]    # load some sorted data
    vcmplt_oqps ymm2, ymm0, ymm1   # all-ones in any element where ymm0[i] < ymm1[i] (FP compare, false if either is NaN).
    vmovups    [rdx+8], ymm1  # shuffle it over to make space, usual insertion-sort style
    cmp        rdx, rsi
    jbe     .endsearch        # below-or-equal (addresses are unsigned)
    movmskps   eax, ymm2
    test       al, 0b01010101 # test only the compare results for the float (even) elements
    jz      .search_loop      # [rdi] wasn't less than any of the 4 elements

.endsearch:
# TODO: scalar loop to find out where the new element goes.
#  All we know is that it's less than one of the elements in ymm1, but not which
add           rdi, 8
vmovsd         [rdx], xmm0   # store the new element (vmovsd takes an xmm, not ymm)
cmp           rdi, r8   # pointer to the end of the buf
jle           .new_elem

  # worse alternative to movmskps / test:
  # vtestps    ymm2, ymm7     # where ymm7 is loaded with 1s in the odd (float) elements, and 0s in the even (payload) elements.
  # vtestps is like PTEST, but only tests the high bit.  If the struct was in the other order, with the float high, vtestpd against a register of all-1s would work, as that's more convenient to generate.

This is certainly full of bugs, and I should have just written it in C with intrinsics.

This is an insertion sort with probably more overhead than most, one that might lose to a scalar version for very small problem sizes, due to the extra complexity of handling the first few elements (they don't fill a vector), and of figuring out where to put the new element after breaking out of the vector search loop that checked multiple elements.

Probably pipelining the loop, so we haven't stored ymm1 until the next iteration (or after breaking out), would save a redundant store. Doing the compares in registers by shifting / shuffling them, instead of literally doing scalar load/compares, would probably be a win. This could end up with way too many unpredictable branches, and I'm not seeing a nice way to end up with the high 4 packed in a reg for vmovups, and the low one in another reg for vmovsd.

I may have invented an insertion sort that's the worst of both worlds: slow for small arrays because of the extra work after breaking out of the search loop, and still slow for large arrays because it's insertion sort, i.e. O(n^2). However, if the code outside the search loop can be made non-horrible, this could be useful as the small-array endpoint for qsort / mergesort.

Anyway, if anyone does develop this idea into actual debugged and working code, let us know.

update: Timothy Furtak's paper describes an SSE implementation for sorting short arrays (for use as a building block for bigger sorts, like this insertion sort). He suggests producing a partially-ordered result with SSE, and then doing a cleanup with scalar ops. (Insertion-sort on a mostly-sorted array is fast.)

Which leads us to:

Sorting Networks

There might not be any speedup here. Xiaochen, Rocki, and Suda only report a 3.7x speedup from scalar -> AVX-512 for 32-bit (int) elements, for single-threaded mergesort, on a Xeon Phi card. With wider elements, fewer fit in a vector reg. (That's a factor of 4 for us: 64b elements in 256b, vs. 32b elements in 512b.) They also take advantage of AVX-512 masks to only compare some lanes, a feature not available in AVX. Plus, with a slower comparator function that competes for the shuffle/blend unit, we're already in worse shape.

Sorting networks can be constructed using SSE/AVX packed-compare instructions. (More usually, with a pair of min/max instructions that effectively do a set of packed 2-element sorts.) Larger sorts can be built up out of an operation that does pairwise sorts. This paper by Tian Xiaochen, Kamil Rocki and Reiji Suda at U of Tokyo has some real AVX code for sorting (without payloads), and discussion of how it's tricky with vector registers: you can't compare two elements that are in the same register, so the sorting network has to be designed not to require that. They use pshufd to line up elements for the next comparison, building up a larger sort out of sorting just a few registers full of data.
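To make the "built up out of pairwise sorts" structure concrete, here's a hypothetical scalar model of a 4-input sorting network (the cswap and sort4_network names are mine). Each round is a fixed pattern of compare-exchanges; in a SIMD version each round would be one min/max (or cmp/blend) comparator after lining up elements with shuffles such as pshufd:

```c
/* One comparator: sort the pair (a, b) in place. */
static void cswap(float *a, float *b) {
    if (*a > *b) { float t = *a; *a = *b; *b = t; }
}

/* 4-element sorting network: 5 comparators in 3 rounds.
 * The pattern is fixed regardless of the data, which is what makes
 * it vectorizable (and branch-free in the min/max formulation). */
void sort4_network(float v[4]) {
    cswap(&v[0], &v[1]); cswap(&v[2], &v[3]);  /* round 1: independent pairs */
    cswap(&v[0], &v[2]); cswap(&v[1], &v[3]);  /* round 2: independent pairs */
    cswap(&v[1], &v[2]);                       /* round 3: middle pair */
}
```

Note no comparator ever compares two elements that started in the same "register half" of a round, which is the constraint the paper's network designs work around.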

Now, the trick is to do a sort of pairs of packed 64b elements, based on the comparison of only half an element. (i.e. keeping the payload with the sort key.) We could potentially sort other things this way, by sorting an array of (key, payload) pairs, where the payload can be an index or a 32-bit pointer (mmap(MAP_32bit), or x32 ABI).

So let's build ourselves a comparator. In sorting-network parlance, that's an operation that sorts a pair of inputs. So it either swaps a pair of elements between registers, or not.
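Here's a one-lane scalar model of that comparator (hypothetical high_key and comparator64 names), assuming the {a, b, f} struct order so the key is in the high 32 bits of each 64-bit word. It blends whole 64-bit structs by an all-ones/all-zeros mask from the key compare, which is what vcmpps + vblendvpd does across four lanes:

```c
#include <stdint.h>
#include <string.h>

/* Extract the float key from the high 32 bits of a packed struct
 * (little-endian {uint16 a, b; float f;} seen as one uint64). */
static float high_key(uint64_t u) {
    float f;
    uint32_t hi = (uint32_t)(u >> 32);
    memcpy(&f, &hi, sizeof f);
    return f;
}

/* Sort the pair (*x, *y) by key: afterwards high_key(*x) <= high_key(*y)
 * for non-NaN keys, with each payload travelling with its key. */
void comparator64(uint64_t *x, uint64_t *y) {
    uint64_t m  = -(uint64_t)(high_key(*x) < high_key(*y)); /* cmpps-style mask */
    uint64_t lo = (*x &  m) | (*y & ~m);  /* blend: x where x.f < y.f, else y */
    uint64_t hi = (*x & ~m) | (*y &  m);  /* the other struct */
    *x = lo;
    *y = hi;
}
```

The key living in the high dword is what lets one mask bit (the element's top bit, which is all vblendvpd looks at) select the whole 64-bit struct.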

# AVX comparator for SnB/IvB
# struct { uint16_t a, b; float f; }  inputs in ymm0, ymm1
# NOTE: struct order with f second saves a shuffle to extend the mask

vcmpps    ymm7, ymm0, ymm1, _CMP_LT_OQ  # imm8=17: less-than, ordered, quiet (non-signalling on NaN)
     # ymm7 32bit elements = 0xFFFFFFFF if ymm0[i] < ymm1[i], else 0
# vblendvpd checks the high bit of the 64b element, so mask *doesn't* need to be extended to the low32
vblendvpd ymm2, ymm1, ymm0, ymm7
vblendvpd ymm3, ymm0, ymm1, ymm7
# result: !(ymm2[i] > ymm3[i])  (i.e. ymm2[i] < ymm3[i], or they're equal or unordered (NaN).)
#  UNTESTED

You might need to set the MXCSR to make sure that int bits don't slow down your FP ops if they happen to represent a denormal or NaN float. I'm not sure if that happens only for mul/div, or if it would affect compare.

  • Intel Haswell: Latency: 5 cycles for ymm2 to be ready, 7 cycles for ymm3. Throughput: one per 4 cycles. (p5 bottleneck).
  • Intel Sandybridge/Ivybridge: Latency: 5 cycles for ymm2 to be ready, 6 cycles for ymm3. Throughput: one per 2 cycles. (p0/p5 bottleneck).
  • AMD Bulldozer/Piledriver: (vblendvpd ymm: 2c lat, 2c recip tput): lat: 4c for ymm2, 6c for ymm3. Or worse, with bypass delays between cmpps and blend. tput: one per 4c. (bottleneck on vector P1).
  • AMD Steamroller: (vblendvpd ymm: 2c lat, 1c recip tput): lat: 4c for ymm2, 5c for ymm3. Or maybe 1c higher because of bypass delays. tput: one per 3c (bottleneck on vector ports P0/1, for cmp and blend).

VBLENDVPD is 2 uops. (It has 3 reg inputs, so it can't be 1 uop :/). Both uops can only run on shuffle ports. On Haswell, that's only port 5. On SnB, that's p0/p5. (IDK why Haswell halved the shuffle / blend throughput compared to SnB/IvB.)

If AMD designs had 256b-wide vector units, their lower-latency FP compare and single-macro-op decoding of 3-input instructions would put them ahead.

The usual minps/maxps pair is 3 and 4 cycles latency (ymm2/3), and one per 2 cycles throughput (Intel). (p1 bottleneck on the FP add/sub/compare unit.) The most fair comparison is probably to sorting 64-bit doubles. The extra latency may hurt if there aren't multiple pairs of independent registers to be compared. The halved throughput on Haswell will cut into any speedups pretty heavily.

Also keep in mind that shuffles are needed between comparator operations to get the right elements lined up for comparison. min/maxps leave the shuffle ports unused, but my cmpps/blendv version saturates them, meaning the shuffling can't overlap with comparing, except as something to fill gaps left by data dependencies.

With hyperthreading, another thread that can keep the other ports busy (e.g. port 0/1 fp mul/add units, or integer code) would share a core quite nicely with this blend-bottlenecked version.

I attempted another version for Haswell, which does the blends "manually" using bitwise AND/OR operations. It ended up slower, though, because both sources have to get masked both ways before combining.

# AVX2 comparator for Haswell
# struct { float f; uint16_t a, b; }  inputs in ymm0, ymm1
#
vcmpps ymm7, ymm0, ymm1, _CMP_LT_OQ  # imm8=17: less-than, ordered, quiet (non-signalling on NaN)
     # ymm7 32bit elements = 0xFFFFFFFF if ymm0[i] < ymm1[i], else 0
vshufps ymm7, ymm7, ymm7, mask(0, 0, 2, 2)  # extend the mask to the payload part.  There's no mask function, I just don't want to work out the result in my head.
vpand    ymm10, ymm7, ymm0       # ymm10 = ymm0 keeping elements where ymm0[i] < ymm1[i]
vpandn   ymm11, ymm7, ymm1       # ymm11 = ymm1 keeping elements where !(ymm0[i] < ymm1[i])
vpor     ymm2, ymm10, ymm11      # ymm2 = min_packed_mystruct(ymm0, ymm1)

vpandn   ymm10, ymm7, ymm0       # ymm10 = ymm0 keeping elements where !(ymm0[i] < ymm1[i])
vpand    ymm11, ymm7, ymm1       # ymm11 = ymm1 keeping elements where ymm0[i] < ymm1[i]
vpor     ymm3, ymm10, ymm11  # ymm3 = max_packed_mystruct(ymm0, ymm1)

# result: !(ymm2[i] > ymm3[i])
#  UNTESTED

This is 8 uops, compared to 5 for the blendv version. There's a lot of parallelism in the last 6 and/andn/or instructions. cmpps has 3 cycle latency, though. I think ymm2 will be ready in 6 cycles, while ymm3 is ready in 7 (and can overlap with operations on ymm2). The insns following a comparator op will probably be shuffles, to put the data in the right elements for the next compare. There's no forwarding delay to/from the shuffle unit for integer-domain logicals, even for a vshufps, but the result should come out in the FP domain, ready for a vcmpps. Using vpand instead of vandps is essential for throughput.

Timothy Furtak's paper suggests an approach for sorting keys with a payload: don't pack the payload pointers with the keys; instead, generate a mask from the compare, and use it on both the keys and the payload the same way. This means you have to separate the payload from the keys, either in your data structure or every time you load a struct.

See the appendix of his paper (Fig. 12). He uses the standard min/max on the keys, and then uses cmpps to see which elements CHANGED. Then he ANDs that mask in the middle of an xor-swap, to end up only swapping the payloads for the keys that swapped.
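A one-lane scalar sketch of that trick (hypothetical sort2_furtak name; assumes non-NaN keys): min/max on the keys, then a mask from comparing min against the original first key gates an xor-swap of the payloads, so payloads move exactly when their keys did.

```c
#include <stdint.h>

/* Sort one (key, payload) pair against another, Furtak-style. */
void sort2_furtak(float *k0, float *k1, uint32_t *p0, uint32_t *p1) {
    float mn = (*k0 < *k1) ? *k0 : *k1;        /* minps */
    float mx = (*k0 < *k1) ? *k1 : *k0;        /* maxps */
    uint32_t noswap = -(uint32_t)(mn == *k0);  /* cmpeqps min, orig: all-ones if keys didn't move */
    uint32_t x = (*p0 ^ *p1) & ~noswap;        /* masked xor-swap amount */
    *p0 ^= x;
    *p1 ^= x;
    *k0 = mn;
    *k1 = mx;
}
```

Equal keys compare equal to the min, so the mask is all-ones and the payloads stay put, which is what you want.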

Unfortunately, original AVX has very limited shuffling across its 128-bit halves (i.e. lanes), so it is hard to sort the contents of a full 256-bit register. However, AVX2 has shuffling operations without such limitations, so we can perform a sort of 4 structs in a vectorized way.

I'll use the idea of this solution. In order to sort an array, we have to do enough element comparisons to surely determine the permutation we need to apply. Given that no element is NaN, it is enough to check, for each pair of different elements a and b, whether a < b and whether a > b. Having this information, we can fully compare any two elements, which must be enough to determine the final sorting order. This is 6 pairs of 32-bit elements and two comparison modes, so we can end up doing two shuffles and two comparisons in AVX. If you are absolutely sure that all the elements are distinct, then you can avoid the a > b comparisons and reduce the size of the LUT.

For repacking of elements within a register, we can use _mm256_permutevar8x32_ps. One instruction allows you to do an arbitrary shuffle at 32-bit granularity. Note that in the code I assume that the sorting key f is the first member of your struct (just as @PeterCordes proposed), but you can trivially use this solution for your current struct if you change the shuffling mask accordingly.

After we perform the comparisons, we have two AVX registers containing boolean results as 32-bit masks. The first six masks in each register are important; the last two are not. Then we want to convert these masks to a small integer in a general-purpose register, to be used as an index into a lookup table. In the general case we might have to create a perfect hash for it, but that is not necessary here. We can use _mm256_movemask_ps to get an 8-bit integer mask in a general-purpose register from an AVX register. Since the last two masks per register are not important, we can ensure that they are always zero. Then the resulting index will be in the range [0..2^12).

Finally, we load a shuffling mask from a precomputed LUT with 4096 elements and pass it to _mm256_permutevar8x32_ps. As a result, we obtain an AVX register with 4 properly sorted structs of your type. Precomputing the LUT is your home assignment =)
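Since the precomputation is left as an exercise, here is one possible scalar sketch (the build_lut name and int32_t[4096][8] layout are mine; each row is the 8 lane indices you'd load as one __m256i). It enumerates key patterns including ties, derives the same lt/gt movemask bits that the two shuffled compares produce, and stores the stable sorting permutation at index idxGt*64 + idxLt:

```c
#include <stdint.h>

/* Comparison pairs matching the two permutevar patterns:
 * bit i of the lt mask is (key[PA[i]] < key[PB[i]]) for the pairs
 * (a,b),(a,c),(a,d),(b,c),(b,d),(c,d); likewise for gt. */
static const int PA[6] = {0, 0, 0, 1, 1, 2};
static const int PB[6] = {1, 2, 3, 2, 3, 3};

int32_t lut[4096][8];   /* 4096 * 8 * 4 bytes = the 128 KB table */

void build_lut(void) {
    /* 4 keys, each drawn from {0,1,2,3}, covers every order pattern with ties. */
    for (int pat = 0; pat < 256; pat++) {
        int k[4], ord[4] = {0, 1, 2, 3};
        for (int i = 0; i < 4; i++) k[i] = (pat >> (2 * i)) & 3;
        int lt = 0, gt = 0;
        for (int i = 0; i < 6; i++) {
            if (k[PA[i]] < k[PB[i]]) lt |= 1 << i;
            if (k[PA[i]] > k[PB[i]]) gt |= 1 << i;
        }
        /* Stable insertion sort of the 4 struct indices by key. */
        for (int i = 1; i < 4; i++)
            for (int j = i; j > 0 && k[ord[j - 1]] > k[ord[j]]; j--) {
                int t = ord[j]; ord[j] = ord[j - 1]; ord[j - 1] = t;
            }
        /* Struct s occupies 32-bit lanes 2s (key) and 2s+1 (payload). */
        for (int s = 0; s < 4; s++) {
            lut[gt * 64 + lt][2 * s]     = 2 * ord[s];
            lut[gt * 64 + lt][2 * s + 1] = 2 * ord[s] + 1;
        }
    }
}
```

Several key patterns map to the same (lt, gt) index, but they all imply the same permutation, so the repeated stores are consistent.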

Here is the final code:

__m256i lut[4096];    //LUT of 128Kb size must be precomputed
__m256 Sort4(__m256 val) {
    __m256 aaabbcaa = _mm256_permutevar8x32_ps(val, _mm256_setr_epi32(0, 0, 0, 2, 2, 4, 0, 0));
    __m256 bcdcddaa = _mm256_permutevar8x32_ps(val, _mm256_setr_epi32(2, 4, 6, 4, 6, 6, 0, 0));
    __m256 cmpLt = _mm256_cmp_ps(aaabbcaa, bcdcddaa, _CMP_LT_OQ);
    __m256 cmpGt = _mm256_cmp_ps(aaabbcaa, bcdcddaa, _CMP_GT_OQ);
    int idxLt = _mm256_movemask_ps(cmpLt);
    int idxGt = _mm256_movemask_ps(cmpGt);
    __m256i shuf = lut[idxGt * 64 + idxLt];
    __m256 res = _mm256_permutevar8x32_ps(val, shuf);
    return res;
}

Here you can see the generated assembly. There are 14 instructions in total; 2 of them are for loading constant shuffling masks, and one of them is due to a useless 32-bit -> 64-bit conversion of the movemask results. So in a tight loop it would be 11-12 instructions. IACA says that four calls in a loop have 16.40 cycles throughput on Haswell, so it seems to achieve a throughput of 4.1 cycles per call.

Of course a 128 KB lookup table is too much, unless you are going to process even more input data in one batch. It may be possible to reduce the LUT size by adding perfect hashing (sacrificing speed, of course). It is hard to say how many orderings are possible on four elements, but clearly fewer than 4! * 2^3 = 192. I think a 256-element LUT is possible, maybe even a 128-element LUT. With perfect hashing, it may be faster to combine the two AVX registers into one with shift and xor, and then do _mm256_movemask_epi8 once (instead of doing two _mm256_movemask_ps and combining them afterwards).
