
convert array of uint64_t to __m256i

I have four uint64_t numbers and I wish to combine them as parts of a __m256i; however, I'm lost as to how to go about this.

Here's one attempt (where rax, rbx, rcx, and rdx are uint64_t):

uint64_t a[4] = {rax, rbx, rcx, rdx};
__m256i t = _mm256_load_si256((__m256i *) &a);

If you already have an array, then yes, absolutely use _mm256_loadu_si256 (or even the aligned version, _mm256_load_si256, if your array is alignas(32)). But generally don't create an array just to store into / reload from.
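
For example, a minimal sketch (assuming AVX is enabled; the function name is just illustrative) of loading from an array that already exists in memory:

#include <immintrin.h>
#include <stdint.h>

/* _mm256_loadu_si256 has no alignment requirement, so it is safe for any
   valid pointer to at least four contiguous uint64_t values. */
__m256i from_existing_array(const uint64_t *p) {
    return _mm256_loadu_si256((const __m256i *) p);
}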


Use the _mm_set intrinsics and let the compiler decide how to do it. Note that they take their args with the highest-numbered element first, e.g.:

__m256i vt = _mm256_set_epi64x(rdx, rcx, rbx, rax);
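
To make the element order concrete (a small added example, not part of the original answer), storing vt back to memory puts rax in the lowest element:

uint64_t out[4];
_mm256_storeu_si256((__m256i *) out, vt);
/* out[0] == rax, out[1] == rbx, out[2] == rcx, out[3] == rdx */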

You typically don't want the asm to look anything like your scalar store -> vector load C source, because that would produce a store-forwarding stall.

gcc 6.1 "sees through" the local array in this case (and uses 2x vmovq / 2x vpinsrq / 1x vinserti128), but it still generates code to align the stack to 32B. (Even though it's not needed, because it didn't end up needing any 32B-aligned locals.)

As you can see on the Godbolt Compiler Explorer, the actual data-movement part of both ways is the same, but the array way has a bunch of wasted instructions that gcc failed to optimize away after deciding to avoid the bad way that the source was implying.
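
For reference, here is a self-contained version of the two approaches being compared (a sketch with illustrative function names, not the exact Godbolt source; compile with -mavx or similar):

#include <immintrin.h>
#include <stdint.h>

/* What the question's source implies: store to a local array, then reload. */
__m256i via_array(uint64_t rax, uint64_t rbx, uint64_t rcx, uint64_t rdx) {
    uint64_t a[4] = {rax, rbx, rcx, rdx};
    return _mm256_loadu_si256((const __m256i *) a);
}

/* Let the compiler pick the loads/inserts itself. */
__m256i via_set(uint64_t rax, uint64_t rbx, uint64_t rcx, uint64_t rdx) {
    return _mm256_set_epi64x(rdx, rcx, rbx, rax);
}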

_mm256_set_epi64x works in 32-bit code (with gcc at least). You get 2x vmovq and 2x vmovhps to do 64-bit loads to the upper half of an xmm register. (Add -m32 to the compile options in the Godbolt link.)

Firstly, make sure your CPU even supports these AVX instructions: Performing AVX integer operation.
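
One way to do that check at run time (a sketch assuming GCC or Clang, which provide __builtin_cpu_supports; MSVC would need __cpuid instead):

#include <stdio.h>

int main(void) {
    /* Query the running CPU for AVX support before using 256-bit intrinsics. */
    if (__builtin_cpu_supports("avx"))
        puts("AVX supported");
    else
        puts("AVX not supported");
    return 0;
}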

Secondly, from https://software.intel.com/en-us/node/514151, the pointer argument must be an aligned location. Conventionally allocated memory addresses on the stack are random and depend on the sizes of stack frames from previous calls, so they may not be aligned.

Instead, just use the intrinsic type __m256i to force the compiler to align it; or, according to https://software.intel.com/en-us/node/582952, use __declspec(align) on your array a.
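
As a sketch of the aligned-array variant (using C11 alignas from stdalign.h; __declspec(align(32)) is the MSVC spelling mentioned above):

#include <immintrin.h>
#include <stdalign.h>
#include <stdint.h>

/* alignas(32) guarantees the 32-byte alignment that _mm256_load_si256 requires. */
static alignas(32) uint64_t a[4];

__m256i load_a(void) {
    return _mm256_load_si256((const __m256i *) a);
}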
