

Using the blend instructions in Intel intrinsics (AVX)

Original post: 2020-05-21 02:07:29 | Tags: c++, c, intrinsics, avx, immediate-operand

I have a question regarding the AVX _mm256_blend_pd function.

I want to optimize code that makes heavy use of the _mm256_blendv_pd function. Unfortunately, it has fairly high latency and low throughput. The function takes three __m256d variables as input, where the last one is the mask used to select between the first two.

I found another function ( _mm256_blend_pd ) which takes a bit mask instead of a __m256d variable as the mask. When the mask is static I can simply pass something like 0b0111 to take the highest element from the first variable and the lowest 3 elements from the second variable. In my case, however, the mask is computed with _mm256_cmp_pd, which returns a __m256d variable. I found that I can use _mm256_movemask_pd to get an int from the mask, but when I pass that int to _mm256_blend_pd I get the error "the last argument must be a 4-bit immediate".

Is there a way to pass this integer using its low 4 bits? Or is there another function similar to movemask that would allow me to use _mm256_blend_pd ? Or is there a more efficient approach for this use case that avoids the cmp, movemask and blend sequence?

1 Answer

_mm256_blend_pd is the intrinsic for vblendpd, which takes its control operand as an immediate constant, embedded into the machine code of the instruction. (That's what "immediate" means in assembly / machine code terminology.)

In C++ terms, the control arg must be a compile-time constant (constexpr) so the compiler can embed it into the instruction at compile time. You can't use it for runtime-variable blends.
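To illustrate the compile-time-constant case, here is a minimal sketch. The helper name blend_low3_from_b, the AVX_FN macro and the demo function are made up for this example, and the target attribute assumes GCC or Clang (with other compilers you would enable AVX through build flags instead).

```cpp
#include <immintrin.h>
#include <cassert>

// Assumption: GCC/Clang. The attribute enables AVX for one function without -mavx.
#if defined(__GNUC__)
#define AVX_FN __attribute__((target("avx")))
#else
#define AVX_FN
#endif

// Hypothetical helper: out[0..2] come from b, out[3] comes from a.
// Control bit i set means "take element i from the second operand".
AVX_FN void blend_low3_from_b(const double* a, const double* b, double* out) {
    __m256d va = _mm256_loadu_pd(a);
    __m256d vb = _mm256_loadu_pd(b);
    // 0x7 == 0b0111 is a literal, so it can be encoded as the immediate.
    __m256d r = _mm256_blend_pd(va, vb, 0x7);
    _mm256_storeu_pd(out, r);
}

// Small usage check for the helper above.
void blend_low3_demo() {
    const double a[4] = {1.0, 2.0, 3.0, 4.0};
    const double b[4] = {10.0, 20.0, 30.0, 40.0};
    double out[4];
    blend_low3_from_b(a, b, out);
    assert(out[0] == 10.0 && out[1] == 20.0 && out[2] == 30.0 && out[3] == 4.0);
}
```

Passing an int variable (for example the result of _mm256_movemask_pd) in place of 0x7 is what triggers the "4-bit immediate" error from the question.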

It's unfortunate that variable-blend instructions like vblendvpd are slower, but they're "only" 2 uops on Skylake, with 1 or 2 cycle latency (depending on which input you're measuring the critical path through; see uops.info). On Skylake those uops can run on any of the 3 vector ALU ports. (It's worse on Haswell/Broadwell, though: they are limited to port 5 only, competing for it with shuffles.) Zen can even run them as a single uop.
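For the runtime-variable case, the compare result can be fed straight into the variable blend, with no movemask step in between. A minimal sketch, assuming GCC or Clang for the target attribute; select_where_greater, AVX_FN and the demo are names invented for illustration:

```cpp
#include <immintrin.h>
#include <cassert>

// Assumption: GCC/Clang. The attribute enables AVX for one function without -mavx.
#if defined(__GNUC__)
#define AVX_FN __attribute__((target("avx")))
#else
#define AVX_FN
#endif

// Hypothetical helper: out[i] = (a[i] > b[i]) ? x[i] : y[i]
AVX_FN void select_where_greater(const double* a, const double* b,
                                 const double* x, const double* y, double* out) {
    // Each element of mask is all-ones where a > b, all-zeros otherwise.
    __m256d mask = _mm256_cmp_pd(_mm256_loadu_pd(a), _mm256_loadu_pd(b), _CMP_GT_OQ);
    // blendv takes the second operand where the mask element's sign bit is set.
    __m256d r = _mm256_blendv_pd(_mm256_loadu_pd(y), _mm256_loadu_pd(x), mask);
    _mm256_storeu_pd(out, r);
}

// Small usage check for the helper above.
void select_where_greater_demo() {
    const double a[4] = {1.0, 5.0, 3.0, 0.0};
    const double b[4] = {2.0, 4.0, 3.0, -1.0};
    const double x[4] = {10.0, 20.0, 30.0, 40.0};
    const double y[4] = {-1.0, -2.0, -3.0, -4.0};
    double out[4];
    select_where_greater(a, b, x, y, out);
    assert(out[0] == -1.0 && out[1] == 20.0 && out[2] == -3.0 && out[3] == 40.0);
}
```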

There's nothing better for the general case until AVX512, which makes masking a first-class operation you can do as part of other instructions, and gives us single-uop blend instructions like vblendmpd ymm0{k1}, ymm1, ymm2 (blend according to a mask register).
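With intrinsics, the AVX512 version could look roughly like the sketch below. It assumes a CPU with AVX512F and AVX512VL (needed for 256-bit masked operations) and GCC or Clang for the target attribute and the runtime feature check; the helper, macro and demo names are invented for this example.

```cpp
#include <immintrin.h>
#include <cassert>

// Assumption: GCC/Clang. The attribute enables AVX512F+VL for one function.
#if defined(__GNUC__)
#define AVX512VL_FN __attribute__((target("avx512f,avx512vl")))
#else
#define AVX512VL_FN
#endif

// Hypothetical helper: out[i] = (a[i] > b[i]) ? x[i] : y[i], using a mask register.
AVX512VL_FN void select_where_greater_avx512(const double* a, const double* b,
                                             const double* x, const double* y,
                                             double* out) {
    // The compare writes one bit per element into a k mask register.
    __mmask8 k = _mm256_cmp_pd_mask(_mm256_loadu_pd(a), _mm256_loadu_pd(b), _CMP_GT_OQ);
    // mask_blend takes the second vector operand where the mask bit is set.
    __m256d r = _mm256_mask_blend_pd(k, _mm256_loadu_pd(y), _mm256_loadu_pd(x));
    _mm256_storeu_pd(out, r);
}

// Usage check; returns early on CPUs without AVX512F/VL.
void select_where_greater_avx512_demo() {
#if defined(__GNUC__)
    if (!__builtin_cpu_supports("avx512f") || !__builtin_cpu_supports("avx512vl"))
        return;
#endif
    const double a[4] = {1.0, 5.0, 3.0, 0.0};
    const double b[4] = {2.0, 4.0, 3.0, -1.0};
    const double x[4] = {10.0, 20.0, 30.0, 40.0};
    const double y[4] = {-1.0, -2.0, -3.0, -4.0};
    double out[4];
    select_where_greater_avx512(a, b, x, y, out);
    assert(out[0] == -1.0 && out[1] == 20.0 && out[2] == -3.0 && out[3] == 40.0);
}
```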

In some special cases you can usefully use _mm256_and_pd to conditionally zero instead of blending, e.g. to zero an input before an add instead of blending after.
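As a sketch of that special case, a conditional accumulate can AND the addend with the compare mask before the add. The names add_where_greater, AVX_FN and the demo are hypothetical, and the target attribute assumes GCC or Clang.

```cpp
#include <immintrin.h>
#include <cassert>

// Assumption: GCC/Clang. The attribute enables AVX for one function without -mavx.
#if defined(__GNUC__)
#define AVX_FN __attribute__((target("avx")))
#else
#define AVX_FN
#endif

// Hypothetical helper: out[i] = acc[i] + ((a[i] > b[i]) ? x[i] : 0.0)
AVX_FN void add_where_greater(const double* acc, const double* a, const double* b,
                              const double* x, double* out) {
    __m256d mask = _mm256_cmp_pd(_mm256_loadu_pd(a), _mm256_loadu_pd(b), _CMP_GT_OQ);
    // AND with all-ones keeps x[i]; AND with all-zeros yields +0.0.
    __m256d addend = _mm256_and_pd(_mm256_loadu_pd(x), mask);
    _mm256_storeu_pd(out, _mm256_add_pd(_mm256_loadu_pd(acc), addend));
}

// Small usage check for the helper above.
void add_where_greater_demo() {
    const double acc[4] = {100.0, 200.0, 300.0, 400.0};
    const double a[4] = {1.0, 5.0, 3.0, 0.0};
    const double b[4] = {2.0, 4.0, 3.0, -1.0};
    const double x[4] = {10.0, 20.0, 30.0, 40.0};
    double out[4];
    add_where_greater(acc, a, b, x, out);
    assert(out[0] == 100.0 && out[1] == 220.0 && out[2] == 300.0 && out[3] == 440.0);
}
```

One difference from blending after the add: an unselected element equal to -0.0 becomes +0.0 here, because -0.0 + 0.0 is +0.0 in the default rounding mode.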

TL;DR: _mm256_blend_pd lets you use a faster instruction for the special case where the control is a compile-time constant.