Using the blend instructions in Intel intrinsics (AVX)
I have a question regarding the AVX _mm256_blend_pd function.
I want to optimize my code, which makes heavy use of the _mm256_blendv_pd function. Unfortunately, this function has fairly high latency and low throughput. It takes three __m256d variables as input, where the last one is the mask used to select between the first two.
I found another function, _mm256_blend_pd, which takes a bit mask instead of a __m256d variable as the mask. When the mask is static I could simply pass something like 0b0111 to take the first element from the first variable and the last 3 elements of the second variable. However, in my case the mask is computed using the _mm256_cmp_pd function, which returns a __m256d variable. I found out that I can use _mm256_movemask_pd to get an int from the mask, but when I pass this into _mm256_blend_pd I get the error "error: the last argument must be a 4-bit immediate".
Is there a way to pass this integer using its first 4 bits? Or is there another function similar to movemask that would allow me to use _mm256_blend_pd? Or is there another approach, more efficient for this use case, that avoids needing a cmp, a movemask and a blend?
_mm256_blend_pd is the intrinsic for vblendpd, which takes its control operand as an immediate constant, embedded into the machine code of the instruction. (That's what "immediate" means in assembly / machine code terminology.)
In C++ terms, the control arg must be a constant expression (constexpr) so the compiler can embed it into the instruction at compile time. You can't use it for runtime-variable blends.
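To make the distinction concrete, here is a minimal sketch (not part of the original answer) that puts the two intrinsics side by side. It assumes GCC or Clang on x86-64; the AVX_FN macro, the constant kTakeLow3FromB and the function names blend_const and blend_runtime are made up for illustration.

```cpp
#include <cassert>
#include <immintrin.h>

// Assumption: GCC or Clang on x86-64. The target attribute enables AVX for
// these functions only, so no -mavx flag is needed. With other compilers,
// drop the macro and enable AVX on the command line instead.
#define AVX_FN __attribute__((target("avx")))

// Compile-time control: bits 0..2 set, so elements 0..2 come from b and
// element 3 comes from a.
constexpr int kTakeLow3FromB = 0b0111;

// vblendpd: the control is a constant expression, so the compiler can encode
// it as the instruction's immediate operand.
AVX_FN void blend_const(const double* a, const double* b, double* out) {
    assert(a && b && out);
    __m256d va = _mm256_loadu_pd(a);
    __m256d vb = _mm256_loadu_pd(b);
    _mm256_storeu_pd(out, _mm256_blend_pd(va, vb, kTakeLow3FromB));
}

// vblendvpd: the control is only known at run time (it comes from a compare),
// so it stays in a vector register.
// out[i] = (a[i] < b[i]) ? b[i] : a[i]
AVX_FN void blend_runtime(const double* a, const double* b, double* out) {
    assert(a && b && out);
    __m256d va = _mm256_loadu_pd(a);
    __m256d vb = _mm256_loadu_pd(b);
    __m256d mask = _mm256_cmp_pd(va, vb, _CMP_LT_OQ);
    _mm256_storeu_pd(out, _mm256_blendv_pd(va, vb, mask));
}
```

Passing the int returned by _mm256_movemask_pd as the third argument of _mm256_blend_pd fails for exactly this reason: its value is not known until run time, so there is nothing to encode into the instruction.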
It's unfortunate that variable-blend instructions like vblendvpd are slower, but they're "only" 2 uops on Skylake, with 1 or 2 cycle latency (depending on which input you're measuring the critical path through); see uops.info. On Skylake those uops can run on any of the 3 vector ALU ports. (It's worse on Haswell/Broadwell, though: limited to port 5 only, competing for it with shuffles.) Zen can even run them as a single uop.
There's nothing better for the general case until AVX512, which makes masking a first-class operation you can do as part of other instructions, and gives us single-uop blend instructions like vblendmpd ymm0{k1}, ymm1, ymm2 (blend according to a mask register).
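For reference, a rough sketch of what the mask-register version could look like with intrinsics. It assumes GCC or Clang on x86-64 and a CPU with AVX-512F and AVX-512VL; the macro AVX512VL_FN and the helpers blend_mask_reg and cpu_has_avx512vl are hypothetical names, not something from the question.

```cpp
#include <cassert>
#include <immintrin.h>

// Assumption: GCC or Clang on x86-64. The attribute enables AVX-512F + VL for
// the one function that needs it, so no extra compiler flags are required.
#define AVX512VL_FN __attribute__((target("avx512f,avx512vl")))

// out[i] = (a[i] < b[i]) ? b[i] : a[i]
AVX512VL_FN void blend_mask_reg(const double* a, const double* b, double* out) {
    assert(a && b && out);
    __m256d va = _mm256_loadu_pd(a);
    __m256d vb = _mm256_loadu_pd(b);
    // The compare writes a mask register (one bit per element), not a vector.
    __mmask8 k = _mm256_cmp_pd_mask(va, vb, _CMP_LT_OQ);
    // vblendmpd: take vb where the mask bit is set, va elsewhere.
    _mm256_storeu_pd(out, _mm256_mask_blend_pd(k, va, vb));
}

// Hypothetical helper: lets callers fall back to the AVX path on older CPUs.
inline bool cpu_has_avx512vl() {
    return __builtin_cpu_supports("avx512f") &&
           __builtin_cpu_supports("avx512vl");
}
```

Call blend_mask_reg only after cpu_has_avx512vl() returns true; on a CPU without these extensions the instruction would fault.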
In some special cases you can usefully use _mm256_and_pd to conditionally zero instead of blending, e.g. to zero an input before an add instead of blending after.
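As an illustration of that special case, here is a small sketch of a conditional accumulate written both ways. The scenario (add x[i] to acc[i] only where x[i] < limit[i]) and the names AVX_FN, add_if_blend and add_if_and are assumptions chosen for the example; GCC or Clang on x86-64 is assumed for the target attribute.

```cpp
#include <cassert>
#include <immintrin.h>

// Assumption: GCC or Clang on x86-64; enables AVX per function, no -mavx needed.
#define AVX_FN __attribute__((target("avx")))

// Blend after the add: acc[i] = (x[i] < limit[i]) ? acc[i] + x[i] : acc[i]
AVX_FN void add_if_blend(double* acc, const double* x, const double* limit) {
    assert(acc && x && limit);
    __m256d vacc = _mm256_loadu_pd(acc);
    __m256d vx = _mm256_loadu_pd(x);
    __m256d mask = _mm256_cmp_pd(vx, _mm256_loadu_pd(limit), _CMP_LT_OQ);
    __m256d sum = _mm256_add_pd(vacc, vx);
    _mm256_storeu_pd(acc, _mm256_blendv_pd(vacc, sum, mask));
}

// Zero before the add: acc[i] += (x[i] & mask[i]).
// The compare result is all-ones or all-zeros per element, so the AND keeps
// x[i] where the condition holds and produces +0.0 elsewhere.
// Caveat: adding +0.0 turns an accumulator value of -0.0 into +0.0, which the
// blend version would have left untouched.
AVX_FN void add_if_and(double* acc, const double* x, const double* limit) {
    assert(acc && x && limit);
    __m256d vacc = _mm256_loadu_pd(acc);
    __m256d vx = _mm256_loadu_pd(x);
    __m256d mask = _mm256_cmp_pd(vx, _mm256_loadu_pd(limit), _CMP_LT_OQ);
    __m256d masked = _mm256_and_pd(vx, mask);
    _mm256_storeu_pd(acc, _mm256_add_pd(vacc, masked));
}
```

The second version replaces the variable blend with a plain bitwise AND, which is a single cheap uop. This only works because zero is the identity for addition; the same trick does not carry over to an arbitrary blend.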
TL;DR: _mm256_blend_pd lets you use a faster instruction for the special case where the control is a compile-time constant.