I have a question regarding the AVX _mm256_blend_pd function.
I want to optimize code that makes heavy use of the _mm256_blendv_pd function. Unfortunately, it has fairly high latency and low throughput. It takes three __m256d variables as input; the last one is the mask used to select between the first two.
I found another function (_mm256_blend_pd) which takes a bit mask instead of a __m256d variable as the mask. When the mask is static I could simply pass something like 0b0111 to take the highest element from the first variable and the lowest 3 elements from the second variable. However, in my case the mask is computed with the _mm256_cmp_pd function, which returns a __m256d variable. I found out that I can use _mm256_movemask_pd to get an int from the mask, but when I pass this int to _mm256_blend_pd I get the error "the last argument must be a 4-bit immediate".
Is there a way to pass this integer using its first 4 bits? Or is there another function similar to movemask that would allow me to use _mm256_blend_pd? Or is there another approach that avoids a cmp, movemask and blend and would be more efficient for this use case?
_mm256_blend_pd is the intrinsic for vblendpd, which takes its control operand as an immediate constant, embedded into the machine code of the instruction. (That's what "immediate" means in assembly / machine code terminology.) In C++ terms, the control arg must be constexpr so the compiler can embed it into the instruction at compile time. You can't use it for runtime-variable blends.
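To make the difference concrete, here is a minimal sketch in C. The helper names (blend_const_0111, select_lt_runtime) are made up for illustration, and the target attribute is a GCC/Clang extension used only so the sketch builds without -mavx; neither comes from the question. The first function uses a compile-time constant control, so vblendpd is usable; the second has a mask computed at runtime by a compare, so it has to use the variable blend.

```c
#include <immintrin.h>

/* Hypothetical helper: the control is a compile-time constant, so the
   compiler can encode it as the immediate of vblendpd.
   0x7 (binary 0111): elements 0..2 come from b, element 3 comes from a. */
__attribute__((target("avx")))
void blend_const_0111(const double *a, const double *b, double *out)
{
    __m256d va = _mm256_loadu_pd(a);
    __m256d vb = _mm256_loadu_pd(b);
    _mm256_storeu_pd(out, _mm256_blend_pd(va, vb, 0x7));
}

/* Hypothetical helper: out[i] = (x[i] < y[i]) ? a[i] : b[i].
   The mask is only known at runtime, so vblendvpd is needed. */
__attribute__((target("avx")))
void select_lt_runtime(const double *x, const double *y,
                       const double *a, const double *b, double *out)
{
    __m256d mask = _mm256_cmp_pd(_mm256_loadu_pd(x), _mm256_loadu_pd(y),
                                 _CMP_LT_OQ);
    /* blendv takes the second source where the mask's sign bit is set,
       so a goes second to be selected where x < y. */
    __m256d r = _mm256_blendv_pd(_mm256_loadu_pd(b), _mm256_loadu_pd(a), mask);
    _mm256_storeu_pd(out, r);
}
```

Note that the compare result feeds _mm256_blendv_pd directly; there is no need for _mm256_movemask_pd anywhere in this path.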
It's unfortunate that variable-blend instructions like vblendvpd are slower, but they're "only" 2 uops on Skylake, with 1 or 2 cycle latency (depending on which input you're measuring the critical path through); see uops.info. On Skylake those uops can run on any of the 3 vector ALU ports. (It's worse on Haswell/Broadwell, where they are limited to port 5 only and compete for it with shuffles.) Zen can even run them as a single uop.
There's nothing better for the general case until AVX512 makes masking a first-class operation you can do as part of other instructions, and gives us single-uop blend instructions like vblendmpd ymm0{k1}, ymm1, ymm2 (blend according to a mask register).
In some special cases you can usefully use _mm256_and_pd to conditionally zero instead of blending, e.g. to zero an input before an add instead of blending after.
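As a rough sketch of that trick in C: the function below adds x[i] to acc[i] only where x[i] < limit[i]. The function name and the particular condition are assumptions chosen for illustration, and the target attribute is a GCC/Clang extension so the sketch builds without -mavx. Because the compare result is all-ones or all-zeros per element, ANDing it with x either keeps x or produces +0.0, and adding +0.0 leaves the accumulator's value unchanged.

```c
#include <immintrin.h>

/* Hypothetical helper: acc[i] += (x[i] < limit[i]) ? x[i] : 0.0
   for 4 doubles, using AND + add instead of add + blendv. */
__attribute__((target("avx")))
void add_where_lt(double *acc, const double *x, const double *limit)
{
    __m256d vacc = _mm256_loadu_pd(acc);
    __m256d vx   = _mm256_loadu_pd(x);
    __m256d mask = _mm256_cmp_pd(vx, _mm256_loadu_pd(limit), _CMP_LT_OQ);

    /* Zero the input where the condition is false (bitwise AND with the
       all-ones / all-zeros compare result). */
    __m256d masked = _mm256_and_pd(vx, mask);

    _mm256_storeu_pd(acc, _mm256_add_pd(vacc, masked));
}
```

One caveat with this form: an accumulator element holding -0.0 becomes +0.0 after adding +0.0, which a blend after the add would not do. Whether that matters depends on the surrounding code.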
TL;DR: _mm256_blend_pd lets you use a faster instruction for the special case where the control is a compile-time constant.