
Using the blend instructions in Intel intrinsics (AVX)

I have a question regarding the AVX _mm256_blend_pd function.

I want to optimize code that makes heavy use of the _mm256_blendv_pd function, which unfortunately has fairly high latency and low throughput. It takes three __m256d arguments, where the last one is the mask used to select elements from the first two.

I found another function, _mm256_blend_pd, which takes a bit mask instead of a __m256d variable as the mask. When the mask is static I could simply pass something like 0b0111 to take the first three elements from the second variable and the last element from the first. However, in my case the mask is computed with _mm256_cmp_pd, which returns a __m256d. I found that I can use _mm256_movemask_pd to turn the mask into an int, but when I pass that into _mm256_blend_pd I get "error: the last argument must be a 4-bit immediate".

Is there a way to pass this integer using its first 4 bits? Or is there another function similar to movemask that would allow me to use _mm256_blend_pd ? Or is there another approach I can use to avoid having a cmp, movemask and blend that would be more efficient for this use case?
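
Here is a simplified sketch of what I am currently doing (the function and variable names are just illustrative):

    #include <immintrin.h>

    // Per-element minimum: works, but vblendvpd is the slow part.
    __m256d select_smaller(__m256d a, __m256d b) {
        __m256d mask = _mm256_cmp_pd(a, b, _CMP_LT_OQ);  // mask is only known at runtime
        return _mm256_blendv_pd(b, a, mask);             // take a where a < b, else b

        // What I tried instead does not compile:
        //   int m = _mm256_movemask_pd(mask);
        //   return _mm256_blend_pd(b, a, m);  // error: the last argument must be a 4-bit immediate
    }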

_mm256_blend_pd is the intrinsic for vblendpd which takes its control operand as an immediate constant, embedded into the machine code of the instruction. (That's what "immediate" means in assembly / machine code terminology.)

In C++ terms, the control arg must be constexpr so the compiler can embed it into the instruction at compile time. You can't use it for runtime-variable blends.
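
For illustration, a compile-time control (e.g. a literal or a template parameter) works, while a plain int argument does not; this is just a sketch, not code from the question:

    #include <immintrin.h>

    // OK: imm8 is a compile-time constant, so it can be encoded into the vblendpd instruction.
    template <int imm8>
    __m256d blend_const(__m256d a, __m256d b) {
        return _mm256_blend_pd(a, b, imm8);
    }

    // Not OK: the control is a runtime value, and there is no instruction encoding for that.
    // __m256d blend_var(__m256d a, __m256d b, int control) {
    //     return _mm256_blend_pd(a, b, control);  // compile error: must be an immediate
    // }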

It's unfortunate that variable-blend instructions like vblendvpd are slower, but they're "only" 2 uops on Skylake, with 1 or 2 cycle latency (depending on which input you're measuring the critical path through). ( uops.info ). And on Skylake those uops can run on any of the 3 vector ALU ports. (Worse on Haswell/Broadwell, though, limited to port 5 only, competing for it with shuffles). Zen can even run them as a single uop.

There's nothing better for the general case until AVX512 makes masking a first-class operation you can do as part of other instructions, and gives us single-uop blend instructions like vblendmpd ymm0{k1}, ymm1, ymm2 (blend according to a mask register).
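
For example (a sketch, needs AVX-512VL), the compare can write a mask register directly and the blend becomes a single instruction:

    #include <immintrin.h>

    // Compare straight into a k register, then a single vblendmpd.
    __m256d min_avx512(__m256d a, __m256d b) {
        __mmask8 k = _mm256_cmp_pd_mask(a, b, _CMP_LT_OQ);  // bit i of k = (a[i] < b[i])
        return _mm256_mask_blend_pd(k, b, a);               // take a where the bit is set, else b
    }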

In some special cases you can use _mm256_and_pd to conditionally zero instead of blending, e.g. to zero an input before an add instead of blending after the add, as sketched below.
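
A rough sketch of that idea, a conditional add that zeroes the addend with an AND instead of blending the sum afterwards:

    #include <immintrin.h>

    // acc += x, but only in lanes where x > threshold.
    // Zero x in the other lanes instead of blending acc and acc+x afterwards.
    __m256d conditional_add(__m256d acc, __m256d x, __m256d threshold) {
        __m256d mask   = _mm256_cmp_pd(x, threshold, _CMP_GT_OQ);  // all-ones where true, all-zeros where false
        __m256d x_kept = _mm256_and_pd(x, mask);                   // x where the condition holds, +0.0 elsewhere
        return _mm256_add_pd(acc, x_kept);                         // adding +0.0 leaves a lane unchanged
                                                                   // (caveat: -0.0 + +0.0 gives +0.0)
    }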


TL:DR: _mm256_blend_pd lets you use a faster instruction for the special case where the control is a compile-time constant.
