
Using the blend instructions in Intel intrinsics (AVX)

I have a question regarding the AVX _mm256_blend_pd function.

I want to optimize code that makes heavy use of the _mm256_blendv_pd function, which unfortunately has fairly high latency and low throughput. It takes three __m256d arguments, where the last one is the mask used to select elements from the first two.

I found another function, _mm256_blend_pd, which takes a bit mask instead of a __m256d variable as the mask. When the mask is static I could simply pass something like 0b0111 to take the first three elements from the second variable and the last element from the first. However, in my case the mask is computed with _mm256_cmp_pd, which returns a __m256d. I found that I can use _mm256_movemask_pd to turn the mask into an int, but when I pass that into _mm256_blend_pd I get "error: the last argument must be a 4-bit immediate".

Is there a way to pass this integer using its first 4 bits? Or is there another function similar to movemask that would allow me to use _mm256_blend_pd ? Or is there another approach I can use to avoid having a cmp, movemask and blend that would be more efficient for this use case?
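
Here is a simplified sketch of what I am currently doing (the function and variable names are just illustrative):

    #include <immintrin.h>

    // Per-element minimum: works, but vblendvpd is the slow part.
    __m256d select_smaller(__m256d a, __m256d b) {
        __m256d mask = _mm256_cmp_pd(a, b, _CMP_LT_OQ);  // mask is only known at runtime
        return _mm256_blendv_pd(b, a, mask);             // take a where a < b, else b

        // What I tried instead does not compile:
        //   int m = _mm256_movemask_pd(mask);
        //   return _mm256_blend_pd(b, a, m);  // error: the last argument must be a 4-bit immediate
    }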

_mm256_blend_pd is the intrinsic for vblendpd which takes its control operand as an immediate constant, embedded into the machine code of the instruction. (That's what "immediate" means in assembly / machine code terminology.)

In C++ terms, the control arg must be constexpr so the compiler can embed it into the instruction at compile time. You can't use it for runtime-variable blends.
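
For illustration, a compile-time control (e.g. a literal or a template parameter) works, while a plain int argument does not; this is just a sketch, not code from the question:

    #include <immintrin.h>

    // OK: imm8 is a compile-time constant, so it can be encoded into the vblendpd instruction.
    template <int imm8>
    __m256d blend_const(__m256d a, __m256d b) {
        return _mm256_blend_pd(a, b, imm8);
    }

    // Not OK: the control is a runtime value, and there is no instruction encoding for that.
    // __m256d blend_var(__m256d a, __m256d b, int control) {
    //     return _mm256_blend_pd(a, b, control);  // compile error: must be an immediate
    // }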

It's unfortunate that variable-blend instructions like vblendvpd are slower, but they're "only" 2 uops on Skylake, with 1 or 2 cycle latency (depending on which input you're measuring the critical path through). ( uops.info ). And on Skylake those uops can run on any of the 3 vector ALU ports. (Worse on Haswell/Broadwell, though, limited to port 5 only, competing for it with shuffles). Zen can even run them as a single uop.

There's nothing better for the general case until AVX512 makes masking a first-class operation you can do as part of other instructions, and gives us single-uop blend instructions like vblendmpd ymm0{k1}, ymm1, ymm2 (blend according to a mask register).
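
For example (a sketch, needs AVX-512VL), the compare can write a mask register directly and the blend becomes a single instruction:

    #include <immintrin.h>

    // Compare straight into a k register, then a single vblendmpd.
    __m256d min_avx512(__m256d a, __m256d b) {
        __mmask8 k = _mm256_cmp_pd_mask(a, b, _CMP_LT_OQ);  // bit i of k = (a[i] < b[i])
        return _mm256_mask_blend_pd(k, b, a);               // take a where the bit is set, else b
    }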

In some special cases you can use _mm256_and_pd to conditionally zero instead of blending, e.g. to zero an input before an add instead of blending after the add, as sketched below.
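
A rough sketch of that idea, a conditional add that zeroes the addend with an AND instead of blending the sum afterwards:

    #include <immintrin.h>

    // acc += x, but only in lanes where x > threshold.
    // Zero x in the other lanes instead of blending acc and acc+x afterwards.
    __m256d conditional_add(__m256d acc, __m256d x, __m256d threshold) {
        __m256d mask   = _mm256_cmp_pd(x, threshold, _CMP_GT_OQ);  // all-ones where true, all-zeros where false
        __m256d x_kept = _mm256_and_pd(x, mask);                   // x where the condition holds, +0.0 elsewhere
        return _mm256_add_pd(acc, x_kept);                         // adding +0.0 leaves a lane unchanged
                                                                   // (caveat: -0.0 + +0.0 gives +0.0)
    }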


TL:DR: _mm256_blend_pd lets you use a faster instruction for the special case where the control is a compile-time constant.
