Newest questions tagged [avx2]

Extracting edges of AVX2 16x16 bitmatrix

Is there a relatively cheap way to extract the four edges (rows 0 and 15, and columns 0 and 15) of a 16x16 bitmatrix stored in a __m256i into four 16b ...

How would I define the __m256i data type in Ada?

I am trying to write a library for AVX2 in Ada 2012 using the GNAT GCC compiler. I have currently defined a data type Vec_256_Integer_32 like so: N ...

Unpacking real and imaginary parts of complex numbers into separate ymm registers

I need to read a sequence of complex single precision numbers, stored like [real1, imag1, real2, imag2, ...] into ymm registers and unpack them such t ...

What is the best way to loop with AVX over an uneven, non-aligned array?

If the array length is not divisible by 8 (for integers), what is the best way to write the loop for it? One possible way I have figured out so far is to divide it into 2 se ...

How to analyze the instructions pipelining on Zen4 for AVX-512 packed double computations? (backend bound)

I got access to an AMD Zen4 server and tested AVX-512 packed double performance. I chose the harmonic series, Sum[1/n over positive integers], and compared ...

How to do mask / conditional / branchless arithmetic operations in AVX2

I understand how to do general arithmetic operations in AVX2. However, there are conditional operations in scalar code I would like to translate to AV ...

How could I conditionally create a __m256d of -1.0 and +1.0 according to __m256i elements being even or odd?

I am quite new to compiler intrinsics. I have 4 uint64_t integers stored in a __m256i, and I would like to get a __m256d res = {1.0, ...

Why on earth would I want to use PMULHRSW/VPMULHRSW?

I was looking for an appropriate AVX2 multiplication instruction to use in my code, and came across the vpmulhrsw (_mm256_mulhrs_epi16(__m256i a, __m2 ...

Quickest way to shift/rotate byte vector with SIMD

I have an AVX2 (256-bit) SIMD vector of bytes that is padded with zeros at the front and the back and looks like this: [0, 2, 3, ..., 4, 5, 0, 0, 0]. Th ...

Efficient transpose of 2D nibble matrix?

Given a 2D 4x8 nibble matrix represented as a 16-byte uint8_t array: for every pair of nibbles i, j, the byte is computed like so: (j << 4) | i. ...

Compare two 128-bit values with AVX512

I need to compare two 128-bit unsigned long long values a, b on my computer (i7-11700). I need to find out whether a is greater than or equal to b or ...

-march=haswell vs -march=core-avx2 vs -mavx2

Title says it all. What are the differences and tradeoffs between -march=haswell, -march=core-avx2, and -mavx2 for compiling avx2 intrinsics? I know ...

Which mobile Windows devices don't support AVX2?

I understand that Intel's AVX2 extension has been on the market since 2011 and is therefore pretty much standard in modern devices. However, for some dec ...

Is uops.info wrong about vinserti128?

According to uops.info, the reciprocal throughput of vinserti128 is 0.5 if the xmm argument comes from memory, and 1 if the xmm argument is a register ...

Why don't gcc/clang vectorize 128-bit SIMD intrinsics into 256-bit when possible?

Suppose I have this function: Clang and gcc both produce 256-bit SIMD when compiled with -O3 -march=core-avx2 (godbolt). Now suppose I have this f ...

Why is masking needed before using a pshufb shuffle as a lookup table for nibbles?

This code comes from https://github.com/WojciechMula/sse-popcount/blob/master/popcnt-avx2-lookup.cpp. The code is used to replace the builtin_popcn ...

Is there any intrinsic in AVX2 similar to _mm_min_round_ss in AVX-512?

I'm a beginner working with AVX2, and I would like to use an intrinsic that provides the same functionality as _mm_min_round_ss in AVX- ...

Does icc -xCORE-AVX2 prevent the use of AVX512 instructions on Xeon Gold if -O3 is on?

As per the title: will programs compiled with the Intel compiler using icc -O3 -xCORE-AVX2 program.cpp generate AVX512 instructions on a Xeon Gold ...

Horizontal min on an AVX2 8-float register, shuffling paired registers alongside

After a ray vs. triangle intersection test in 8-wide SIMD, I'm left with updating t, u and v, which I've done in scalar below (find lowest t and updating ...

AVX2 - storing integers at arbitrary indices in an array

I am looking for an intrinsic function that can take the 8 32-bit integers in an avx2 register and store them each at their own index in an array (ess ...