Newest questions tagged [sse2]

In SIMD (SSE2), many instructions are named like "_mm_set_epi8", "_mm_cmpgt_epi8", and so on. What do "mm" and "epi" mean?

I see many instructions with shorthand names such as "_mm_and_si128". I want to know what the "mm" means. ...

Access a value from __m128 in Rust by index

I have seen that it's rather simple in C to access values in a __m128 register by index. However, it is not possible to do that in Rust. How can I acc ...

AVX divide __m256i packed 32-bit integers by two (no AVX2)

I'm looking for the fastest way to divide an __m256i of packed 32-bit integers by two (aka shift right by one) using AVX. I don't have access to AVX2. ...

Can FP compares like SSE2 _mm_cmpeq_pd be used to compare 64 bit integers?

Can FP compares like SSE2 _mm_cmpeq_pd / AVX _mm_cmp_pd be used to compare 64 bit integers? The idea is to emulate missing _mm_cmpeq_epi64 that would ...
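For context on why this question comes up: floating-point compares treat some bit patterns specially (a NaN pattern never compares equal to itself, and +0.0 compares equal to -0.0 although the bits differ), so they are not a drop-in integer compare. A minimal sketch of an integer-only SSE2 alternative follows. The helper name `cmpeq_epi64_sse2` is made up for illustration and is not part of any intrinsics header.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: per-lane 64-bit integer equality using only SSE2.
   Each 64-bit lane of the result is all-ones if the lanes match, else zero. */
static inline __m128i cmpeq_epi64_sse2(__m128i a, __m128i b)
{
    /* Compare the four 32-bit halves independently. */
    __m128i eq32 = _mm_cmpeq_epi32(a, b);
    /* Swap the two 32-bit halves inside each 64-bit lane. */
    __m128i swapped = _mm_shuffle_epi32(eq32, _MM_SHUFFLE(2, 3, 0, 1));
    /* A 64-bit lane is equal only if both of its halves are equal. */
    return _mm_and_si128(eq32, swapped);
}
```

The result can be fed to `_mm_movemask_epi8` to get a scalar bitmask (8 bits per 64-bit lane).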

Is there a difference between SVML vs. normal intrinsic square root functions?

Is there any sort of difference in precision or performance between normal sqrtps/pd or the SVML version: I know that SVML Intrinsics like _mm_si ...

How to set an int32 value at some index within an m128i with only SSE2?

Is there an SSE2 intrinsic that can set a single int32 value within an m128i? Such as set value 1000 at index 1 on an m128i that already contains 1,2,3,4 ...
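One possible approach, sketched under assumptions: SSE2 has no 32-bit insert, but it does have the 16-bit `_mm_insert_epi16`, so a 32-bit element can be written as two 16-bit halves. The helper name `set_epi32_at_1` is hypothetical, and the index is fixed at 1 (counting from the lowest element) because the lane argument must be a compile-time constant.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: overwrite 32-bit element 1 of v with x using SSE2 only.
   Element 1 occupies 16-bit words 2 (low half) and 3 (high half). */
static inline __m128i set_epi32_at_1(__m128i v, int32_t x)
{
    uint32_t ux = (uint32_t)x;  /* unsigned so the shift below is well defined */
    v = _mm_insert_epi16(v, (int)(ux & 0xFFFFu), 2);  /* low 16 bits  */
    v = _mm_insert_epi16(v, (int)(ux >> 16), 3);      /* high 16 bits */
    return v;
}
```

For the values in the question, `set_epi32_at_1(_mm_setr_epi32(1, 2, 3, 4), 1000)` leaves elements 0, 2, and 3 untouched and replaces element 1.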

_mm_load_si128 loads data in reverse order

I am writing a C function with SSE2 intrinsics to essentially compare four 32-bit integers and check to see which are greater than zero, and give that re ...

The right way to use function _mm_clflush to flush a large struct

I am starting to use functions like _mm_clflush, _mm_clflushopt, and _mm_clwb. Say I have defined a struct named mystruct and its size is 256 B ...

Quick workaround for SSE2 movq instruction on non-SSE2 CPUs

How could I convert a movq SSE2 instruction into a simple code snippet which I could later patch into the original EXE which contained it? Please if you ...

How to best emulate the logical meaning of _mm_slli_si128 (128-bit bit-shift), not _mm_bslli_si128

Looking through the Intel Intrinsics Guide, I saw this instruction. Looking through the naming pattern, the meaning should be clear: "Shift 128-bit re ...

Better way to store or extract a scalar int result using SSE2 intrinsics

I'm wondering how to load and store variables efficiently when working with SSE2. In this example, I want to bench the pclmulqdq instruction (carry less mult ...

SSE2 registers in x86 assembly

I have the following code: Basically, I take a number from the user and then I want to calculate the factorial of this number using SSE2. The "factorial" par ...

What is the most efficient way to do unsigned 64 bit comparison on SSE2?

PCMPGTQ doesn't exist on SSE2 and doesn't natively work on unsigned integers. Our goal here is to provide backward-compatible solutions for unsigned 6 ...

SSE4.1 unsigned integer comparison with overflow

Is there any way to perform a comparison like C >= (A + B) with SSE2/4.1 instructions considering 16 bit unsigned addition (_mm_add_epi16()) can ov ...

How to simulate pcmpgtq on SSE2?

PCMPGTQ was introduced in SSE4.2, and it provides a signed greater-than comparison for 64-bit numbers that yields a mask. How does one support this f ...
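A commonly used emulation, shown here as a sketch rather than a definitive answer: decide on the high 32 bits with a signed compare, and when the high halves are equal, fall back to the borrow produced by a 64-bit subtraction. The helper name `cmpgt_epi64_sse2` is made up for illustration.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: signed 64-bit a > b per lane using only SSE2.
   Each 64-bit lane of the result is all-ones if a > b, else zero. */
static inline __m128i cmpgt_epi64_sse2(__m128i a, __m128i b)
{
    /* High dword of (b - a) is all-ones when the high halves are equal
       and the low half of a is larger (unsigned), because of the borrow. */
    __m128i r = _mm_and_si128(_mm_cmpeq_epi32(a, b), _mm_sub_epi64(b, a));
    /* Otherwise the signed compare of the high halves decides. */
    r = _mm_or_si128(r, _mm_cmpgt_epi32(a, b));
    /* Only the high dword of each lane is meaningful: broadcast it. */
    return _mm_shuffle_epi32(r, _MM_SHUFFLE(3, 3, 1, 1));
}
```

An unsigned variant can be built on top of this by flipping the sign bit of both inputs (XOR with 0x8000000000000000 in each lane) before the compare.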

How to add to a variable using SSE2?

How to "add to" a variable using SSE2? I've recently been working with SSE2 in C++ to optimize a few math functions, but ran into a problem when att ...

How do you do signed 32-bit widening multiplication on SSE2?

This question came up when reviewing the WebAssembly SIMD proposal for extended multiplication. To support older hardware, we need to support SSE2 an ...

How would you optimize this vectorized sum of harmonics?

I'm summing a bunch of harmonics together, with different phase/magnitude each, using vectorization (only SSE2 max as SIMD). Here's my current attempt: ...

How to copy X bytes or bits from an __m128i into standard memory

I have a loop that's adding int16s from two arrays together via _mm_add_epi16(). There's a small array and a large array, the results get written back ...

Test if any byte in an xmm register is 0

I am currently teaching myself SIMD and am writing a rather simple string-processing subroutine. I am however restricted to SSE2, which makes me unabl ...
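A minimal SSE2-only sketch of the test named in the title, assuming the goal is simply a yes/no answer: compare every byte against zero and collapse the result into a scalar bitmask. The helper name `has_zero_byte` is hypothetical.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: returns 1 if any of the 16 bytes in v is zero, else 0. */
static inline int has_zero_byte(__m128i v)
{
    /* Bytes equal to zero become 0xFF, all others become 0x00. */
    __m128i eq = _mm_cmpeq_epi8(v, _mm_setzero_si128());
    /* Gather the top bit of each byte into a 16-bit mask. */
    return _mm_movemask_epi8(eq) != 0;
}
```

If the position of the zero byte is needed (for example, to find the end of a string), the same mask can be kept and scanned for its lowest set bit instead of being reduced to a boolean.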