I need to load 4 bytes stored consecutively in an array into specific positions of a __m128i variable (one byte per 32-bit lane), so that I can do many int32_t sums, 4 at a time, storing all partial results.
For example:
#include <cstdint>
#include <emmintrin.h> // SSE2 intrinsics
const unsigned int SIZE = 2000000;
const unsigned int STEP = 100;
unsigned char* inBuffer = new unsigned char[SIZE];
//Fill inBuffer
const unsigned char* a = inBuffer;
int32_t* outBuffer = new int32_t[SIZE/STEP*4];
int32_t* result = outBuffer;
__m128i sum = _mm_setzero_si128();
for (unsigned int i = 0; i < SIZE; i += STEP) {
    // Widen 4 consecutive bytes into the four 32-bit lanes (a[0] in the lowest lane)
    __m128i value = _mm_set_epi32 (a[3],a[2],a[1],a[0]);
    sum = _mm_add_epi32(sum, value);          // accumulate the running sums
    _mm_storeu_si128((__m128i*)result, sum);  // store this partial result
    a += STEP;
    result += 4;
}
//Print outBuffer
delete[] inBuffer;
delete[] outBuffer;
I was wondering if there is a more efficient way to do this.
The main problem here of course is this line:
__m128i value = _mm_set_epi32 (a[3],a[2],a[1],a[0]);
However, a decent compiler should generate fairly efficient code for this. Take a look at the output (gcc -O3 -S ...): if it's more than a few instructions, then you may want to consider doing the load/unpack operations yourself.
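As a rough sketch of what "doing the load/unpack yourself" could look like, the following uses only SSE2: one 32-bit load followed by two zero-extending unpacks. The helper name load4_u8_to_epi32 is hypothetical (it is not from the question), and the code assumes a little-endian x86 target, so that a[0] ends up in the lowest lane, matching the _mm_set_epi32 line above.

```cpp
#include <cstdint>
#include <cstring>
#include <emmintrin.h> // SSE2 intrinsics

// Hypothetical helper: widen the 4 bytes at p into four 32-bit lanes,
// with p[0] in the lowest lane (assumes little-endian x86).
static inline __m128i load4_u8_to_epi32(const unsigned char* p) {
    std::int32_t packed;
    std::memcpy(&packed, p, sizeof packed);   // single unaligned 32-bit load
    __m128i v = _mm_cvtsi32_si128(packed);    // 4 bytes in the low 32 bits
    const __m128i zero = _mm_setzero_si128();
    v = _mm_unpacklo_epi8(v, zero);           // zero-extend 8-bit -> 16-bit
    return _mm_unpacklo_epi16(v, zero);       // zero-extend 16-bit -> 32-bit
}
```

In the loop this would replace the _mm_set_epi32 line with value = load4_u8_to_epi32(a). If SSE4.1 is available, the two unpacks can be replaced by a single _mm_cvtepu8_epi32 (from <smmintrin.h>). Whether either version actually beats the compiler's output is something to check in the generated assembly or with a benchmark.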