
Segmentation fault caused by SIMD gather?

My project uses SIMD gather to accelerate table lookups. The following is a simplified version, but it is enough to show the issue I encountered.

#include <x86intrin.h>
#include <stdio.h>

alignas(32) static int a[256][8] = { 0 };

int main(){
    // initialize 32 bytes (as a __m256i)
    int *s = (int*)_mm_malloc(32, 4);
    for(int i=0; i<8; i++)
        s[i] = i;

    __m256i *t = (__m256i*)s;
    // do table lookup task using SIMD gather
    for(int i=0; i<100000; i++){
        int *addr = a[i % 256];
        t[0] = _mm256_i32gather_epi32(addr, t[0], 4);
    }

    // print out the result
    for(int i=0; i<8; i++)
        printf("%d ", s[i]);
    printf("\n");
}

Compilation and execution

user@server:~/test$ g++ -O3 -mavx2 gather.cpp 
user@server:~/test$ ./a.out
Segmentation fault (core dumped)

Actually, there is an alternative version using SIMD shuffle with __m128i, which works normally. Does anyone have an idea what causes the crash?

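_mm_malloc(size_t size, size_t align) - you're only aligning by 4, then doing an alignment-required dereference of a __m256i*. Dereferencing a __m256i* promises the compiler 32-byte alignment, so it is free to use an aligned load or store (vmovdqa), which faults on a misaligned address. Presumably that segfaults when _mm_malloc(32, 4) happens to return memory that isn't aligned by 32.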

Just use _mm256_set_epi32(7,6,5,4,3,2,1,0); like a normal person, or alignas(32) a local array that you can init in a loop. (And/or you can use _mm256_loadu_si256 to do an unaligned load).

You could fix your code by using _mm_malloc(32, 32), but don't. It's very silly to dynamically allocate (and then leak) a single 32-byte object that you only want for local use.
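As a minimal sketch of the suggestions above, the question's loop can keep the index vector in a register and skip the heap allocation entirely. The function name lookup_with_gather and the out parameter are made up for illustration. The target("avx2") attribute is a GCC/Clang extension that is only there so the snippet also builds without -mavx2; it still needs an AVX2-capable CPU to run.

```cpp
#include <immintrin.h>

alignas(32) static int a[256][8] = { 0 };

// Hypothetical helper: runs the question's gather loop and writes the
// final 8 ints to out. Not part of the original code.
__attribute__((target("avx2")))
void lookup_with_gather(int out[8]) {
    // Index vector 0..7 built directly in a register.
    // _mm256_set_epi32 takes the highest element first.
    __m256i idx = _mm256_set_epi32(7, 6, 5, 4, 3, 2, 1, 0);

    for (int i = 0; i < 100000; i++) {
        const int *addr = a[i % 256];
        idx = _mm256_i32gather_epi32(addr, idx, 4);
    }

    // Unaligned store: out only needs the natural alignment of int.
    _mm256_storeu_si256((__m256i *)out, idx);
}
```

With the all-zero table from the question this writes eight zeros to out, since the first gather already replaces every index with 0.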


Prefer shuffle over gather when all the data comes from one or two 32-byte chunks

An 8-element gather costs about as much as 8 scalar or vector loads, in terms of cache accesses, plus some work for other execution units (see https://uops.info/ and https://agner.org/optimize/). Gather doesn't get more efficient when multiple elements come from the same cache line, unfortunately.

In your case you don't even need a shuffle, just a 32-byte load from a part of a[][].

int *addr = a[i % 256]; gets a pointer to a 32-byte aligned int[8], from which you can _mm256_load_si256((const __m256i*)addr). That gives you the elements in the 0..7 native order you want.
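A rough sketch of that plain load, assuming the same table a as in the question and a non-negative i. The helper name load_row and its out parameter are hypothetical, and the target("avx2") attribute (GCC/Clang-specific) is only there so the snippet builds without -mavx2.

```cpp
#include <immintrin.h>

alignas(32) static int a[256][8] = { 0 };

// Hypothetical helper: copies row (i % 256) of the table into out with
// one aligned 32-byte load instead of an 8-element gather.
// Assumes i >= 0 so that i % 256 is a valid row index.
__attribute__((target("avx2")))
void load_row(int i, int out[8]) {
    // Each row is 8 ints = 32 bytes, and the table itself is alignas(32),
    // so every row starts on a 32-byte boundary.
    const int *addr = a[i % 256];
    __m256i v = _mm256_load_si256((const __m256i *)addr);

    // out is a plain int array, so use the unaligned store.
    _mm256_storeu_si256((__m256i *)out, v);
}
```

The aligned load is safe here only because of how a is declared; for a pointer of unknown alignment, _mm256_loadu_si256 would be the cautious choice.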

If you did want orders other than 0..7, use AVX2 vpermd ( _mm256_permutevar8x32_epi32 ) with the same shuffle-control vector constant you were using as gather indices.
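The vpermd variant might look like the sketch below, again assuming the question's table a and a non-negative i. The helper load_row_permuted and its order and out parameters are invented for illustration; order plays the role of the vector that would otherwise be passed to the gather as indices. The target("avx2") attribute is a GCC/Clang extension used so the snippet builds without -mavx2.

```cpp
#include <immintrin.h>

alignas(32) static int a[256][8] = { 0 };

// Hypothetical helper: out[k] = a[i % 256][order[k]], with each order[k]
// expected to be in 0..7. Assumes i >= 0.
__attribute__((target("avx2")))
void load_row_permuted(int i, const int order[8], int out[8]) {
    // One aligned 32-byte load brings in the whole row.
    __m256i row = _mm256_load_si256((const __m256i *)a[i % 256]);

    // Shuffle-control vector; the same values a gather would use as indices.
    __m256i ctl = _mm256_loadu_si256((const __m256i *)order);

    // vpermd: element k of the result is row[ctl[k] & 7].
    __m256i v = _mm256_permutevar8x32_epi32(row, ctl);

    _mm256_storeu_si256((__m256i *)out, v);
}
```

In a hot loop the control vector would normally be loaded once outside the loop (or built with _mm256_set_epi32) rather than reloaded on every call.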
