Array vs pointer auto-vectorization in gcc

I'm trying to use auto-vectorization with g++ 5.4 ( -ftree-vectorize ). I noticed that in the array version of the code below, something causes the compiler to miss the vectorization opportunity in the inner loop, resulting in a significant performance difference compared to the pointer version. Is there anything that can be done to help the compiler in this case?

void floydwarshall(float* mat, size_t n) {
#if USE_POINTER
    for (int k = 0; k < n; ++k) {
        for (int i = 0; i < n; ++i) {
            auto v = mat[i*n + k];
            for (int j = 0; j < n; ++j) {
                auto val = v + mat[k*n+j];
                if (mat[i*n + j] > val) {
                    mat[i*n + j] = val;
                }
            }
        }
    }
#else // USE_ARRAY
    typedef float (*array)[n];
    array m = reinterpret_cast<array>(mat);
    for (int k = 0; k < n; ++k) {
        for (int i = 0; i < n; ++i) {
            auto v = m[i][k];
            for (int j = 0; j < n; ++j) {
                auto val = v + m[k][j];
                if (m[i][j] > val) {
                    m[i][j] = val;
                }
            }
        }
    }
#endif
}

Both versions do vectorize with g++5.4 -O3 -march=haswell, using vcmpltps/vmaskmovps in the inner loop, for the reason Marc points out.
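
For reference, a rough hand-written intrinsics equivalent of that masked-store pattern might look like this (my own illustration, not code from the question; the helper name inner_masked is made up, and it assumes n is a multiple of 8):

#include <immintrin.h>
#include <cstddef>

// Sketch of what vcmpltps + vmaskmovps do for one (i,k) pair,
// processing 8 floats per iteration. Assumes n % 8 == 0.
static void inner_masked(float* mat, std::size_t n, std::size_t i, std::size_t k) {
    __m256 v = _mm256_set1_ps(mat[i*n + k]);
    for (std::size_t j = 0; j < n; j += 8) {
        __m256 val = _mm256_add_ps(v, _mm256_loadu_ps(&mat[k*n + j]));     // v + m[k][j]
        __m256 old = _mm256_loadu_ps(&mat[i*n + j]);                       // m[i][j]
        __m256 lt  = _mm256_cmp_ps(val, old, _CMP_LT_OQ);                  // vcmpltps: val < m[i][j]?
        _mm256_maskstore_ps(&mat[i*n + j], _mm256_castps_si256(lt), val);  // vmaskmovps: store only those lanes
    }
}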

If you're not letting the compiler use AVX instructions, it would have a harder time. But I don't see either version vectorize at all if I just use -O3 (so only SSE2 is available, since it's baseline for x86-64). So your original question is based on a result I can't reproduce.

Changing the if() to a ternary operator (so the code always stores to the array) lets the compiler load / MINPS / store unconditionally. This is a lot of memory traffic if your matrix doesn't fit in cache; maybe you can arrange your loops a different way? Or maybe not, since m[i][k] is needed, and I assume it matters what order things happen in.

If updates are very infrequent and write-back of dirty data is contributing to a memory bottleneck, it might even be worth branching to avoid the store if none of the vector elements were modified.
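
For illustration (again my own sketch, not from the original answer, and untested for whether it actually pays off), the body of the intrinsics loop above could branch on the compare mask and skip the store when nothing changed:

        __m256 lt = _mm256_cmp_ps(val, old, _CMP_LT_OQ);   // which lanes need an update?
        if (_mm256_movemask_ps(lt)) {                       // any lane at all? if not, don't dirty the cache line
            _mm256_storeu_ps(&mat[i*n + j], _mm256_min_ps(old, val));
        }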


Here's an array version that vectorizes well, even with just SSE2. I added code to tell the compiler the input is aligned and the size is a multiple of 8 (the number of floats per AVX vector). If your real code can't make these assumptions, then take that part out. It makes the vectorized part easier to find, because it's not buried in scalar intro/cleanup code. (Using -O2 -ftree-vectorize doesn't fully unroll the cleanup code this way, but -O3 does.)

I notice that without AVX, gcc still uses unaligned loads but aligned stores. Maybe it's not realizing that the size being a multiple of 8 should make m[k][j] aligned if m[i][j] is aligned? This might be the difference between the pointer version and the array version.
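
One thing that might be worth trying (my speculation, not something from the original answer, and I haven't checked whether it changes the generated loads) is to also assert the alignment of the k-th row before the inner loop and index through that pointer instead of m[k][j]; rowk is just an illustrative name:

    // Speculative extra hint: if mat is 32-byte aligned and n % 8 == 0,
    // then every row starts on a 32-byte boundary, so tell gcc that too.
    float* rowk = static_cast<float*>(__builtin_assume_aligned(&m[k][0], 32));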

code on the Godbolt compiler explorer

void floydwarshall_array_unconditional(float* mat, size_t n) {

    // try to tell gcc that it doesn't need scalar intro/outro code
    // The vectorized inner loop isn't particularly different without these, but it means less wading through scalar cleanup code (and less bloat if you can use this in your real code).

    // works with gcc6, doesn't work with gcc5.4
    mat = (float*)__builtin_assume_aligned(mat, 32);
    n /= 8;
    n *= 8;         // code is simpler if matrix size is always a multiple of 8 (floats per AVX vector)

    typedef float (*array)[n];
    array m = reinterpret_cast<array>(mat);

    for (size_t k = 0; k < n; ++k) {
        for (size_t i = 0; i < n; ++i) {
            auto v = m[i][k];
            for (size_t j = 0; j < n; ++j) {
                auto val = v + m[k][j];
                m[i][j] = (m[i][j]>val) ? val : m[i][j];   // Marc's suggested change: enables vectorization with unconditional stores.
            }
        }
    }
}

gcc5.4 doesn't manage to avoid the scalar intro/cleanup code around the vectorized part, but gcc6.2 does. The vectorized part is basically the same from both compiler versions.

## The inner-most loop (with gcc6.2 -march=haswell -O3)
.L5:
    vaddps  ymm0, ymm1, YMMWORD PTR [rsi+rax]
    vminps  ymm0, ymm0, YMMWORD PTR [rdx+rax]     #### Note use of minps and unconditional store, enabled by using the ternary operator instead of if().
    add     r14, 1
    vmovaps YMMWORD PTR [rdx+rax], ymm0
    add     rax, 32
    cmp     r14, r13
    jb      .L5

The next loop outside that does some integer counter checking (using some setcc stuff), and does vmovss xmm1, DWORD PTR [rax+r10*4] and a separate vbroadcastss ymm1, xmm1. Presumably the scalar cleanup that it's jumping to doesn't need the broadcast, and gcc doesn't know that it would be cheaper overall just to use VBROADCASTSS as a load even when the broadcast part isn't needed.
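
In hand-written intrinsics you'd simply fold the broadcast into the load, something along these lines (my illustration, not gcc's output):

    // Load-and-broadcast v = m[i][k] with a single vbroadcastss ymm, m32,
    // instead of a scalar vmovss followed by a separate vbroadcastss ymm, xmm.
    __m256 v = _mm256_broadcast_ss(&m[i][k]);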
