
Why does increasing array alignment degrade performance?

I am trying to increase the alignment of an array in a synthetic test from 16 to 32, and performance degrades from ~4100 ms to ~4600 ms. How can higher alignment harm performance?

Below is the code I use for testing (I am trying to make use of AVX instructions here). Built with g++ test.cpp -O2 -ftree-vectorize -mavx2 (I have no AVX-512 support).

#include <chrono>
#include <iostream>
#include <memory>
#include <cassert>
#include <cstring>
#include <cstdlib>

using Time = std::chrono::time_point<std::chrono::system_clock>;
using Clock = std::chrono::system_clock;

template <typename Duration>
auto as_ms(Duration const& duration) {
    return std::chrono::duration_cast<std::chrono::milliseconds>(duration);
}

static const int repeats = 10000;

struct I {
    static const int size = 524288;
    int* pos;
    I() : pos(new int[size]) { for (int i = 0; i != size; ++i) { pos[i] = i; } }
    ~I() { delete[] pos; } // new[] must be paired with delete[]
};

static const int align = 16; // try changing this: 16 (~4100 ms) vs 32 (~4600 ms)

struct S {
    static const int size = I::size;
    alignas(align) float data[size];
    S() { for (int i = 0; i != size; ++i) { data[i] = (i * 7 + 11) % 2; } }
};

void foo(const I& p, S& a, S& b) {
    const int chunk = 32;
    alignas(align) float aprev[chunk];
    alignas(align) float anext[chunk];
    alignas(align) float bprev[chunk];
    alignas(align) float bnext[chunk];
    const int N = S::size / chunk;
    for (int j = 0; j != repeats; ++j) {
        for (int i = 1; i != N-1; i++) {
            int ind = p.pos[i] * chunk;
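            // ind is a multiple of chunk, so &a.data[ind-1] and &a.data[ind+1]
            // sit one float (4 bytes) off an aligned boundary: all four memcpy
            // sources are misaligned no matter what `align` is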
            std::memcpy(aprev, &a.data[ind-1], sizeof(float) * chunk);
            std::memcpy(anext, &a.data[ind+1], sizeof(float) * chunk);
            std::memcpy(bprev, &b.data[ind-1], sizeof(float) * chunk);
            std::memcpy(bnext, &b.data[ind+1], sizeof(float) * chunk);
            for (int k = 0; k < chunk; ++k) {
                int ind0 = ind + k;
                a.data[ind0] = (b.data[ind0] - 1.0f) * aprev[k] * a.data[ind0] * bnext[k] + a.data[ind0] * anext[k] * (bprev[k] - 1.0f);
            }
        }
    }
}

int main() {
    S a, b;
    I p;
    Time start = Clock::now();
    foo(p, a, b);
    Time end = Clock::now();
    std::cout << as_ms(end - start).count() << std::endl;
    float sum = 0;
    for (int i = 0; i != S::size; ++i) {
        sum += a.data[i];
    }
    return sum; // return the sum so the computation cannot be optimized away
}
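
As a quick sanity check (not in the original post), one can verify what alignment the arrays actually get at runtime; report_alignment below is a hypothetical helper, and with alignas on S the remainder for the requested boundary should come out as 0:

#include <cstdint>
#include <cstdio>

// Hypothetical helper: print a buffer's address and its remainder modulo
// 16/32/64 to confirm the alignment actually obtained at runtime.
static void report_alignment(const char* name, const void* p) {
    auto addr = reinterpret_cast<std::uintptr_t>(p);
    std::printf("%s: %p  mod16=%zu  mod32=%zu  mod64=%zu\n", name, p,
                static_cast<std::size_t>(addr % 16),
                static_cast<std::size_t>(addr % 32),
                static_cast<std::size_t>(addr % 64));
}

// Usage in main(), before the timing starts:
//   report_alignment("a.data", a.data);
//   report_alignment("b.data", b.data);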

Checking whether the cache is causing the problem:

valgrind --tool=cachegrind ./a.out

alignment = 16:

==4352== I   refs:      3,905,614,100
==4352== I1  misses:            1,626
==4352== LLi misses:            1,579
==4352== I1  miss rate:          0.00%
==4352== LLi miss rate:          0.00%
==4352== 
==4352== D   refs:      2,049,454,623  (1,393,712,296 rd   + 655,742,327 wr)
==4352== D1  misses:       66,707,929  (   66,606,998 rd   +     100,931 wr)
==4352== LLd misses:       66,681,897  (   66,581,942 rd   +      99,955 wr)
==4352== D1  miss rate:           3.3% (          4.8%     +         0.0%  )
==4352== LLd miss rate:           3.3% (          4.8%     +         0.0%  )
==4352== 
==4352== LL refs:          66,709,555  (   66,608,624 rd   +     100,931 wr)
==4352== LL misses:        66,683,476  (   66,583,521 rd   +      99,955 wr)
==4352== LL miss rate:            1.1% (          1.3%     +         0.0%  )

alignment = 32:

==4426== I   refs:      2,857,165,049
==4426== I1  misses:            1,604
==4426== LLi misses:            1,560
==4426== I1  miss rate:          0.00%
==4426== LLi miss rate:          0.00%
==4426== 
==4426== D   refs:      1,558,058,149  (967,779,295 rd   + 590,278,854 wr)
==4426== D1  misses:       66,706,930  ( 66,605,998 rd   +     100,932 wr)
==4426== LLd misses:       66,680,898  ( 66,580,942 rd   +      99,956 wr)
==4426== D1  miss rate:           4.3% (        6.9%     +         0.0%  )
==4426== LLd miss rate:           4.3% (        6.9%     +         0.0%  )
==4426== 
==4426== LL refs:          66,708,534  ( 66,607,602 rd   +     100,932 wr)
==4426== LL misses:        66,682,458  ( 66,582,502 rd   +      99,956 wr)
==4426== LL miss rate:            1.5% (        1.7%     +         0.0%  )

Note that the absolute D1/LLd miss counts are almost identical in the two runs (~66.7 million); the miss rate is higher for alignment = 32 only because that build executes fewer data references overall. So the cache does not seem to be the problem.


Checking that the problem is not caused by Turbo Boost:

alignment: 16 --> 32

with Turbo Boost enabled: ~4100 ms --> ~4600 ms

with Turbo Boost disabled: ~5000 ms --> ~5400 ms

The slowdown persists either way, so Turbo Boost is not the cause.
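
For reference, on Linux with the intel_pstate driver (an assumption about the test machine, not stated in the post), Turbo Boost can be toggled via sysfs:

echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo   # 1 disables turbo, 0 re-enables it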

Not an answer, but I did some measurements with GNU g++ 6.4.0 and Intel icpc 18.0.1 on a Haswell E5-2680v3. All the times were very consistent, with deviations of only a few milliseconds:

  1. g++ -O2 -mavx2 -ftree-vectorize align=16 : 6.99 [s]
  2. g++ -O2 -mavx2 -ftree-vectorize align=32 : 6.67 [s]
  3. g++ -O3 -mavx2 -ftree-vectorize align=16 : 6.72 [s]
  4. g++ -O3 -mavx2 -ftree-vectorize align=32 : 6.60 [s]
  5. g++ -O2 -march=haswell -ftree-vectorize align=16 : 6.45 [s]
  6. g++ -O2 -march=haswell -ftree-vectorize align=32 : 6.45 [s]
  7. g++ -O3 -march=haswell -ftree-vectorize align=16 : 6.44 [s]
  8. g++ -O3 -march=haswell -ftree-vectorize align=32 : 6.44 [s]
  9. icpc -O2 -xCORE-AVX2 align=16 : 3.67 [s]
  10. icpc -O2 -xCORE-AVX2 align=32 : 3.51 [s]
  11. icpc -O3 -xCORE-AVX2 align=16 : 3.65 [s]
  12. icpc -O3 -xCORE-AVX2 align=32 : 3.59 [s]

Conclusions:

  • For GCC, -march=haswell was faster than -mavx2, and with -march=haswell the results were practically insensitive to optimization level and alignment.
  • ICC provided much lower runtimes / much higher performance than GCC.
  • For ICC, proper alignment helped, but only slightly.
  • Interestingly, ICC at the higher optimization level produced slightly higher runtimes; this might be attributed to larger code size and therefore more instruction cache misses.

In my experience, ICC is generally much better than GCC, at least for vectorization. I haven't studied the generated machine code, but one can do so, e.g., on Godbolt, to understand why ICC is so much faster in this case.
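
To inspect the generated code locally instead, one can dump the assembly (here reusing the flags from the original build command, which is an assumption about the setup):

g++ -O2 -mavx2 -ftree-vectorize -S test.cpp -o test.s

The vectorized inner loop, and any alignment-dependent peeling if present, should be visible around the label for foo.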
