
More aggressive optimization for FMA operations

I want to build a datatype that represents multiple (say N) values of an arithmetic type and provides the same interface as the arithmetic type via operator overloading, so that I get a datatype like Agner Fog's vectorclass.

Please look at this example: Godbolt

#include <array>

using std::size_t;

template<class T, size_t S>
class LoopSIMD : std::array<T,S>
{
public:
    friend LoopSIMD operator*(const T a, const LoopSIMD& x){
        LoopSIMD result;
        for(size_t i=0;i<S;++i)
            result[i] = a*x[i];
        return result;
    }

    LoopSIMD& operator +=(const LoopSIMD& x){
        for(size_t i=0;i<S;++i){
            (*this)[i] += x[i];
        }
        return *this;
    }
};

constexpr size_t N = 7;
typedef LoopSIMD<double,N> SIMD;

SIMD foo(double a, SIMD x, SIMD y){
    x += a*y;
    return x;
}

That seems to work pretty well up to a certain number of elements: 6 for gcc-10 and 27 for clang-11. For a larger number of elements the compilers no longer use FMA instructions (e.g. vfmadd213pd). Instead they perform the multiplications (e.g. vmulpd) and additions (e.g. vaddpd) separately.

Questions:

  • Is there a good reason for this behavior?
  • Is there a compiler flag with which I can increase the above-mentioned limits of 6 for gcc and 27 for clang?

Thank you!

I did the following and was able to get some pretty good results for gcc 10.2, with the same -Ofast -march=skylake -ffast-math as your godbolt link.

// Requires <algorithm> for std::transform.
friend LoopSIMD operator*(const T a, const LoopSIMD& x) {
    LoopSIMD result;
    std::transform(x.cbegin(), x.cend(), result.begin(),
                   [a](auto const& i) { return a * i; });
    return result;
}

LoopSIMD& operator+=(const LoopSIMD& x) {
    std::transform(this->cbegin(), this->cend(), x.cbegin(), this->begin(),
                   [](auto const& a, auto const& b) { return a + b; });
    return *this;
}

std::transform has quite a few overloads, so I think I need to explain.

The first call uses the unary overload: it captures a, multiplies each value of x by it, and writes the results into result, starting at its beginning.

The second call uses the binary overload: it acts as a zip, adding the corresponding values of x and *this together and storing the result back into *this.
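
As a standalone illustration of those two overloads (a hypothetical snippet, not part of the class):

#include <algorithm>
#include <array>

int main(){
    std::array<double, 4> x{1, 2, 3, 4}, y{10, 20, 30, 40}, r{};
    const double a = 2.0;

    // Unary overload: r[i] = a * x[i]
    std::transform(x.cbegin(), x.cend(), r.begin(),
                   [a](double xi){ return a * xi; });

    // Binary ("zip") overload: r[i] = x[i] + y[i]
    std::transform(x.cbegin(), x.cend(), y.cbegin(), r.begin(),
                   [](double xi, double yi){ return xi + yi; });
}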

If you're not married to operator+= and operator*, you can create your own fma like so:

// Requires <numeric> for std::transform_inclusive_scan (C++17).
// Note: this performs an inclusive scan, i.e. (*this)[i] becomes the sum of a*x[j] for j <= i.
LoopSIMD& fma(const LoopSIMD& x, double a){
    std::transform_inclusive_scan(
        x.cbegin(),
        x.cend(),
        this->begin(),
        std::plus{},
        [a](auto const& i){ return i * a; },
        0.0);
    return *this;
}

This requires C++17, but it keeps the FMA instruction in the loop:

foo(double, LoopSIMD<double, 40ul>&, LoopSIMD<double, 40ul> const&):
        xor     eax, eax
        vxorpd  xmm1, xmm1, xmm1
.L2:
        vfmadd231sd     xmm1, xmm0, QWORD PTR [rsi+rax]
        vmovsd  QWORD PTR [rdi+rax], xmm1
        add     rax, 8
        cmp     rax, 320
        jne     .L2
        ret
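
For reference, the listing above is what I'd expect from a harness along these lines; foo, its signature, and the fixed size of 40 are my reconstruction from the mangled name, not part of the original answer:

using SIMD40 = LoopSIMD<double, 40>;

// Hypothetical caller (assumption): reads from y and writes the scanned result into x.
void foo(double a, SIMD40& x, const SIMD40& y){
    x.fma(y, a);
}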

You could also simply make your own fma function:

#include <array>
#include <cmath>   // std::fma
#include <cstddef>

using std::size_t;

template<class T, size_t S>
class LoopSIMD : std::array<T,S>
{
public:
    friend LoopSIMD fma(const LoopSIMD& x, const T y, const LoopSIMD& z) {
        LoopSIMD result;
        for (size_t i = 0; i < S; ++i) {
            result[i] = std::fma(x[i], y, z[i]);
        }
        return result;
    }
    friend LoopSIMD fma(const T y, const LoopSIMD& x, const LoopSIMD& z) {
        LoopSIMD result;
        for (size_t i = 0; i < S; ++i) {
            result[i] = std::fma(y, x[i], z[i]);
        }
        return result;
    }
    // And more variants, taking `const LoopSIMD&, const LoopSIMD&, const T`, `const LoopSIMD&, const T, const T`, etc
};

SIMD foo(double a, SIMD x, SIMD y){
    return fma(a, y, x);
}
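
One of the additional variants mentioned in the comment above could look like this (a sketch, not from the original answer):

// Element-wise FMA with a LoopSIMD multiplier and a scalar addend:
// result[i] = x[i] * y[i] + z
friend LoopSIMD fma(const LoopSIMD& x, const LoopSIMD& y, const T z) {
    LoopSIMD result;
    for (size_t i = 0; i < S; ++i) {
        result[i] = std::fma(x[i], y[i], z);
    }
    return result;
}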

But to allow for better optimisations in the first place, you should align your array. Your original code optimises well if you do:

constexpr size_t next_power_of_2_not_less_than(size_t n) {
    size_t pow = 1;
    while (pow < n) pow *= 2;
    return pow;
}

template<class T, size_t S>
class LoopSIMD : std::array<T,S>
{
public:
    // operators
} __attribute__((aligned(next_power_of_2_not_less_than(sizeof(T[S])))));

// Or with a c++11 attribute
/*
template<class T, size_t S>
class [[gnu::aligned(next_power_of_2_not_less_than(sizeof(T[S])))]] LoopSIMD : std::array<T,S>
{
public:
    // operators
};
*/

SIMD foo(double a, SIMD x, SIMD y){
    x += a * y;
    return x;
}
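
For example, with the question's T = double and N = 7, sizeof(T[S]) is 56 bytes, so the class gets rounded up to a 64-byte alignment (a full cache line and the width of an AVX-512 register). A quick check, assuming the attribute-aligned definition above:

static_assert(next_power_of_2_not_less_than(sizeof(double[7])) == 64,
              "56 bytes rounds up to 64");
static_assert(alignof(LoopSIMD<double, 7>) == 64,
              "the aligned attribute raises the class alignment");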

I've found an improvement for the example given.

Adding #pragma omp simd before the loops, GCC manages to apply the FMA optimization up to N=71.

https://godbolt.org/z/Y3T1rs37W
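
Concretely, that means placing the pragma directly above each loop, roughly like this (a sketch of the question's operators; note that GCC only honors the pragma when compiling with -fopenmp or -fopenmp-simd):

friend LoopSIMD operator*(const T a, const LoopSIMD& x){
    LoopSIMD result;
    #pragma omp simd
    for(size_t i=0;i<S;++i)
        result[i] = a*x[i];
    return result;
}

LoopSIMD& operator+=(const LoopSIMD& x){
    #pragma omp simd
    for(size_t i=0;i<S;++i){
        (*this)[i] += x[i];
    }
    return *this;
}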

The size could be increased even further if AVX-512 is used:

https://godbolt.org/z/jWWPP7W5G
