Alignment attribute to force aligned load/store in auto-vectorization of GCC/Clang

It is well known that GCC and Clang auto-vectorize loops using SIMD instructions.

It is also known that the standard C++ alignas() specifier, among other uses, allows aligning stack variables, as in the following code:

Try it online!

#include <cstdint>
#include <iostream>

int main() {
    alignas(1024) int x[3] = {1, 2, 3};
    // A reference to the same array, also declared with alignas(1024).
    alignas(1024) int (&y)[3] = *(&x);

    // Print each address modulo 1024 and modulo 16384.
    std::cout << uint64_t(&x) % 1024 << " "
        << uint64_t(&x) % 16384 << std::endl;
    std::cout << uint64_t(&y) % 1024 << " "
        << uint64_t(&y) % 16384 << std::endl;
}

Outputs:

0 9216
0 9216

which means that both x and y are aligned on the stack to 1024 bytes, but not to 16384 bytes.

Let's now look at another piece of code:

Try it online!

#include <cstdint>

void f(uint64_t * x, uint64_t * y) {
    for (int i = 0; i < 16; ++i)
        x[i] ^= y[i];
}

When compiled on GCC with the flags -std=c++20 -O3 -mavx512f, it produces the following assembly (excerpt):

        vmovdqu64       zmm1, ZMMWORD PTR [rdi]
        vpxorq  zmm0, zmm1, ZMMWORD PTR [rsi]
        vmovdqu64       ZMMWORD PTR [rdi], zmm0
        vmovdqu64       zmm0, ZMMWORD PTR [rsi+64]
        vpxorq  zmm0, zmm0, ZMMWORD PTR [rdi+64]
        vmovdqu64       ZMMWORD PTR [rdi+64], zmm0

which performs two rounds of AVX-512 unaligned load + XOR + unaligned store. So we can see that our 64-bit array-XOR operation was auto-vectorized by GCC into AVX-512 registers, and the loop was unrolled as well.

My question is: how do I tell GCC that the pointers x and y provided to the function are both aligned to 64 bytes, so that instead of the unaligned load (vmovdqu64) shown above, it uses the aligned load (vmovdqa64)? It is known that aligned load/store can be considerably faster.

My first attempt to force GCC to do aligned load/store was the following code:

Try it online!

#include <cstdint>

void  g(uint64_t (&x_)[16],
        uint64_t const (&y_)[16]) {

    alignas(64) uint64_t (&x)[16] = x_;
    alignas(64) uint64_t const (&y)[16] = y_;

    for (int i = 0; i < 16; ++i)
        x[i] ^= y[i];
}

but this code still produces the unaligned load (vmovdqu64), same as the assembly of the previous snippet. Hence this alignas(64) hint gives GCC nothing useful to improve the generated code.

My question is: how do I force GCC to do aligned auto-vectorization, short of manually writing SIMD intrinsics for all operations like _mm512_load_epi64()?
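For reference, here is a minimal sketch of that manual-intrinsics version (illustrative only; the function name f_intrinsics is made up here, and it requires <immintrin.h> and -mavx512f):

#include <cstdint>
#include <immintrin.h>

// Illustrative sketch: the manual-intrinsics version the question wants to
// avoid. The aligned load/store intrinsics fault at run time if x or y is
// not actually 64-byte aligned.
void f_intrinsics(uint64_t * x, uint64_t * y) {
    for (int i = 0; i < 16; i += 8) {               // 8 uint64_t per ZMM register
        __m512i vx = _mm512_load_epi64(x + i);      // aligned 64-byte load
        __m512i vy = _mm512_load_epi64(y + i);
        _mm512_store_epi64(x + i, _mm512_xor_epi64(vx, vy));  // aligned store
    }
}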

If possible, I need solutions for all of GCC/Clang/MSVC.

Though not portable to all compilers, __builtin_assume_aligned will tell GCC to assume the pointers are aligned.
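As a side note (not part of the original answer): C++20 standardized this hint as std::assume_aligned in <memory>, which is worth trying when portability matters. A minimal sketch:

#include <cstdint>
#include <memory>

// C++20 standard alignment hint: returns a pointer the compiler may assume
// is 64-byte aligned. Whether the optimizer exploits it varies by compiler.
void f(uint64_t * x_, uint64_t * y_) {
    uint64_t * x = std::assume_aligned<64>(x_);
    uint64_t * y = std::assume_aligned<64>(y_);

    for (int i = 0; i < 16; ++i)
        x[i] ^= y[i];
}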

I often use a different, more portable strategy based on a helper struct:

#include <array>
#include <bit>       // std::popcount; the original answer used its own mccl::hammingweight helper here
#include <cstddef>
#include <cstdint>

template<size_t Bits>
struct alignas(Bits/8) uint64_block_t
{
    static const size_t bits = Bits;
    static const size_t size = bits/64;

    std::array<uint64_t,size> v;

    uint64_block_t& operator&=(const uint64_block_t& v2) { for (size_t i = 0; i < size; ++i) v[i] &= v2.v[i]; return *this; }
    uint64_block_t& operator^=(const uint64_block_t& v2) { for (size_t i = 0; i < size; ++i) v[i] ^= v2.v[i]; return *this; }
    uint64_block_t& operator|=(const uint64_block_t& v2) { for (size_t i = 0; i < size; ++i) v[i] |= v2.v[i]; return *this; }
    uint64_block_t operator&(const uint64_block_t& v2) const { uint64_block_t tmp(*this); return tmp &= v2; }
    uint64_block_t operator^(const uint64_block_t& v2) const { uint64_block_t tmp(*this); return tmp ^= v2; }
    uint64_block_t operator|(const uint64_block_t& v2) const { uint64_block_t tmp(*this); return tmp |= v2; }
    uint64_block_t operator~() const { uint64_block_t tmp; for (size_t i = 0; i < size; ++i) tmp.v[i] = ~v[i]; return tmp; }
    bool operator==(const uint64_block_t& v2) const { for (size_t i = 0; i < size; ++i) if (v[i] != v2.v[i]) return false; return true; }
    bool operator!=(const uint64_block_t& v2) const { for (size_t i = 0; i < size; ++i) if (v[i] != v2.v[i]) return true; return false; }

    bool get_bit(size_t c) const   { return (v[c/64]>>(c%64))&1; }
    void set_bit(size_t c)         { v[c/64] |= uint64_t(1)<<(c%64); }
    void flip_bit(size_t c)        { v[c/64] ^= uint64_t(1)<<(c%64); }
    void clear_bit(size_t c)       { v[c/64] &= ~(uint64_t(1)<<(c%64)); }
    void set_bit(size_t c, bool b) { v[c/64] &= ~(uint64_t(1)<<(c%64)); v[c/64] |= uint64_t(b ? 1 : 0)<<(c%64); }
    size_t hammingweight() const   { size_t w = 0; for (size_t i = 0; i < size; ++i) w += std::popcount(v[i]); return w; }
    bool parity() const            { uint64_t x = 0; for (size_t i = 0; i < size; ++i) x ^= v[i]; return std::popcount(x)%2; }
};

and then convert the uint64_t pointer to a pointer to this struct using reinterpret_cast.

Converting a loop over uint64_t into a loop over these blocks typically auto-vectorizes very well.
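A minimal usage sketch of this approach, assuming the uint64_block_t template above (the function xor_buffers and the 512-bit instantiation are illustrative, not from the original answer):

#include <cstddef>
#include <cstdint>

using block512 = uint64_block_t<512>;  // alignas(64), holds 8 uint64_t

// Illustrative helper: assumes x and y really are 64-byte aligned and that
// n_words is a multiple of block512::size; otherwise behavior is undefined.
void xor_buffers(uint64_t * x, uint64_t const * y, std::size_t n_words) {
    auto * xb = reinterpret_cast<block512 *>(x);
    auto const * yb = reinterpret_cast<block512 const *>(y);
    for (std::size_t i = 0; i < n_words / block512::size; ++i)
        xb[i] ^= yb[i];
}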

As I infer from your own answer, you're interested in an MSVC solution too.

MSVC understands the proper use of alignas as well as its own __declspec(align); it also understands __builtin_assume_aligned, but it intentionally does not do anything with the known alignment.

My report was closed as "Duplicate":

The related reports were closed as "Not a bug":

MSVC does still take advantage of the alignment of global variables, if it can observe that the pointer points to the global variable. Even this does not work in every case.
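For illustration, a sketch of the global-variable case being described (whether MSVC actually emits aligned moves here depends on version and options):

#include <cstdint>

// MSVC can see that the operands are these globals themselves, so their
// 64-byte alignment is known at the use site.
alignas(64) uint64_t g_x[16];
alignas(64) uint64_t g_y[16];

void xor_globals() {
    for (int i = 0; i < 16; ++i)
        g_x[i] ^= g_y[i];
}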

Just now @MarcStevens suggested a working solution to my question, using __builtin_assume_aligned:

Try it online!

#include <cstdint>

void f(uint64_t * x_, uint64_t * y_) {
    uint64_t * x = (uint64_t *)__builtin_assume_aligned(x_, 64);
    uint64_t * y = (uint64_t *)__builtin_assume_aligned(y_, 64);

    for (int i = 0; i < 16; ++i)
        x[i] ^= y[i];
}

It actually produces code with the aligned vmovdqa64 instruction.

But only GCC produces the aligned instruction. Clang still uses the unaligned one (see here); also, Clang uses AVX-512 registers only with more than 16 elements.

So Clang and MSVC solutions are still welcome.
