std::array of AVX intrinsics

Question

I don't know if there's something missing in my understanding of how AVX intrinsics work with std::array, but I'm having a strange issue with Clang when I combine the two.

Sample code:

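#include <immintrin.h>
#include <array>
#include <cstddef>
#include <iostream>

// Return a one-element array whose vector has all 8 lanes set to 1.
std::array<__m256, 1> gen_data()
{
    std::array<__m256, 1> res;
    res[0] = _mm256_set1_ps(1);
    return res;
}

int main()
{
    auto v = gen_data();
    float a[8];
    // Unaligned store of the 8 lanes so they can be printed.
    _mm256_storeu_ps(a, v[0]);
    for(std::size_t i = 0; i < 8; ++i)
    {
        std::cout << a[i] << std::endl;
    }
}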

Output from Clang 3.5.0 (upper 4 floats are garbage data):

1
1
1
1
8.82272e-39
0
5.88148e-39
0

Output from GCC 4.8.2/4.9.1 (expected):

1
1
1
1
1
1
1
1

If I instead pass v into gen_data as an output parameter, it works fine on both compilers. I'm willing to accept that this might be a bug in Clang, but I don't know whether it might instead be undefined behavior (UB). Testing with Clang 3.7 (svn build), Clang now appears to give my expected result. If I switch to the 128-bit SSE intrinsics (__m128), all compilers give the same expected results.
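For reference, the output-parameter variant mentioned above might look like the following sketch. The names gen_data_out and copy_pack_to_floats are hypothetical and not taken from the original code. The target("avx") attribute is a GCC/Clang extension, used here only so the sketch builds without -mavx; when compiling with -mavx it can be dropped.

```cpp
#include <immintrin.h>
#include <array>

// Hypothetical variant of gen_data: the caller owns the array and the
// function fills it in place, so no std::array<__m256, 1> is returned by value.
__attribute__((target("avx")))
void gen_data_out(std::array<__m256, 1>& res)
{
    res[0] = _mm256_set1_ps(1.0f);
}

// Hypothetical helper: fills a one-element pack through the output
// parameter and copies its 8 lanes into a plain float array.
__attribute__((target("avx")))
void copy_pack_to_floats(float (&out)[8])
{
    std::array<__m256, 1> v;
    gen_data_out(v);
    _mm256_storeu_ps(out, v[0]);  // unaligned store, out needs no special alignment
}
```

Calling copy_pack_to_floats(a) on a float a[8] and printing a gives the same loop body as in main above, with the by-value return taken out of the picture.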

So my questions are:

Is there any UB here? Or is this just a bug in Clang 3.5.0?
Is my understanding that __m256 is simply a 32-byte aligned chunk of memory correct? Or is there something else special about it that I have to be careful with?

Answer 1

This looks like a Clang bug that has since been fixed. We can see this from this bug report, which demonstrates a very similar problem using regular arrays.

Assuming std::array implements its storage similar to this:

T elems[N];

which both libc++ and libstdc++ seem to do, then this should be analogous. One of the comments says:

However, libc++'s std::array<__m256i, 1> does not work at any optimization level.

The bug report was actually based on this SO question: Is this incorrect code generation with arrays of __m256 values a clang bug?, which is very similar but deals with the regular array case.

The bug report contains one possible work-around, which the OP stated is sufficient:

In my actual code, num_vectors is calculated based on some C++ template parameters to the simd_pack type. In many cases, that comes out to be 1, but it also is often greater than 1. Your observation gives me an idea, though; I could try to introduce a template specialization that catches the case where num_vectors == 1 . It could instead just use a single __m256 member instead of an array of size 1. I'll have to check to see how feasible that is.
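The specialization idea in that comment could be sketched roughly as follows. The name pack_storage is hypothetical (the comment only mentions simd_pack and num_vectors), and whether this actually sidesteps the problem on an affected Clang version has not been verified here.

```cpp
#include <immintrin.h>
#include <cassert>
#include <cstddef>

// Hypothetical storage helper for a simd_pack-like type.
// General case (num_vectors > 1): keep a plain array of __m256.
template <std::size_t NumVectors>
struct pack_storage
{
    static_assert(NumVectors > 0, "a pack needs at least one vector");

    __m256 v[NumVectors];

    __m256& operator[](std::size_t i)
    {
        assert(i < NumVectors);
        return v[i];
    }

    const __m256& operator[](std::size_t i) const
    {
        assert(i < NumVectors);
        return v[i];
    }
};

// Special case (num_vectors == 1): store a single __m256 member
// instead of an array of size 1, keeping the same indexing interface.
template <>
struct pack_storage<1>
{
    __m256 v;

    __m256& operator[](std::size_t i)
    {
        assert(i == 0);
        (void)i;  // silence unused-parameter warnings when NDEBUG is defined
        return v;
    }

    const __m256& operator[](std::size_t i) const
    {
        assert(i == 0);
        (void)i;
        return v;
    }
};
```

A simd_pack-like type could then hold a pack_storage<num_vectors> member and index it the same way in both cases, so the rest of the code does not need to know which layout is in use.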

solution1
5 ACCEPTED 2015-03-24 03:49:31
