I don't know if I'm missing something in my understanding of how AVX intrinsics work with std::array, but I'm having a strange issue with Clang when I combine the two.
Sample code:
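#include <array>
#include <cstddef>
#include <iostream>
#include <immintrin.h>

std::array<__m256, 1> gen_data()
{
    std::array<__m256, 1> res;
    res[0] = _mm256_set1_ps(1);  // broadcast 1.0f to all 8 lanes
    return res;
}

int main()
{
    auto v = gen_data();
    float a[8];
    _mm256_storeu_ps(a, v[0]);  // unaligned store of the 8 floats
    for (std::size_t i = 0; i < 8; ++i)
    {
        std::cout << a[i] << std::endl;
    }
}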
Output from Clang 3.5.0 (upper 4 floats are garbage data):
1 1 1 1 8.82272e-39 0 5.88148e-39 0
Output from GCC 4.8.2/4.9.1 (expected):
1 1 1 1 1 1 1 1
If I instead pass v into gen_data as an output parameter, it works just fine on both compilers. I'm willing to accept that this might be a bug in Clang, but I don't know whether it might be undefined behavior (UB). Testing with Clang 3.7 (svn build), Clang now appears to give my expected result. If I switch to 128-bit SSE intrinsics (__m128), then all compilers give the same expected results.
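For reference, a minimal sketch of the output-parameter variant described above. The target attribute is a GCC/Clang extension and is an assumption here, used so the snippet builds without -mavx; to_floats is a hypothetical helper added only to make the result easy to inspect.

```cpp
#include <array>
#include <immintrin.h>

// Fill a caller-owned array instead of returning std::array by value.
// Assumption: target("avx") is only needed when the file is not built with -mavx.
__attribute__((target("avx")))
void gen_data(std::array<__m256, 1>& out)
{
    out[0] = _mm256_set1_ps(1.0f);  // broadcast 1.0f to all 8 lanes
}

// Hypothetical helper: copy the 8 lanes of the first vector into plain floats.
__attribute__((target("avx")))
std::array<float, 8> to_floats(const std::array<__m256, 1>& v)
{
    std::array<float, 8> a;
    _mm256_storeu_ps(a.data(), v[0]);  // unaligned store
    return a;
}
```

The caller declares std::array<__m256, 1> v, calls gen_data(v), and reads the lanes back with to_floats(v), so no array of __m256 is ever returned by value.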
So my questions are:
This looks like a Clang bug that has since been fixed. We can see this from this bug report, which demonstrates a very similar problem using regular arrays.
Assuming std::array implements its storage similarly to this:
T elems[N];
which both libc++ and libstdc++ seem to do, the two cases should be analogous. One of the comments says:
However, libc++'s std::array<__m256i, 1> does not work at any optimization level.
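To make the storage assumption concrete, here is a small sketch. array_like is a hypothetical stand-in, not the actual libc++ or libstdc++ definition, and the size checks reflect what common implementations do rather than a guarantee from the standard.

```cpp
#include <array>
#include <cstddef>

// Hypothetical minimal aggregate mirroring the assumed std::array layout:
// a single raw array member and nothing else.
template <typename T, std::size_t N>
struct array_like
{
    T elems[N];
};

// On common implementations both wrappers are exactly as large as the raw
// array, which is why the raw-array bug report plausibly applies here too.
static_assert(sizeof(array_like<float, 8>) == sizeof(float[8]),
              "wrapper adds no storage beyond the raw array");
static_assert(sizeof(std::array<float, 8>) == sizeof(float[8]),
              "std::array adds no storage beyond the raw array");
```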
The bug report was actually based on this SO question, "Is this incorrect code generation with arrays of __m256 values a clang bug?", which is very similar but deals with the regular array case.
The bug report contains one possible work-around, which the OP stated is sufficient:
In my actual code, num_vectors is calculated based on some C++ template parameters to the simd_pack type. In many cases, that comes out to be 1, but it also is often greater than 1. Your observation gives me an idea, though; I could try to introduce a template specialization that catches the case where num_vectors == 1. It could instead just use a single __m256 member instead of an array of size 1. I'll have to check to see how feasible that is.
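A rough sketch of what such a specialization might look like. The real simd_pack is not shown in the bug report, so everything here is an assumption: the vector type is a template parameter (it would be __m256 in the quoted code), num_vectors is passed in directly rather than computed, and the member names are made up.

```cpp
#include <cstddef>

// Hypothetical general case: num_vectors > 1, stored as a raw array.
template <typename Vec, std::size_t NumVectors>
struct simd_pack
{
    Vec vecs[NumVectors];

    Vec&       operator[](std::size_t i)       { return vecs[i]; }
    const Vec& operator[](std::size_t i) const { return vecs[i]; }
    static constexpr std::size_t size() { return NumVectors; }
};

// Specialization for num_vectors == 1: a single member, no array of size 1.
template <typename Vec>
struct simd_pack<Vec, 1>
{
    Vec vec;

    // The index is ignored because there is only one vector.
    Vec&       operator[](std::size_t)       { return vec; }
    const Vec& operator[](std::size_t) const { return vec; }
    static constexpr std::size_t size() { return 1; }
};
```

Both versions expose the same operator[] and size(), so calling code would not need to know which one it got. Whether this actually sidesteps the miscompilation would still need to be checked against the affected Clang version.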