gcc auto-vectorization fails in a reduction loop

Question

I am trying to compile my code with auto-vectorization flags but I encounter a failure in a very simple reduction loop:

double node3::GetSum(void){
    double sum=0.;
    for(int i=0;i<8;i++) sum+=c_value[i];
    return sum;
}

where the c_value array is declared as

class node3{
private:
    double c_value[9];
    // ...
};

The vectorizer report reads:

Analyzing loop at node3.cpp:10

node3.cpp:10: note: step unknown.
node3.cpp:10: note: reduction: unsafe fp math optimization: sum_6 = _5 + sum_11;

node3.cpp:10: note: Unknown def-use cycle pattern.
node3.cpp:10: note: Unsupported pattern.
node3.cpp:10: note: not vectorized: unsupported use in stmt.
node3.cpp:10: note: unexpected pattern.
node3.cpp:8: note: vectorized 0 loops in function.

node3.cpp:10: note: Failed to SLP the basic block.
node3.cpp:10: note: not vectorized: failed to find SLP opportunities in basic block.

I do not understand why it cannot find an SLP opportunity in the basic block, for example. I also do not understand what "unsupported use in stmt" refers to: the loop simply sums an array that is accessed sequentially.

Could the problem be caused by c_value[] being a private member of the class?

Thanks in advance.

Note: compiled with g++ -c -O3 -ftree-vectorizer-verbose=2 -march=native node3.cpp. I also tried the more specific -march=corei7, with the same results. GCC version: 4.8.1.

Answer 1

In the end I managed to get the loop vectorized with the following trick:

double node3::GetSum(void){
    double sum=0.,tmp[8];
    tmp[0]=c_value[0]; tmp[1]=c_value[1]; tmp[2]=c_value[2]; tmp[3]=c_value[3];
    tmp[4]=c_value[4]; tmp[5]=c_value[5]; tmp[6]=c_value[6]; tmp[7]=c_value[7];
    for(int i=0;i<8;i++) sum+=tmp[i];
    return sum;
}

Here tmp[] is a dummy local array. This trick, together with one more compilation flag, -funsafe-math-optimizations, makes the auto-vectorization succeed. (@Mysticial: this flag is the only one I actually need; -ffast-math enables it along with other options that I apparently do not need.)
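The "unsafe fp math optimization" note in the report comes from the fact that floating-point addition is not associative: a vectorized reduction keeps one partial sum per vector lane and combines them at the end, which changes the order of the additions. The sketch below imitates that reordering by hand for a two-lane vector of doubles. Node3Sketch and its method names are hypothetical stand-ins for the class in the question, not the original code.

```cpp
#include <cassert>

// Hypothetical stand-in for node3, with the same array as in the question.
struct Node3Sketch {
    double c_value[9];

    // Strictly sequential sum: the order the source code asks for.
    double GetSumSequential() const {
        double sum = 0.;
        for (int i = 0; i < 8; i++) sum += c_value[i];
        return sum;
    }

    // Two independent partial sums, imitating a two-lane vector accumulator.
    // The additions happen in a different order than in GetSumSequential().
    double GetSumTwoLanes() const {
        double even = 0., odd = 0.;
        for (int i = 0; i < 8; i += 2) {
            even += c_value[i];      // lane 0: elements 0, 2, 4, 6
            odd  += c_value[i + 1];  // lane 1: elements 1, 3, 5, 7
        }
        return even + odd;           // final combination of the lanes
    }
};
```

For most inputs both functions agree, but with c_value starting {1e20, 1, -1e20, 1} followed by zeros, the sequential version returns 1 while the two-lane version returns 2. GCC will not make that change on its own unless a flag such as -funsafe-math-optimizations permits it.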

I don't know whether the tmp[] workaround really speeds up execution. The loop does vectorize, but I added eight assignments, so I'm not sure the function runs faster overall. My feeling is that in the long run (calling the function many times) it is slightly faster, but I can't prove that. In any case it is a possible solution to the vectorization problem, so I posted it as an answer.
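One way to check the speed question is to time many calls of each variant. The following is only a minimal sketch, assuming a C++11 compiler (for example -std=c++11 with GCC 4.8); sum_direct, sum_via_tmp and time_sum are hypothetical free-function names introduced for illustration, not part of the original class.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical free-function version of the original GetSum().
double sum_direct(const double* c) {
    double sum = 0.;
    for (int i = 0; i < 8; i++) sum += c[i];
    return sum;
}

// Hypothetical free-function version of the tmp[] workaround.
double sum_via_tmp(const double* c) {
    double sum = 0., tmp[8];
    for (int i = 0; i < 8; i++) tmp[i] = c[i];  // same copy as in the answer
    for (int i = 0; i < 8; i++) sum += tmp[i];
    return sum;
}

// Calls f(data) `calls` times and returns the elapsed time in seconds.
// data[0] changes on every call so the sum cannot be hoisted out of the loop,
// and the accumulated result is written to *total so the calls are not dead code.
double time_sum(double (*f)(const double*), double* data, long calls, double* total) {
    double acc = 0.;
    std::chrono::steady_clock::time_point t0 = std::chrono::steady_clock::now();
    for (long k = 0; k < calls; ++k) {
        data[0] = static_cast<double>(k & 7);
        acc += f(data);
    }
    std::chrono::steady_clock::time_point t1 = std::chrono::steady_clock::now();
    *total = acc;
    return std::chrono::duration<double>(t1 - t0).count();
}
```

Calling time_sum(sum_direct, data, N, &total) and time_sum(sum_via_tmp, data, N, &total) with a large N, compiled with the same flags as the real code, gives two timings to compare. For a loop this short the difference may be within measurement noise, so several repetitions are advisable.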

Answer 2

It's annoying that the freedom to vectorize reductions is coupled with other (literally) unsafe optimizations. In my examples, a bug surfaces (with gcc but not g++) when -mavx is combined with -funsafe-math-optimizations: a pointer that should never be touched gets clobbered. Auto-vectorization also does not consistently speed up such short loops, particularly because the sum-reduction epilogue, which uses the hadd instruction, is slow on the more common CPUs.
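For comparison, a reduction can also be written by hand with SSE2 intrinsics, which needs no unsafe-math flag because the reordering is spelled out in the source. The sketch below is an assumption-laden illustration, not what GCC generates: sum8_two_lanes is a hypothetical free function, and its epilogue folds the two lanes with an unpack shuffle and a scalar add rather than hadd. Whether that is actually faster depends on the CPU. A scalar fallback with the same summation order is used when SSE2 is not available.

```cpp
#include <cassert>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// Hypothetical helper: sums p[0..7]; p must point to at least 8 doubles.
double sum8_two_lanes(const double* p) {
#if defined(__SSE2__)
    // Lane 0 accumulates p[0], p[2], p[4], p[6]; lane 1 the odd elements.
    __m128d acc = _mm_add_pd(_mm_loadu_pd(p), _mm_loadu_pd(p + 2));
    acc = _mm_add_pd(acc, _mm_loadu_pd(p + 4));
    acc = _mm_add_pd(acc, _mm_loadu_pd(p + 6));
    // Epilogue: copy the high lane down and add it to the low lane.
    __m128d hi = _mm_unpackhi_pd(acc, acc);
    return _mm_cvtsd_f64(_mm_add_sd(acc, hi));
#else
    // Portable fallback with the same order of additions.
    double lo = ((p[0] + p[2]) + p[4]) + p[6];
    double hi = ((p[1] + p[3]) + p[5]) + p[7];
    return lo + hi;
#endif
}
```

Because the order of additions differs from the sequential loop, the result can differ from the original GetSum() in the last bits for some inputs.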
