简体   繁体   中英

gcc auto-vectorization fails in a reduction loop

I am trying to compile my code with auto-vectorization flags but I encounter a failure in a very simple reduction loop:

double node3::GetSum(void){
    double sum=0.;
    for(int i=0;i<8;i++) sum+=c_value[i];
    return sum;
}

where the c_value[i] array is defined as

class node3{
private:
    double c_value[9];

The auto-vectorization compilation returns: Analyzing loop at node3.cpp:10

node3.cpp:10: note: step unknown.
node3.cpp:10: note: reduction: unsafe fp math optimization: sum_6 = _5 + sum_11;

node3.cpp:10: note: Unknown def-use cycle pattern.
node3.cpp:10: note: Unsupported pattern.
node3.cpp:10: note: not vectorized: unsupported use in stmt.
node3.cpp:10: note: unexpected pattern.
node3.cpp:8: note: vectorized 0 loops in function.

node3.cpp:10: note: Failed to SLP the basic block.
node3.cpp:10: note: not vectorized: failed to find SLP opportunities in basic block.

I really do not understand why it can't determine the basic block for SLP for example. Moreover I guess I did not understand what really is the "unsupported use in stmt": the loop here simply sums a sequential access array.

Could such problems be caused by the fact that c_value[] is defined in the private of the class?

Thanks in advance.

Note: compiled as g++ -c -O3 -ftree-vectorizer-verbose=2 -march=native node3.cpp and also tried with more specific -march=corei7 but same results. GCC Version: 4.8.1

I managed to vectorize the loop at the end with the following trick:

double node3::GetSum(void){
    double sum=0.,tmp[8];
    tmp[0]=c_value[0]; tmp[1]=c_value[1]; tmp[2]=c_value[2]; tmp[3]=c_value[3];
    tmp[4]=c_value[4]; tmp[5]=c_value[5]; tmp[6]=c_value[6];tmp[7]=c_value[7];
    for(int i=0;i<8;i++) sum+=tmp[i];
    return sum;
}

where I created the dummy array tmp[] . This trick, together with another compilation flag ie, -funsafe-math-optimizations (@Mysticial: this is actually the only thing I need, -ffast-math with other things I apparently don't need), makes the auto-vectorization successful.

Now, I don't really know if this solution really speeds-up the execution. It does vectorize, but I added an assign operation so I'm not sure if this should run faster. My feeling is that on the long run (calling the function many times) it does speed-up a little, but I can't prove that. Anyway this is a possible solution to the vectorization problem, so I posted as an answer.

It's annoying that the freedom to vectorize reductions is coupled with other (literally) unsafe optimizations. In my examples, a bug is surfacing (with gcc but not g++) with the combination of -mavx and -funsafe-math-optimizations, where a pointer which should never be touched gets clobbered. Auto-vectorization doesn't consistently speed up such short loops, particularly because the sum reduction epilogue with the hadd instruction is slow on the more common CPUs.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM