
g++ -O3 creates strange instructions for loop

I'm writing some numerical code in C++, and I need to write it very carefully to help the compiler generate good instructions. While doing so, I noticed something strange with g++ 9.2 and the -O3 flag. I am not an assembly expert, so I would like someone to help me or point out where I am wrong.

The full code can be found at https://godbolt.org/z/fyuYtq . The key snippet is copied here:

void sum_twopointer(Elem *p1, Elem *p2, ptrdiff_t stride, ptrdiff_t start, ptrdiff_t end) {

    Elem sm = 0;
    for(auto i = start;i != end; ++i) {
        p1[0] = p2[0] + p2[0];
        p1 += stride;
        p2 += stride;
    }

}

It is compiled with g++ -O3 (g++ version 9.2). The generated assembly is:

sum_twopointer(double*, double*, long, long, long):
  cmp rcx, r8
  je .L32
  lea r9, [0+rdx*8]
  xor eax, eax
  cmp rdx, 1
  jne .L36
.L34:
  movsd xmm0, QWORD PTR [rsi+rax]
  add rcx, 1
  addsd xmm0, xmm0
  movsd QWORD PTR [rdi+rax], xmm0
  add rax, r9
  cmp r8, rcx
  jne .L34
.L32:
  ret
.L36:
  movsd xmm0, QWORD PTR [rsi+rax]
  add rcx, 1
  addsd xmm0, xmm0
  movsd QWORD PTR [rdi+rax], xmm0
  add rax, r9
  cmp r8, rcx
  jne .L36
  ret

As I understand it, the compiler is trying to optimize the special case where stride is exactly 1, so it creates a separate branch for stride == 1, but then does nothing further with it. Note that the code following .L34 is identical to the code following .L36.
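To make that reading of the assembly concrete, here is a hand-written source-level sketch of the control flow it appears to implement. It is not compiler output. The name sum_twopointer_versioned is hypothetical, and Elem is assumed to be double, as the signature in the assembly listing suggests. The labels in the comments refer to the listing above.

```cpp
#include <cstddef>

using Elem = double;  // assumption: matches sum_twopointer(double*, double*, long, long, long)

// Hypothetical rewrite mirroring the branch structure of the -O3 assembly.
void sum_twopointer_versioned(Elem *p1, Elem *p2, std::ptrdiff_t stride,
                              std::ptrdiff_t start, std::ptrdiff_t end) {
    if (start == end) {          // cmp rcx, r8 / je .L32
        return;
    }
    if (stride == 1) {           // cmp rdx, 1 / jne .L36 (falls through to .L34)
        do {                     // .L34: the copy intended for stride == 1
            p1[0] = p2[0] + p2[0];
            p1 += stride;        // still advances by the runtime stride (add rax, r9)
            p2 += stride;
            ++start;
        } while (start != end);
    } else {
        do {                     // .L36: the general copy, with an identical body
            p1[0] = p2[0] + p2[0];
            p1 += stride;
            p2 += stride;
            ++start;
        } while (start != end);
    }
}
```

Both branches do the same scalar work, so the extra test on stride gains nothing in this form.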

I have run some benchmarks for this. The results for stride=1 and stride=2 are listed below. The benchmark code is at https://gist.github.com/lhprojects/dac3a9fcf15bd5b1ec365ba6a87c679d

g++ -O2
---------------------------------------------------------------
Benchmark                     Time             CPU   Iterations
---------------------------------------------------------------
BM_twopointer/8192/1       3743 ns         3742 ns       185062      stride=1
BM_twopointer/8192/2       1980 ns         1980 ns       328523      stride=2

g++ -O3
---------------------------------------------------------------
Benchmark                     Time             CPU   Iterations
---------------------------------------------------------------
BM_twopointer/8192/1       5006 ns         5001 ns       120725      stride=1
BM_twopointer/8192/2       2043 ns         2041 ns       333914      stride=2

So for stride=1, performance gets worse with -O3 compared to -O2. I want to know what happened to my code. Did I trigger some undefined behavior in C++? Or is there simply a defect in g++'s code optimization? (I am sorry if my English is confusing.)

I believe the compiler needs to know that p1 and p2 don't overlap. Declaring them as __restrict pointers should allow the compiler to actually use SIMD instructions. It does seem odd to me that it would create a special case for stride == 1 but then not do anything with that knowledge.
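A minimal sketch of that suggestion follows, assuming Elem is double (as the assembly signature indicates). The name sum_twopointer_restrict is hypothetical. __restrict is a compiler extension accepted by GCC, Clang and MSVC, not standard C++. Whether g++ 9.2 then vectorizes the stride == 1 path is best confirmed by inspecting the generated assembly.

```cpp
#include <cstddef>

using Elem = double;  // assumption: Elem is double in the original code

// Same loop as sum_twopointer, but the pointers are marked __restrict.
// This promises the compiler that the memory accessed through p1 and p2
// does not overlap. Calling it with overlapping ranges is undefined behavior.
void sum_twopointer_restrict(Elem *__restrict p1, const Elem *__restrict p2,
                             std::ptrdiff_t stride,
                             std::ptrdiff_t start, std::ptrdiff_t end) {
    for (auto i = start; i != end; ++i) {
        p1[0] = p2[0] + p2[0];
        p1 += stride;
        p2 += stride;
    }
}
```

The unused local sm from the original snippet is dropped here. The promise made by __restrict is the caller's responsibility, so this variant is only appropriate when the two arrays are known to be distinct.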
