Why doesn't gcc (ARM) use global register variables as source operands?

Question

Here is a C source code example:

register int a asm("r8");
register int b asm("r9");

int main() {
    int c;
    a=2;
    b=3;
    c=a+b;
    return c;
}

And this is the assembly generated by an ARM gcc cross compiler:

$ arm-linux-gnueabi-gcc  -c global_reg_var_test.c -Wa,-a,-ad

...
mov     r8, #2
mov     r9, #3
mov     r2, r8
mov     r3, r9
add     r3, r2, r3
...

When using -frename-registers, the behaviour was the same. (Updated: I had previously said this was with -O3.)

So the question is: why does gcc add the 3rd and 4th MOVs instead of emitting 'ADD R3, R8, R9'?

Context: I need to optimize code for a simulated in-order CPU (gem5 ARM MinorCPU) that doesn't rename registers.

Answer 1

I took the real example (posted in the comments) and put it on the Godbolt compiler explorer. The main inefficiency in calc() is that src1 and src2 are globals that it has to load from memory, instead of args passed in registers.

I didn't look at main, just calc.

register int sum asm ("r4");
register int r asm ("r5");
register int c asm ("r6");
register int k asm ("r7");
register int temp1 asm ("r8");    // really?  you're using two global register vars for scratch temporaries?  Just let the compiler do its job.
register int temp2 asm ("r9");
register long n asm ("r10");
int *src1, *src2, *dst;

void calc() {
  temp1 = r*n;
  temp2 = k*n;

  temp1 = temp1+k;
  temp2 = temp2+c;

  // you get bad code for this because src1 and src2 are globals, not args passed in regs
  sum = sum + src1[temp1] * src2[temp2];
}

    # gcc 4.8.2 -O3 -Wall -Wextra -Wa,-a,-ad -fverbose-asm
    mla     r0, r10, r7, r6          @ temp2.9, n, k, c   @@ tmp = k*n + c
    movw    r3, #:lower16:.LANCHOR0  @ tmp136,
    mla     r8, r10, r5, r7          @ temp1, n, r, k     @@ temp1 = r*n + k
    movt    r3, #:upper16:.LANCHOR0  @ tmp136,
    ldmia   r3, {r1, r2}             @ tmp136,,           @@ load both pointers, since they're stored adjacently in memory
    mov     r9, r0                   @ temp2, temp2.9     @@ This insn is wasted: the first MLA should have had this as the dest
    ldr     r3, [r1, r8, lsl #2]     @ *_22, *_22
    ldr     r2, [r2, r9, lsl #2]     @ *_28, *_28
    mla     r4, r2, r3, r4           @ sum, *_28, *_22, sum
    bx      lr                       @

For some reason, one of the integer multiply-accumulate (mla) instructions uses r8 (temp1) as its destination, but the other one writes to r0 (a scratch reg) and only later moves the result to r9 (temp2).

The sum += src1[temp1] * src2[temp2] is done with an mla that reads and writes r4 (sum).

Why do you need temp1 and temp2 to be globals? That's just going to stop the optimizer from doing aggressive optimizations that don't calculate exactly the same temporaries that the C source does. Fortunately the C memory model is weak enough that it should be able to reorder assignments to them, although this might actually be why it didn't MLA into temp2 directly, since it decided to do that calculation first. (Hmm, does the memory model even apply? Other threads can't see our registers at all, so those globals are all effectively thread-local. It should allow relaxed ordering for assignments to globals. Signal handlers can see these globals, and could run at any point. gcc isn't following strict source order, since in the source both multiplies happen before either add.)
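To make the reordering point concrete, here is a minimal sketch that computes the two indices in source order and in the order the gcc 4.8.2 output uses. Plain globals stand in for the register globals (an assumption, so that it builds on any target and not only on ARM), and the function names indices_source_order and indices_mla_order are hypothetical. For inputs that don't overflow, both orders leave the same values in temp1 and temp2.

```c
/* Plain globals stand in for the register globals from the answer.
   This is an assumption made so the sketch builds on non-ARM targets. */
int r, c, k, temp1, temp2;
long n;

/* Same statement order as the C source of calc():
   both multiplies happen before either add. */
void indices_source_order(void)
{
    temp1 = r * n;
    temp2 = k * n;
    temp1 = temp1 + k;
    temp2 = temp2 + c;
}

/* Order seen in the compiler output: each index is one
   multiply-accumulate, and temp2's is issued first. */
void indices_mla_order(void)
{
    temp2 = k * n + c;
    temp1 = r * n + k;
}
```

A signal handler that read temp1 between the two statements could in principle tell the two orders apart, which is the concern raised in the parenthetical above.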

Godbolt doesn't have a newer ARM gcc version, so I can't easily test one. A newer gcc might do a better job with this.

BTW, I tried a version of the function using local variables for temporaries, and didn't actually get better results. Probably because there are still so many register globals that gcc couldn't pick convenient regs for the temporaries.

// same register globals, except for temp1 and temp2.

void calc_local_tmp() {
  int t1 = r*n + k;
  sum += src1[t1] * src2[k*n + c];
}

    push    {lr}                      @ gcc decides to push to get a tmp reg
    movw    r3, #:lower16:.LANCHOR0   @ tmp131,
    mla     lr, r10, r5, r7           @ tmp133, n.1, r, k.2
    movt    r3, #:upper16:.LANCHOR0   @ tmp131,
    mla     ip, r7, r10, r6           @ tmp137, k.2, n.1, c
    ldr     r2, [r3]                  @ src1, src1
    ldr     r0, [r3, #4]              @ src2, src2
    ldr     r1, [r2, lr, lsl #2]      @ *_10, *_10
    ldr     r3, [r0, ip, lsl #2]      @ *_20, *_20
    mla     r4, r3, r1, r4            @ sum, *_20, *_10, sum
    ldr     pc, [sp], #4              @

Compiling with -fcall-used-r8 -fcall-used-r9 didn't help; gcc makes the same code that pushes lr to get an extra temporary. It fails to use ldmia (load-multiple) because it makes a sub-optimal choice of which temporary to put in which reg. (&src1 in r0 would let it load src1 and src2 into r2 and r3.)
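Since the main inefficiency noted above is that src1 and src2 are globals rather than args, here is a rough sketch of what an argument-passing version might look like. The name calc_args and its parameter list are hypothetical, not something from the question, and I haven't checked what code gcc generates for it. It only shows the shape: pointers and scalars come in as arguments, the temporaries are locals, and the new sum is returned.

```c
#include <assert.h>

/* Hypothetical variant of calc(): everything is passed in and the new sum
   is returned, so the compiler is free to choose registers for the
   temporaries instead of being pinned to r8/r9. */
int calc_args(int sum, const int *src1, const int *src2,
              int r, int c, int k, long n)
{
    long i1 = r * n + k;   /* index into src1 (was temp1) */
    long i2 = k * n + c;   /* index into src2 (was temp2) */

    /* Guard against negative indices; the upper bound is not known here. */
    assert(i1 >= 0 && i2 >= 0);

    return sum + src1[i1] * src2[i2];
}
```

A caller would use it as sum = calc_args(sum, src1, src2, r, c, k, n); whether that is practical depends on how many of the register globals the rest of the program really needs.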

2 2016-04-15 01:54:54