Speed of accessing local vs. global variables in gcc/g++ at different optimization levels

Question

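I found that different compiler optimization levels in gcc give quite different results when accessing a local or a global variable in a loop. This surprised me because, if access to one kind of variable is more optimizable than access to the other, I would expect gcc's optimizer to exploit that fact. Here are two examples (in C++, but their C counterparts give practically the same timings):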

    global = 0;
    for (int i = 0; i < SIZE; i++)
        global++;

which uses a global variable long global, versus

    long tmp = 0;
    for (int i = 0; i < SIZE; i++)
        tmp++;
    global = tmp;

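At optimization level -O0 the timings are essentially equal (as I would expect), and at -O1 both are somewhat faster but still equal. From -O2 onward, however, the version using the global variable is much faster (roughly 7x at -O3 and -O4, and about 25x at -O2 in the timings below).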

On the other hand, consider the following code fragments, where start points to a block of bytes of size SIZE:

    global = 0;
    for (const char* p = start; p < start + SIZE; p++)
        global += *p;

versus

    long tmp = 0;
    for (const char* p = start; p < start + SIZE; p++)
        tmp += *p;
    global = tmp;

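Here the timings at -O0 are close, although the version using the local variable is slightly faster. That does not seem too surprising, since the local may be kept in a register whereas global will not. At -O1 and higher, the version using the local variable is considerably faster (roughly 1.5 to 1.8 times in the timings below). As noted above, this surprises me, because I would think it is as easy for gcc as it is for me to use a local variable in the generated optimized code and assign it to global afterwards.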

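So my question is: what is it about global and local variables that lets gcc perform certain optimizations on only one kind and not the other?

Some details that may or may not be relevant: I used gcc/g++ version 3.4.5 on a machine running RHEL4 with two single-core processors and 4GB RAM. The value I used for SIZE is a preprocessor macro set to 1000000000. The block of bytes in the second example was dynamically allocated.

Here is some timing output for optimization levels 0 to 4 (in the same order as above):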

    $ ./st0
    Result using global variable: 1000000000 in 2.213 seconds.
    Result using local variable:  1000000000 in 2.210 seconds.
    Result using global variable: 0 in 3.924 seconds.
    Result using local variable:  0 in 3.710 seconds.
    $ ./st1
    Result using global variable: 1000000000 in 0.947 seconds.
    Result using local variable:  1000000000 in 0.947 seconds.
    Result using global variable: 0 in 2.135 seconds.
    Result using local variable:  0 in 1.212 seconds.
    $ ./st2
    Result using global variable: 1000000000 in 0.022 seconds.
    Result using local variable:  1000000000 in 0.552 seconds.
    Result using global variable: 0 in 2.135 seconds.
    Result using local variable:  0 in 1.227 seconds.
    $ ./st3
    Result using global variable: 1000000000 in 0.065 seconds.
    Result using local variable:  1000000000 in 0.461 seconds.
    Result using global variable: 0 in 2.453 seconds.
    Result using local variable:  0 in 1.646 seconds.
    $ ./st4
    Result using global variable: 1000000000 in 0.063 seconds.
    Result using local variable:  1000000000 in 0.468 seconds.
    Result using global variable: 0 in 2.467 seconds.
    Result using local variable:  0 in 1.663 seconds.

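Edit: Here is the generated assembly for the first two fragments with switch -O2, the case where the difference is largest. As far as I can tell, this looks like a bug in the compiler: 0x3b9aca00 is SIZE in hexadecimal, and 0x80496dc must be the address of global. I checked with a newer compiler and this no longer happens. The difference for the second pair of fragments, however, is similar.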

    void global1()
    {
        int i;
        global = 0;
        for (i = 0; i < SIZE; i++)
            global++;
    }

    void local1()
    {
        int i;
        long tmp = 0;
        for (i = 0; i < SIZE; i++)
            tmp++;
        global = tmp;
    }

    080483d0 <global1>:
     80483d0:   55                      push   %ebp
     80483d1:   89 e5                   mov    %esp,%ebp
     80483d3:   c7 05 dc 96 04 08 00    movl   $0x0,0x80496dc
     80483da:   00 00 00 
     80483dd:   b8 ff c9 9a 3b          mov    $0x3b9ac9ff,%eax
     80483e2:   89 f6                   mov    %esi,%esi
     80483e4:   83 e8 19                sub    $0x19,%eax
     80483e7:   79 fb                   jns    80483e4 <global1+0x14>
     80483e9:   c7 05 dc 96 04 08 00    movl   $0x3b9aca00,0x80496dc
     80483f0:   ca 9a 3b 
     80483f3:   c9                      leave  
     80483f4:   c3                      ret    
     80483f5:   8d 76 00                lea    0x0(%esi),%esi

    080483f8 <local1>:
     80483f8:   55                      push   %ebp
     80483f9:   89 e5                   mov    %esp,%ebp
     80483fb:   b8 ff c9 9a 3b          mov    $0x3b9ac9ff,%eax
     8048400:   48                      dec    %eax
     8048401:   79 fd                   jns    8048400 <local1+0x8>
     8048403:   c7 05 dc 96 04 08 00    movl   $0x3b9aca00,0x80496dc
     804840a:   ca 9a 3b 
     804840d:   c9                      leave  
     804840e:   c3                      ret    
     804840f:   90                      nop
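
One rough way to read the two loops above: both count %eax down from 0x3b9ac9ff until it goes negative, but global1 subtracts 0x19 (25) per pass while local1 subtracts 1. The sketch below only models that countdown; countdown_iterations is a hypothetical helper written for illustration, not something taken from the compiler output.

```c
#include <assert.h>

/* Models a countdown loop of the form `sub $step,%eax; jns loop`:
   the body repeats until the counter becomes negative.
   Hypothetical helper, for illustration only. */
long countdown_iterations(long start, long step)
{
    assert(step > 0);
    long iterations = 0;
    long counter = start;
    do {
        counter -= step;   /* sub $step,%eax (or dec %eax when step is 1) */
        iterations++;
    } while (counter >= 0); /* jns: loop again while the result is not negative */
    return iterations;
}
```

With start 0x3b9ac9ff this gives 40,000,000 passes for a step of 25 and 1,000,000,000 passes for a step of 1, a ratio of 25, which is close to the 0.552 s versus 0.022 s measured at -O2.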

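Finally, here is the code for the remaining fragments, now generated by gcc 4.3.3 using -O3 (although older versions appear to generate similar code). It appears that global2(..) really does compile to a function that accesses the global memory location in every iteration of the loop, whereas local2(..) uses a register. It is still not clear to me why gcc does not optimize the global version to use a register as well. Is this just a missing feature, or would it lead to unacceptable behavior of the executable?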

    void global2(const char* start)
    {
        const char* p;
        global = 0;
        for (p = start; p < start + SIZE; p++)
            global += *p;
    }

    void local2(const char* start)
    {
        const char* p;
        long tmp = 0;
        for (p = start; p < start + SIZE; p++)
            tmp += *p;
        global = tmp;
    }

    08048470 <global2>:
     8048470:   55                      push   %ebp
     8048471:   31 d2                   xor    %edx,%edx
     8048473:   89 e5                   mov    %esp,%ebp
     8048475:   8b 4d 08                mov    0x8(%ebp),%ecx
     8048478:   c7 05 24 a0 04 08 00    movl   $0x0,0x804a024
     804847f:   00 00 00 
     8048482:   8d b6 00 00 00 00       lea    0x0(%esi),%esi
     8048488:   0f be 04 11             movsbl (%ecx,%edx,1),%eax
     804848c:   83 c2 01                add    $0x1,%edx
     804848f:   01 05 24 a0 04 08       add    %eax,0x804a024
     8048495:   81 fa 00 ca 9a 3b       cmp    $0x3b9aca00,%edx
     804849b:   75 eb                   jne    8048488 <global2+0x18>
     804849d:   5d                      pop    %ebp
     804849e:   c3                      ret    
     804849f:   90                      nop    

    080484a0 <local2>:
     80484a0:   55                      push   %ebp
     80484a1:   31 c9                   xor    %ecx,%ecx
     80484a3:   89 e5                   mov    %esp,%ebp
     80484a5:   31 d2                   xor    %edx,%edx
     80484a7:   53                      push   %ebx
     80484a8:   8b 5d 08                mov    0x8(%ebp),%ebx
     80484ab:   90                      nop    
     80484ac:   8d 74 26 00             lea    0x0(%esi,%eiz,1),%esi
     80484b0:   0f be 04 13             movsbl (%ebx,%edx,1),%eax
     80484b4:   83 c2 01                add    $0x1,%edx
     80484b7:   01 c1                   add    %eax,%ecx
     80484b9:   81 fa 00 ca 9a 3b       cmp    $0x3b9aca00,%edx
     80484bf:   75 ef                   jne    80484b0 <local2+0x10>
     80484c1:   5b                      pop    %ebx
     80484c2:   89 0d 24 a0 04 08       mov    %ecx,0x804a024
     80484c8:   5d                      pop    %ebp
     80484c9:   c3                      ret    
     80484ca:   8d b6 00 00 00 00       lea    0x0(%esi),%esi

Thanks.

Answer 1

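The pointer p cannot point to the local variable tmp, whose address is never taken, and the compiler can optimize accordingly. It is much harder to infer that p does not point to the global variable global, unless global is static, because the address of a non-static global could be taken in another compilation unit and passed around.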
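
To make the aliasing concern concrete, here is a minimal sketch in which the byte pointer really does point at global. The function names and the explicit size parameter are assumptions made for illustration; the loops otherwise mirror global2 and local2 from the question.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

long global;

/* Accumulates directly into the global, as global2 does.
   The size parameter is a hypothetical addition. */
void sum_via_global(const char *start, size_t size)
{
    assert(start != NULL);
    global = 0;
    for (const char *p = start; p < start + size; p++)
        global += *p;
}

/* Accumulates into a local and stores once, as local2 does. */
void sum_via_local(const char *start, size_t size)
{
    assert(start != NULL);
    long tmp = 0;
    for (const char *p = start; p < start + size; p++)
        tmp += *p;
    global = tmp;
}

/* Sets every byte of global to 1, then sums global's own bytes
   using the given function and returns the resulting global. */
long sum_own_bytes(void (*sum)(const char *, size_t))
{
    memset(&global, 1, sizeof global);
    sum((const char *)&global, sizeof global);
    return global;
}
```

Here sum_own_bytes(sum_via_global) yields 0, because global is zeroed before its bytes are read, while sum_own_bytes(sum_via_local) yields sizeof(long). Since the two versions can differ observably when p aliases global, the compiler may not silently turn one into the other.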

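If reading the assembly shows that the compiler forces itself to load from memory more often than you would expect, and you know that the aliasing it worries about cannot occur in practice, you can help it by copying the global variable into a local variable at the top of the function and using only the local in the rest of the function.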
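
A minimal sketch of that pattern, assuming the caller guarantees that the buffer does not overlap global (the function name and the size parameter are hypothetical):

```c
#include <assert.h>
#include <stddef.h>

long global;

/* Adds the bytes of a buffer to global.
   Assumption: the buffer never overlaps global itself; if it did,
   this version would behave differently from updating global in the loop. */
void add_bytes_to_global(const char *start, size_t size)
{
    assert(start != NULL);
    long acc = global;          /* copy the global into a local once */
    for (const char *p = start; p < start + size; p++)
        acc += *p;              /* the loop touches only the local */
    global = acc;               /* write the result back once */
}
```

Because the address of acc is never taken, the compiler is free to keep it in a register for the whole loop.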

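Lastly, note that if the pointer p had another type, the compiler could invoke the "strict aliasing rules" to optimize regardless of its inability to infer that p does not point to global. But because lvalues of type char are often used to observe the representation of other types, there is an allowance for this kind of aliasing, and the compiler cannot take this shortcut in your example.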
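
For comparison, here is a sketch of the same loop over int elements instead of bytes. The function name and the count parameter are assumptions. Since int and long are distinct types, the strict aliasing rules permit the compiler to assume that *p never refers to global; whether a particular gcc version actually keeps the sum in a register is not guaranteed.

```c
#include <assert.h>
#include <stddef.h>

long global;

/* Same shape as global2, but reading through a pointer to int.
   An int lvalue may not legally alias a long object, so the compiler
   is allowed (not required) to keep the running sum in a register. */
void sum_ints_into_global(const int *start, size_t count)
{
    assert(start != NULL);
    global = 0;
    for (const int *p = start; p < start + count; p++)
        global += *p;
}
```

GCC enables -fstrict-aliasing at -O2 and above; compiling with -fno-strict-aliasing and comparing the assembly from -S is one way to see whether the rule makes a difference for a given compiler.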

Answer 2

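Global variables = global memory, and subject to aliasing (read: bad for the optimizer, which must do a read-modify-write in the worst case).

Local variables = registers (unless the compiler really cannot avoid it; sometimes it has to put a local on the stack too, but the stack is practically guaranteed to be in L1).

Accessing a register takes on the order of a single cycle, while accessing memory takes on the order of 15-1000 cycles (depending on whether the cache line is in cache and has not been invalidated by another core, and on whether the page is in the TLB).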
