gcc -O0 outperforming -O3 on matrix sizes that are powers of 2 (matrix transpositions)

(For testing purposes) I have written a simple method to calculate the transpose of an nxn matrix:

void transpose(const size_t _n, double* _A) {
    for(uint i=0; i < _n; ++i) {
        for(uint j=i+1; j < _n; ++j) {
            double tmp  = _A[i*_n+j];
            _A[i*_n+j] = _A[j*_n+i];
            _A[j*_n+i] = tmp;
        }
    }
}

When using optimization levels -O3 or -Ofast I expected the compiler to unroll some loops, which would lead to higher performance, especially when the matrix size is a multiple of 2 (i.e., the loop body could be executed twice per unrolled iteration) or similar. Instead, what I measured was the exact opposite: powers of 2 actually show a significant spike in execution time.

These spikes also occur at regular intervals of 64, are more pronounced at intervals of 128, and so on. Each spike extends to the neighboring matrix sizes, as in the following table:

size n  time(us)
1020    2649
1021    2815
1022    3100
1023    5428
1024    15791
1025    6778
1026    3106
1027    2847
1028    2660
1029    3038
1030    2613
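
For reference, a minimal timing harness that could produce numbers like those above might look as follows. This is only a sketch, not the actual benchmark used; the use of std::chrono, the size range, and the initial fill value are assumptions made here for illustration.

#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

void transpose(std::size_t n, double* a);   // the function shown above

int main() {
    for (std::size_t n = 1020; n <= 1030; ++n) {
        // Filling the vector touches all pages, so page faults don't skew the timing.
        std::vector<double> a(n * n, 1.0);
        auto t0 = std::chrono::steady_clock::now();
        transpose(n, a.data());
        auto t1 = std::chrono::steady_clock::now();
        long long us = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
        std::printf("%zu\t%lld\n", n, us);
    }
}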

I compiled with gcc version 4.8.2, but the same thing happens with clang 3.5, so this might be some generic thing?

So my question basically is: why is there this periodic increase in execution time? Is it some generic thing that comes with any of the optimization options (as it happens with clang and gcc alike)? If so, which optimization option is causing this?

And how can this be so significant that even the -O0 version of the program outperforms the -O3 version at multiples of 512?

[Plot: execution time vs. matrix size for -O0 and -O3]


EDIT: Note the magnitude of the spikes in this (logarithmic) plot. Transposing a 1024x1024 matrix with optimization actually takes as much time as transposing a 1300x1300 matrix without optimization. If this is a cache-fault/page-fault problem, then someone needs to explain to me why the memory layout is so significantly different for the optimized version of the program that it fails for powers of two, only to recover high performance for slightly larger matrices. Shouldn't cache faults create more of a step-like pattern? Why do the execution times go down again at all? (And why would optimization create cache faults that weren't there before?)


EDIT: the following should be the assembler code that gcc produced.

no optimization (-O0):

_Z9transposemRPd:
.LFB0:
    .cfi_startproc
    push    rbp
    .cfi_def_cfa_offset 16
    .cfi_offset 6, -16
    mov rbp, rsp
    .cfi_def_cfa_register 6
    mov QWORD PTR [rbp-24], rdi
    mov QWORD PTR [rbp-32], rsi
    mov DWORD PTR [rbp-4], 0
    jmp .L2
.L5:
    mov eax, DWORD PTR [rbp-4]
    add eax, 1
    mov DWORD PTR [rbp-8], eax
    jmp .L3
.L4:
    mov rax, QWORD PTR [rbp-32]
    mov rdx, QWORD PTR [rax]
    mov eax, DWORD PTR [rbp-4]
    imul    rax, QWORD PTR [rbp-24]
    mov rcx, rax
    mov eax, DWORD PTR [rbp-8]
    add rax, rcx
    sal rax, 3
    add rax, rdx
    mov rax, QWORD PTR [rax]
    mov QWORD PTR [rbp-16], rax
    mov rax, QWORD PTR [rbp-32]
    mov rdx, QWORD PTR [rax]
    mov eax, DWORD PTR [rbp-4]
    imul    rax, QWORD PTR [rbp-24]
    mov rcx, rax
    mov eax, DWORD PTR [rbp-8]
    add rax, rcx
    sal rax, 3
    add rdx, rax
    mov rax, QWORD PTR [rbp-32]
    mov rcx, QWORD PTR [rax]
    mov eax, DWORD PTR [rbp-8]
    imul    rax, QWORD PTR [rbp-24]
    mov rsi, rax
    mov eax, DWORD PTR [rbp-4]
    add rax, rsi
    sal rax, 3
    add rax, rcx
    mov rax, QWORD PTR [rax]
    mov QWORD PTR [rdx], rax
    mov rax, QWORD PTR [rbp-32]
    mov rdx, QWORD PTR [rax]
    mov eax, DWORD PTR [rbp-8]
    imul    rax, QWORD PTR [rbp-24]
    mov rcx, rax
    mov eax, DWORD PTR [rbp-4]
    add rax, rcx
    sal rax, 3
    add rdx, rax
    mov rax, QWORD PTR [rbp-16]
    mov QWORD PTR [rdx], rax
    add DWORD PTR [rbp-8], 1
.L3:
    mov eax, DWORD PTR [rbp-8]
    cmp rax, QWORD PTR [rbp-24]
    jb  .L4
    add DWORD PTR [rbp-4], 1
.L2:
    mov eax, DWORD PTR [rbp-4]
    cmp rax, QWORD PTR [rbp-24]
    jb  .L5
    pop rbp
    .cfi_def_cfa 7, 8
    ret
    .cfi_endproc
.LFE0:
    .size   _Z9transposemRPd, .-_Z9transposemRPd
    .ident  "GCC: (Debian 4.8.2-15) 4.8.2"
    .section    .note.GNU-stack,"",@progbits

with optimization (-O3):

_Z9transposemRPd:
.LFB0:
    .cfi_startproc
    push    rbx
    .cfi_def_cfa_offset 16
    .cfi_offset 3, -16
    xor r11d, r11d
    xor ebx, ebx
.L2:
    cmp r11, rdi
    mov r9, r11
    jae .L10
    .p2align 4,,10
    .p2align 3
.L7:
    add ebx, 1
    mov r11d, ebx
    cmp rdi, r11
    mov rax, r11
    jbe .L2
    mov r10, r9
    mov r8, QWORD PTR [rsi]
    mov edx, ebx
    imul    r10, rdi
    .p2align 4,,10
    .p2align 3
.L6:
    lea rcx, [rax+r10]
    add edx, 1
    imul    rax, rdi
    lea rcx, [r8+rcx*8]
    movsd   xmm0, QWORD PTR [rcx]
    add rax, r9
    lea rax, [r8+rax*8]
    movsd   xmm1, QWORD PTR [rax]
    movsd   QWORD PTR [rcx], xmm1
    movsd   QWORD PTR [rax], xmm0
    mov eax, edx
    cmp rdi, rax
    ja  .L6
    cmp r11, rdi
    mov r9, r11
    jb  .L7
.L10:
    pop rbx
    .cfi_def_cfa_offset 8
    ret
    .cfi_endproc
.LFE0:
    .size   _Z9transposemRPd, .-_Z9transposemRPd
    .ident  "GCC: (Debian 4.8.2-15) 4.8.2"
    .section    .note.GNU-stack,"",@progbits

The periodic increase of execution time must be due to the cache being only N-way associative instead of fully associative. You are witnessing a hash collision related to the cache-line selection algorithm.

The fastest L1 cache has a smaller number of cache lines than the next level, L2. At each level, each cache line can be filled only from a limited set of sources.

Typical HW implementations of cache-line selection algorithms just use a few bits from the memory address to determine in which cache slot the data should be written -- in HW, bit shifts are free.
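
To make this concrete, here is a small stand-alone model of that bit-selection scheme. It is only a sketch under assumed parameters (a 32 KiB, 8-way L1 data cache with 64-byte lines, i.e. 64 sets, with the set index taken directly from address bits 6-11); real CPUs differ, but the effect is the same: with a power-of-two row stride, every step of a column walk lands in the same set.

#include <cstddef>
#include <cstdio>
#include <set>

// Assumed cache geometry: 32 KiB / (8 ways * 64-byte lines) = 64 sets.
constexpr std::size_t kLineSize = 64;
constexpr std::size_t kNumSets  = 64;

// Set index = (byte offset / line size) mod number of sets,
// i.e. just a slice of the address bits.
std::size_t cache_set(std::size_t byte_offset) {
    return (byte_offset / kLineSize) % kNumSets;
}

int main() {
    // Walking down one column of an n x n matrix of doubles touches addresses
    // that are n * 8 bytes apart. Count how many distinct sets the first 64
    // such accesses can use: the fewer the sets, the sooner an 8-way cache
    // has to start evicting lines it will need again.
    const std::size_t sizes[] = {1020, 1022, 1023, 1024, 1025};
    for (std::size_t n : sizes) {
        std::set<std::size_t> sets;
        for (std::size_t row = 0; row < kNumSets; ++row)
            sets.insert(cache_set(row * n * sizeof(double)));
        std::printf("n = %4zu -> column walk spreads over %2zu of %zu sets\n",
                    n, sets.size(), kNumSets);
    }
}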

This causes competition between memory ranges, e.g. between addresses 0x300010 and 0x341010. In a fully sequential algorithm this doesn't matter -- N is large enough for practically all algorithms of the form:

 for (i=0;i<1000;i++) a[i] += b[i] * c[i] + d[i];

But when the number of inputs (or outputs) gets larger, which happens internally when the algorithm is optimized, having one input in the cache forces another input out of the cache.

 // one possible method of optimization with 2 outputs and 6 inputs
 // with two unrelated execution paths -- should be faster, but maybe it isn't
 for (i=0;i<500;i++) { 
       a[i]     += b[i]     * c[i]     + d[i];
       a[i+500] += b[i+500] * c[i+500] + d[i+500];
 }

A graph in Example 5: Cache Associativity illustrates a 512-byte offset between matrix lines being a global worst-case dimension for that particular system. When this is known, a working mitigation is to over-allocate the matrix horizontally to some other dimension, e.g. char matrix[512][512 + 64].
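
A hedged sketch of that padding idea applied to the transpose above: store the n x n matrix with a padded row stride (leading dimension) so that consecutive rows are no longer exactly a power-of-two number of bytes apart. The extra 8 doubles (64 bytes, one assumed cache line) per row is an illustrative choice, not a tuned value.

#include <cstddef>
#include <utility>
#include <vector>

// Same swap loop as in the question, but indexing with a separate
// leading dimension 'ld' instead of 'n'.
void transpose_padded(std::size_t n, std::size_t ld, double* a) {
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = i + 1; j < n; ++j)
            std::swap(a[i * ld + j], a[j * ld + i]);
}

int main() {
    const std::size_t n  = 1024;
    const std::size_t ld = n + 8;           // padded row stride: 1032 doubles
    std::vector<double> a(n * ld, 0.0);     // rows are ld * 8 bytes apart
    transpose_padded(n, ld, a.data());
}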

The improvement in performance is likely related to CPU/RAM caching.

When the data is not a power of 2, a cache-line load (like 16, 32, or 64 words) transfers more than the data that is required, tying up the bus -- uselessly, as it turns out. For a data set which is a power of 2, all of the pre-fetched data is used.

I bet if you were to disable L1 and L2 caching, the performance would be completely smooth and predictable. But it would be much slower. Caching really helps performance!

Comment with code: in the -O3 case, with

#include <cstdlib>

extern void transpose(const size_t n, double* a)
{
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = i + 1; j < n; ++j) {
            std::swap(a[i * n + j], a[j * n + i]); // or your expanded version.
        }
    }
}

compiling with

$ g++ --version
g++ (Ubuntu/Linaro 4.8.1-10ubuntu9) 4.8.1
...
$ g++ -g1 -std=c++11 -Wall -o test.S -S test.cpp -O3

I get

_Z9transposemPd:
.LFB68:
    .cfi_startproc
.LBB2:
    testq   %rdi, %rdi
    je  .L1
    leaq    8(,%rdi,8), %r10
    xorl    %r8d, %r8d
.LBB3:
    addq    $1, %r8
    leaq    -8(%r10), %rcx
    cmpq    %rdi, %r8
    leaq    (%rsi,%rcx), %r9
    je  .L1
    .p2align 4,,10
    .p2align 3
.L10:
    movq    %r9, %rdx
    movq    %r8, %rax
    .p2align 4,,10
    .p2align 3
.L5:
.LBB4:
    movsd   (%rdx), %xmm1
    movsd   (%rsi,%rax,8), %xmm0
    movsd   %xmm1, (%rsi,%rax,8)
.LBE4:
    addq    $1, %rax
.LBB5:
    movsd   %xmm0, (%rdx)
    addq    %rcx, %rdx
.LBE5:
    cmpq    %rdi, %rax
    jne .L5
    addq    $1, %r8
    addq    %r10, %r9
    addq    %rcx, %rsi
    cmpq    %rdi, %r8
    jne .L10
.L1:
    rep ret
.LBE3:
.LBE2:
    .cfi_endproc

And something quite different if I add -m32.

(Note: it makes no difference to the assembly whether I use std::swap or your variant.)

In order to understand what is causing the spikes, though, you probably want to visualize the memory operations going on.
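
One low-tech way to do that, sketched here as an assumption rather than a recommendation of any particular tool, is to have the loop print the byte offsets it would touch and plot them externally:

#include <cstddef>
#include <cstdio>

// Instead of swapping, emit the two offsets each inner-loop iteration touches
// (row-wise and column-wise access), one pair per line, for plotting.
void trace_transpose(std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = i + 1; j < n; ++j)
            std::printf("%zu %zu\n",
                        (i * n + j) * sizeof(double),   // row-wise access
                        (j * n + i) * sizeof(double));  // column-wise access
}

int main() { trace_transpose(16); }  // a small n keeps the trace readable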

To add to the others: g++ -std=c++11 -march=core2 -O3 -c -S with gcc version 4.8.2 (MacPorts gcc48 4.8.2_0) on x86_64-apple-darwin13.0.0:

__Z9transposemPd:
LFB0:
        testq   %rdi, %rdi
        je      L1
        leaq    8(,%rdi,8), %r10
        xorl    %r8d, %r8d
        leaq    -8(%r10), %rcx
        addq    $1, %r8
        leaq    (%rsi,%rcx), %r9
        cmpq    %rdi, %r8
        je      L1
        .align 4,0x90
L10:
        movq    %r9, %rdx
        movq    %r8, %rax
        .align 4,0x90
L5:
        movsd   (%rdx), %xmm0
        movsd   (%rsi,%rax,8), %xmm1
        movsd   %xmm0, (%rsi,%rax,8)
        addq    $1, %rax
        movsd   %xmm1, (%rdx)
        addq    %rcx, %rdx
        cmpq    %rdi, %rax
        jne     L5
        addq    $1, %r8
        addq    %r10, %r9
        addq    %rcx, %rsi
        cmpq    %rdi, %r8
        jne     L10
L1:
        rep; ret

Basically the same as @ksfone's code, for:

#include <cstddef>

void transpose(const size_t _n, double* _A) {
    for(size_t i=0; i < _n; ++i) {
        for(size_t j=i+1; j < _n; ++j) {
            double tmp  = _A[i*_n+j];
            _A[i*_n+j] = _A[j*_n+i];
            _A[j*_n+i] = tmp;
        }
    }
}

Apart from the Mach-O 'as' differences (extra underscore, alignment, and DWARF locations), it's the same. But very different from the OP's assembly output. A much 'tighter' inner loop.
