Why is my SSE code slower than native C++ code?

Question

First of all, I am new to SSE. I decided to accelerate my code, but it seems to run slower than my native code.

This is an example that calculates the sum of squares. On my Intel i7-6700HQ, it takes 0.43 s for the native code and 0.52 s for the SSE code. So, where is the bottleneck?

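#include <chrono>
#include <iostream>
#include <xmmintrin.h>  // SSE intrinsics: __m128, _mm_load_ps, _mm_mul_ps, ...

using namespace std::chrono;

inline float squared_sum(const float x, const float y)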
{
    return x * x + y * y;
}

#define USE_SIMD

void calculations()
{
    high_resolution_clock::time_point t1, t2;

    int result_v = 0;

    t1 = high_resolution_clock::now();

    alignas(16) float data_x[4];
    alignas(16) float data_y[4];
    alignas(16) float result[4];
    __m128 v_x, v_y, v_res;
    for (int y = 0; y < 5120; y++)
    {
        data_y[0] = y;
        data_y[1] = y + 1;
        data_y[2] = y + 2;
        data_y[3] = y + 3;
        for (int x = 0; x < 5120; x++)
        {
            data_x[0] = x;
            data_x[1] = x + 1;
            data_x[2] = x + 2;
            data_x[3] = x + 3;
#ifdef USE_SIMD
            v_x = _mm_load_ps(data_x);
            v_y = _mm_load_ps(data_y);
            v_x = _mm_mul_ps(v_x, v_x);
            v_y = _mm_mul_ps(v_y, v_y);
            v_res = _mm_add_ps(v_x, v_y);
            _mm_store_ps(result, v_res);
#else
            result[0] = squared_sum(data_x[0], data_y[0]);
            result[1] = squared_sum(data_x[1], data_y[1]);
            result[2] = squared_sum(data_x[2], data_y[2]);
            result[3] = squared_sum(data_x[3], data_y[3]);
#endif

            result_v += (int)(result[0] + result[1] + result[2] + result[3]);
        }
    }

    t2 = high_resolution_clock::now();
    duration<double> time_span1 = duration_cast<duration<double>>(t2 - t1);
    std::cout << "Exec time:\t" << time_span1.count() << " s\n";
}

UPDATE: fixed the code according to the comments.

I am using Visual Studio 2017, compiling for x64.

Optimization: Maximum Optimization (Favor Speed) (/O2);
Inline Function Expansion: Any Suitable (/Ob2);
Favor Size or Speed: Favor fast code (/Ot);
Omit Frame Pointers: Yes (/Oy)

Conclusion

Compilers already generate optimized code, so nowadays it is hard to accelerate it further by hand. The one thing you can still do to speed the code up is parallelization.

Thanks for the answers. They are mainly the same, so I accept Søren V. Poulsen's answer because it was the first.

Answer 1

Modern compilers are incredible machines and will already use SIMD instructions where possible (given the correct compilation flags).
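As a rough illustration of that point, here is a minimal sketch of a plain scalar loop that optimizing compilers can usually vectorize on their own. The function name squared_sums and its parameters are hypothetical, not taken from the question, and whether the loop is actually vectorized depends on the compiler, its version, and the flags, so it is worth confirming in the generated assembly.

```cpp
#include <cstddef>

// Plain scalar loop: out[i] = x[i]^2 + y[i]^2.
// There is no dependency between iterations, which is what gives the
// compiler the freedom to process several elements per instruction.
// (Hypothetical helper for illustration only.)
void squared_sums(const float* x, const float* y, float* out, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
    {
        out[i] = x[i] * x[i] + y[i] * y[i];
    }
}
```

Writing the computation over whole arrays like this, instead of packing four values by hand on every iteration, leaves the choice of load, convert, and multiply instructions to the compiler.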

One general strategy for determining what the compiler is doing is to look at the disassembly of your code. If you don't want to do it on your own machine, you can use an online service like Godbolt: https://gcc.godbolt.org/z/T6GooQ .

One tip is to avoid using atomic to store intermediate results, as you are doing here. Atomic values are used to ensure synchronization between threads, and this may come at a very high computational cost, relatively speaking.
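A minimal sketch of that tip, assuming the intermediate result is a running integer sum: the first function updates a std::atomic on every iteration, while the second accumulates in a plain local variable and touches the atomic only once at the end. Both function names are hypothetical and the loop body is a stand-in, not the question's code.

```cpp
#include <atomic>
#include <cstdint>

// Every += here is a synchronized read-modify-write on the atomic,
// which also prevents the compiler from keeping the sum in a register.
std::int64_t sum_squares_atomic(int n)
{
    std::atomic<std::int64_t> total{0};
    for (int i = 0; i < n; ++i)
    {
        total += static_cast<std::int64_t>(i) * i;
    }
    return total.load();
}

// Accumulate in an ordinary local variable, then publish the final
// value to the shared atomic with a single operation.
void publish_sum_squares(int n, std::atomic<std::int64_t>& shared)
{
    std::int64_t local = 0;
    for (int i = 0; i < n; ++i)
    {
        local += static_cast<std::int64_t>(i) * i;
    }
    shared.fetch_add(local, std::memory_order_relaxed);
}
```

The second form leaves the hot loop free of synchronization, so the compiler can optimize it like any other scalar loop.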

Answer 2

Looking through the assembly that the compiler generates for the plain code (without your SIMD intrinsics):

calculations():
        pxor    xmm2, xmm2
        xor     edx, edx
        movdqa  xmm0, XMMWORD PTR .LC0[rip]
        movdqa  xmm11, XMMWORD PTR .LC1[rip]
        movdqa  xmm9, XMMWORD PTR .LC2[rip]
        movdqa  xmm8, XMMWORD PTR .LC3[rip]
        movdqa  xmm7, XMMWORD PTR .LC4[rip]
.L4:
        movdqa  xmm5, xmm0
        movdqa  xmm4, xmm0
        cvtdq2ps        xmm6, xmm0
        movdqa  xmm10, xmm0
        paddd   xmm0, xmm7
        cvtdq2ps        xmm3, xmm0
        paddd   xmm5, xmm9
        paddd   xmm4, xmm8
        cvtdq2ps        xmm5, xmm5
        cvtdq2ps        xmm4, xmm4
        mulps   xmm6, xmm6
        mov     eax, 5120
        paddd   xmm10, xmm11
        mulps   xmm5, xmm5
        mulps   xmm4, xmm4
        mulps   xmm3, xmm3
        pxor    xmm12, xmm12
.L2:
        movdqa  xmm1, xmm12
        cvtdq2ps        xmm14, xmm12
        mulps   xmm14, xmm14
        movdqa  xmm13, xmm12
        paddd   xmm12, xmm7
        cvtdq2ps        xmm12, xmm12
        paddd   xmm1, xmm9
        cvtdq2ps        xmm0, xmm1
        mulps   xmm0, xmm0
        paddd   xmm13, xmm8
        cvtdq2ps        xmm13, xmm13
        sub     eax, 1
        mulps   xmm13, xmm13
        addps   xmm14, xmm6
        mulps   xmm12, xmm12
        addps   xmm0, xmm5
        addps   xmm13, xmm4
        addps   xmm12, xmm3
        addps   xmm0, xmm14
        addps   xmm0, xmm13
        addps   xmm0, xmm12
        movdqa  xmm12, xmm1
        cvttps2dq       xmm0, xmm0
        paddd   xmm2, xmm0
        jne     .L2
        add     edx, 1
        movdqa  xmm0, xmm10
        cmp     edx, 1280
        jne     .L4
        movdqa  xmm0, xmm2
        psrldq  xmm0, 8
        paddd   xmm2, xmm0
        movdqa  xmm0, xmm2
        psrldq  xmm0, 4
        paddd   xmm2, xmm0
        movd    eax, xmm2
        ret
main:
        xor     eax, eax
        ret
_GLOBAL__sub_I_calculations():
        sub     rsp, 8
        mov     edi, OFFSET FLAT:_ZStL8__ioinit
        call    std::ios_base::Init::Init() [complete object constructor]
        mov     edx, OFFSET FLAT:__dso_handle
        mov     esi, OFFSET FLAT:_ZStL8__ioinit
        mov     edi, OFFSET FLAT:_ZNSt8ios_base4InitD1Ev
        add     rsp, 8
        jmp     __cxa_atexit
.LC0:
        .long   0
        .long   1
        .long   2
        .long   3
.LC1:
        .long   4
        .long   4
        .long   4
        .long   4
.LC2:
        .long   1
        .long   1
        .long   1
        .long   1
.LC3:
        .long   2
        .long   2
        .long   2
        .long   2
.LC4:
        .long   3
        .long   3
        .long   3
        .long   3

Your SIMD code generates:

calculations():
        pxor    xmm5, xmm5
        xor     eax, eax
        mov     r8d, 1
        movabs  rdi, -4294967296
        cvtsi2ss        xmm5, eax
.L4:
        mov     r9d, r8d
        mov     esi, 1
        movd    edx, xmm5
        pxor    xmm5, xmm5
        pxor    xmm4, xmm4
        mov     ecx, edx
        mov     rdx, QWORD PTR [rsp-24]
        cvtsi2ss        xmm5, r8d
        add     r8d, 1
        cvtsi2ss        xmm4, r8d
        and     rdx, rdi
        or      rdx, rcx
        pxor    xmm2, xmm2
        mov     edx, edx
        movd    ecx, xmm5
        sal     rcx, 32
        or      rdx, rcx
        mov     QWORD PTR [rsp-24], rdx
        movd    edx, xmm4
        pxor    xmm4, xmm4
        mov     ecx, edx
        mov     rdx, QWORD PTR [rsp-16]
        and     rdx, rdi
        or      rdx, rcx
        lea     ecx, [r9+2]
        mov     edx, edx
        cvtsi2ss        xmm4, ecx
        movd    ecx, xmm4
        sal     rcx, 32
        or      rdx, rcx
        mov     QWORD PTR [rsp-16], rdx
        movaps  xmm4, XMMWORD PTR [rsp-24]
        mulps   xmm4, xmm4
.L2:
        movd    edx, xmm2
        mov     r10d, esi
        pxor    xmm2, xmm2
        pxor    xmm7, xmm7
        mov     ecx, edx
        mov     rdx, QWORD PTR [rsp-40]
        cvtsi2ss        xmm2, esi
        add     esi, 1
        and     rdx, rdi
        cvtsi2ss        xmm7, esi
        or      rdx, rcx
        mov     ecx, edx
        movd    r11d, xmm2
        movd    edx, xmm7
        sal     r11, 32
        or      rcx, r11
        pxor    xmm7, xmm7
        mov     QWORD PTR [rsp-40], rcx
        mov     ecx, edx
        mov     rdx, QWORD PTR [rsp-32]
        and     rdx, rdi
        or      rdx, rcx
        lea     ecx, [r10+2]
        mov     edx, edx
        cvtsi2ss        xmm7, ecx
        movd    ecx, xmm7
        sal     rcx, 32
        or      rdx, rcx
        mov     QWORD PTR [rsp-32], rdx
        movaps  xmm0, XMMWORD PTR [rsp-40]
        mulps   xmm0, xmm0
        addps   xmm0, xmm4
        movaps  xmm3, xmm0
        movaps  xmm1, xmm0
        shufps  xmm3, xmm0, 85
        addss   xmm1, xmm3
        movaps  xmm3, xmm0
        unpckhps        xmm3, xmm0
        shufps  xmm0, xmm0, 255
        addss   xmm1, xmm3
        addss   xmm0, xmm1
        cvttss2si       edx, xmm0
        add     eax, edx
        cmp     r10d, 5120
        jne     .L2
        cmp     r9d, 5120
        jne     .L4
        rep ret
main:
        xor     eax, eax
        ret
_GLOBAL__sub_I_calculations():
        sub     rsp, 8
        mov     edi, OFFSET FLAT:_ZStL8__ioinit
        call    std::ios_base::Init::Init() [complete object constructor]
        mov     edx, OFFSET FLAT:__dso_handle
        mov     esi, OFFSET FLAT:_ZStL8__ioinit
        mov     edi, OFFSET FLAT:_ZNSt8ios_base4InitD1Ev
        add     rsp, 8
        jmp     __cxa_atexit

Note that the compiler's version is using cvtdq2ps, paddd, mulps, addps, and cvttps2dq. All of these are SIMD instructions. By combining them effectively, the compiler generates fast code.

In contrast, your code generates a lot of add, and, cvtsi2ss, lea, mov, movd, or, pxor, and sal instructions. Most of these do scalar work, converting the values one at a time and packing them into memory before each vector load, rather than packed SIMD arithmetic.

I suspect the compiler does a better job of dealing with data type conversion and data rearrangement than you do, and that this allows it to arrange its math more effectively.
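To make the point about conversion and rearrangement concrete, here is a minimal sketch of an intrinsics version that keeps the lane values in vector registers and loosely follows the instruction mix in the compiler's output (paddd, cvtdq2ps, mulps, addps, cvttps2dq). The names sum_squares_scalar and sum_squares_sse and the grid-size parameter n are hypothetical. The sketch truncates each lane to int separately, which agrees with the question's sum-then-truncate only while every value is exactly representable in float, so the grid is restricted to small sizes here. Treat it as an illustration, not a drop-in replacement or a benchmark.

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h>  // SSE2: integer and float vector intrinsics

// Scalar reference with the same arithmetic as the question, on an n x n grid.
std::int64_t sum_squares_scalar(int n)
{
    std::int64_t total = 0;
    for (int y = 0; y < n; ++y)
    {
        for (int x = 0; x < n; ++x)
        {
            float lane_sum = 0.0f;
            for (int k = 0; k < 4; ++k)
            {
                const float fx = static_cast<float>(x + k);
                const float fy = static_cast<float>(y + k);
                lane_sum += fx * fx + fy * fy;
            }
            total += static_cast<int>(lane_sum);
        }
    }
    return total;
}

// Intrinsics version: the lanes {x, x+1, x+2, x+3} live in a register and are
// advanced with one packed add, so nothing is written to memory in the loop.
std::int64_t sum_squares_sse(int n)
{
    // Assumption: small grids only, so the 32-bit lane sums cannot overflow
    // and every float value stays exact.
    assert(n >= 0 && n <= 128);

    const __m128i step = _mm_set1_epi32(1);
    __m128i acc = _mm_setzero_si128();        // four 32-bit partial sums
    __m128i iy = _mm_setr_epi32(0, 1, 2, 3);  // {y, y+1, y+2, y+3}
    for (int y = 0; y < n; ++y)
    {
        const __m128 fy = _mm_cvtepi32_ps(iy);
        const __m128 fy2 = _mm_mul_ps(fy, fy);    // computed once per row
        __m128i ix = _mm_setr_epi32(0, 1, 2, 3);  // {x, x+1, x+2, x+3}
        for (int x = 0; x < n; ++x)
        {
            const __m128 fx = _mm_cvtepi32_ps(ix);
            const __m128 r = _mm_add_ps(_mm_mul_ps(fx, fx), fy2);
            acc = _mm_add_epi32(acc, _mm_cvttps_epi32(r));
            ix = _mm_add_epi32(ix, step);
        }
        iy = _mm_add_epi32(iy, step);
    }

    // Horizontal sum of the four lanes, done once after the loops.
    alignas(16) std::int32_t lanes[4];
    _mm_store_si128(reinterpret_cast<__m128i*>(lanes), acc);
    return std::int64_t{lanes[0]} + lanes[1] + lanes[2] + lanes[3];
}
```

The main design choice is to move both the int-to-float conversion and the horizontal reduction out of the per-element path: conversion happens on whole vectors, and the lanes are only combined after the loops finish.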

2 answers

Answer 1 (score 3, accepted): 2019-02-07 20:41:02

Answer 2 (score 2): 2019-02-07 21:20:06
