C intrinsics, SSE2 dot product and the assembly generated by gcc -O3

Question

I need to write a dot product using SSE2 (neither _mm_dp_ps nor _mm_hadd_ps is available):

#include <xmmintrin.h>

inline __m128 sse_dot4(__m128 a, __m128 b)
{
    const __m128 mult = _mm_mul_ps(a, b);
    const __m128 shuf1 = _mm_shuffle_ps(mult, mult, _MM_SHUFFLE(0, 3, 2, 1));
    const __m128 shuf2 = _mm_shuffle_ps(mult,mult, _MM_SHUFFLE(1, 0, 3, 2));
    const __m128 shuf3 = _mm_shuffle_ps(mult,mult, _MM_SHUFFLE(2, 1, 0, 3));

    return _mm_add_ss(_mm_add_ss(_mm_add_ss(mult, shuf1), shuf2), shuf3);
}

But when I looked at the assembly generated by gcc 4.9 (experimental) with -O3, I got:

    mulps   %xmm1, %xmm0
    movaps  %xmm0, %xmm3         // are these three
    movaps  %xmm0, %xmm2         // copies really
    movaps  %xmm0, %xmm1         // necessary?
    shufps  $57, %xmm0, %xmm3
    shufps  $78, %xmm0, %xmm2
    shufps  $147, %xmm0, %xmm1
    addss   %xmm3, %xmm0
    addss   %xmm2, %xmm0
    addss   %xmm1, %xmm0
    ret

I am wondering why gcc copies xmm0 into xmm1, xmm2 and xmm3. Here is the code I get when I add the flag -march=native (it looks better):

    vmulps  %xmm1, %xmm0, %xmm1
    vshufps $78, %xmm1, %xmm1, %xmm2
    vshufps $57, %xmm1, %xmm1, %xmm3
    vshufps $147, %xmm1, %xmm1, %xmm0
    vaddss  %xmm3, %xmm1, %xmm1
    vaddss  %xmm2, %xmm1, %xmm1
    vaddss  %xmm0, %xmm1, %xmm0
    ret

Answer 1

Here's a dot product that uses only original SSE instructions and also broadcasts the result to every element:

inline __m128 sse_dot4(__m128 v0, __m128 v1)
{
    v0 = _mm_mul_ps(v0, v1);

    v1 = _mm_shuffle_ps(v0, v0, _MM_SHUFFLE(2, 3, 0, 1));
    v0 = _mm_add_ps(v0, v1);
    v1 = _mm_shuffle_ps(v0, v0, _MM_SHUFFLE(0, 1, 2, 3));
    v0 = _mm_add_ps(v0, v1);

    return v0;
}

It's 5 SIMD instructions (as opposed to 7), though with no real opportunity to hide latencies. Any element will hold the result, e.g. float f = _mm_cvtss_f32(sse_dot4(a, b));
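To make the "any element will hold the result" point concrete, here is a minimal sketch of the same reduction with the intermediate lane contents written out as comments. The names dot4_all_lanes and dot4_scalar are hypothetical, and the wrapper assumes plain (possibly unaligned) float[4] inputs.

```c
#include <xmmintrin.h>

/* Same shuffle/add reduction as above. With m = a*b lane by lane, every
   lane of the return value ends up holding m0 + m1 + m2 + m3. */
static inline __m128 dot4_all_lanes(__m128 v0, __m128 v1)
{
    v0 = _mm_mul_ps(v0, v1);                               /* (m0, m1, m2, m3) */
    v1 = _mm_shuffle_ps(v0, v0, _MM_SHUFFLE(2, 3, 0, 1));  /* (m1, m0, m3, m2) */
    v0 = _mm_add_ps(v0, v1);                               /* (m0+m1, m0+m1, m2+m3, m2+m3) */
    v1 = _mm_shuffle_ps(v0, v0, _MM_SHUFFLE(0, 1, 2, 3));  /* lanes reversed */
    return _mm_add_ps(v0, v1);                             /* full sum in all four lanes */
}

/* Hypothetical convenience wrapper: loads two unaligned float[4] arrays
   and returns the dot product as a plain float taken from the low lane. */
static float dot4_scalar(const float *a, const float *b)
{
    return _mm_cvtss_f32(dot4_all_lanes(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}
```

For a = (1, 2, 3, 4) and b = (5, 6, 7, 8) the products are 5, 12, 21 and 32, so each lane holds 70.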

The haddps instruction has pretty awful latency. With SSE3:

#include <pmmintrin.h>  /* SSE3: _mm_hadd_ps */

inline __m128 sse_dot4(__m128 v0, __m128 v1)
{
    v0 = _mm_mul_ps(v0, v1);

    v0 = _mm_hadd_ps(v0, v0);
    v0 = _mm_hadd_ps(v0, v0);

    return v0;
}

This is possibly slower, though it's only 3 SIMD instructions. If you can do more than one dot product at a time, you could interleave the instructions of the first version. Shuffle is very fast on more recent micro-architectures.
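As a rough sketch of what interleaving could look like, the function below runs two independent shuffle/add chains side by side. The name sse_dot4_x2 and the output-pointer interface are assumptions made for illustration. The compiler is free to reorder the instructions anyway, so the source order is only a hint; the point is that the two chains have no data dependency on each other.

```c
#include <xmmintrin.h>

/* Two independent 4-wide dot products. Each step of chain 0 is followed by
   the same step of chain 1, so one chain's latency can overlap the other's.
   Every lane of *r0 holds dot(a0, b0); every lane of *r1 holds dot(a1, b1). */
static inline void sse_dot4_x2(__m128 a0, __m128 b0,
                               __m128 a1, __m128 b1,
                               __m128 *r0, __m128 *r1)
{
    __m128 m0 = _mm_mul_ps(a0, b0);
    __m128 m1 = _mm_mul_ps(a1, b1);

    __m128 s0 = _mm_shuffle_ps(m0, m0, _MM_SHUFFLE(2, 3, 0, 1));
    __m128 s1 = _mm_shuffle_ps(m1, m1, _MM_SHUFFLE(2, 3, 0, 1));
    m0 = _mm_add_ps(m0, s0);
    m1 = _mm_add_ps(m1, s1);

    s0 = _mm_shuffle_ps(m0, m0, _MM_SHUFFLE(0, 1, 2, 3));
    s1 = _mm_shuffle_ps(m1, m1, _MM_SHUFFLE(0, 1, 2, 3));
    *r0 = _mm_add_ps(m0, s0);
    *r1 = _mm_add_ps(m1, s1);
}
```

Whether this actually helps depends on the target micro-architecture and on what the surrounding code does, so it is worth measuring rather than assuming.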

Answer 2

The first listing you pasted targets plain SSE only. Most SSE instructions support only the two-operand form: a = a OP b.

In your code, a is mult. If mult (xmm0 in your example) were passed directly, without a copy, its value would be overwritten by the first shuffle and lost for the remaining _mm_shuffle_ps instructions.

By passing -march=native for the second listing, you enabled AVX instructions (your CPU evidently supports them). AVX lets SSE instructions use the three-operand form: c = a OP b. In this case neither source operand has to be overwritten, so the additional copies are not needed.
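If you want to reproduce the two listings yourself, a sketch like the following may help. The function name sse_dot4_noinline and the file name dot.c are hypothetical; -O3, -S and -mavx are standard gcc options. The function is deliberately not inline so that it appears as its own symbol in the assembly output, and the exact registers chosen will vary between compiler versions.

```c
#include <xmmintrin.h>

/* Build twice and compare the output (dot.c is a hypothetical file name):
 *
 *   gcc -O3 -S dot.c -o dot_sse.s         two-operand SSE encoding:
 *                                         expect movaps copies before shufps
 *   gcc -O3 -mavx -S dot.c -o dot_avx.s   three-operand VEX encoding:
 *                                         expect vshufps with no copies
 */
__m128 sse_dot4_noinline(__m128 a, __m128 b)
{
    const __m128 mult  = _mm_mul_ps(a, b);
    /* mult is read by all three shuffles, so it must stay alive. */
    const __m128 shuf1 = _mm_shuffle_ps(mult, mult, _MM_SHUFFLE(0, 3, 2, 1));
    const __m128 shuf2 = _mm_shuffle_ps(mult, mult, _MM_SHUFFLE(1, 0, 3, 2));
    const __m128 shuf3 = _mm_shuffle_ps(mult, mult, _MM_SHUFFLE(2, 1, 0, 3));

    /* Only the low lane of the result holds the dot product. */
    return _mm_add_ss(_mm_add_ss(_mm_add_ss(mult, shuf1), shuf2), shuf3);
}
```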

Answer 3

Let me suggest that if you're going to use SIMD to do a dot product, you try to find a way to operate on multiple vectors at once. For example, with SSE, if you have four vectors and you want to take the dot product of each with a fixed vector, you arrange the data like (xxxx), (yyyy), (zzzz), (wwww), multiply each SSE vector by the matching broadcast component of the fixed vector, add the products, and get the result of four dot products at once. That will get you to 100% (four times speedup) efficiency, and it's not limited to 4-component vectors: it is 100% efficient for n-component vectors as well. Here is an example which only uses SSE.

#include <xmmintrin.h>
#include <stdio.h>

void dot4x4(float *aosoa, float *b, float *out) {   
    __m128 vx = _mm_load_ps(&aosoa[0]);
    __m128 vy = _mm_load_ps(&aosoa[4]);
    __m128 vz = _mm_load_ps(&aosoa[8]);
    __m128 vw = _mm_load_ps(&aosoa[12]);
    __m128 brod1 = _mm_set1_ps(b[0]);
    __m128 brod2 = _mm_set1_ps(b[1]);
    __m128 brod3 = _mm_set1_ps(b[2]);
    __m128 brod4 = _mm_set1_ps(b[3]);
    __m128 dot4 = _mm_add_ps(
        _mm_add_ps(_mm_mul_ps(brod1, vx), _mm_mul_ps(brod2, vy)),
        _mm_add_ps(_mm_mul_ps(brod3, vz), _mm_mul_ps(brod4, vw)));
    _mm_store_ps(out, dot4);

}

int main() {
    float *aosoa = (float*)_mm_malloc(sizeof(float)*16, 16);
    /* initialize array to AoSoA layout for vectors v1 = (0,1,2,3), v2 = (4,5,6,7), v3 = (8,9,10,11), v4 = (12,13,14,15) */
    float a[] = {
        0,4,8,12,
        1,5,9,13,
        2,6,10,14,
        3,7,11,15,
    };
    for (int i=0; i<16; i++) aosoa[i] = a[i];

    float *out = (float*)_mm_malloc(sizeof(float)*4, 16);
    float b[] = {1,1,1,1};
    dot4x4(aosoa, b, out);
    printf("%f %f %f %f\n", out[0], out[1], out[2], out[3]);

    _mm_free(aosoa);
    _mm_free(out);
}
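The n-component case mentioned above could be sketched as a simple loop over rows. The name dot4xn and its signature are assumptions made for illustration, and unaligned loads and stores are used so that ordinary arrays work without _mm_malloc.

```c
#include <stddef.h>
#include <xmmintrin.h>

/* soa holds n rows of 4 floats: row k contains component k of four
   different vectors. b holds the n components of the fixed vector.
   out receives the four dot products, one per input vector. */
static void dot4xn(const float *soa, const float *b, size_t n, float *out)
{
    __m128 acc = _mm_setzero_ps();
    for (size_t k = 0; k < n; k++) {
        __m128 row = _mm_loadu_ps(&soa[4 * k]);      /* component k of 4 vectors */
        __m128 bk  = _mm_set1_ps(b[k]);              /* broadcast b[k] to all lanes */
        acc = _mm_add_ps(acc, _mm_mul_ps(bk, row));  /* accumulate 4 partial sums */
    }
    _mm_storeu_ps(out, acc);
}
```

With the same 16 floats as in main above, b = (1, 1, 1, 1) and n = 4, this gives 6, 22, 38 and 54, the component sums of the four vectors.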

Answer 4

(In fact, and despite all the up-votes, the answers given at the time this question was posted did not meet my expectations. Here is the answer I was waiting for.)

The SSE instruction

shufps $IMM, xmmA, xmmB

does not work as

xmmB = f($IMM, xmmA) 
//set xmmB with xmmA's words shuffled according to $IMM

but as

xmmB = f($IMM, xmmA, xmmB) 
//set xmmB with 2 words of xmmA and 2 words of xmmB according to $IMM

This is why the copies of the mulps result from xmm0 into xmm1..3 are needed.
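To spell out f($IMM, xmmA, xmmB), here is a scalar model of the instruction, written as a sketch. The names shufps_model and model_matches_intrinsic are hypothetical. The lane selection follows the documented behaviour of shufps: the two low result lanes come from the old destination and the two high lanes from the source.

```c
#include <xmmintrin.h>

/* Scalar model of `shufps $imm, src, dst` (AT&T operand order).
   Lanes 0 and 1 of the result are picked from the OLD contents of dst,
   lanes 2 and 3 from src; dst is then overwritten with the result. */
static void shufps_model(unsigned imm, const float src[4], float dst[4])
{
    const float old[4] = { dst[0], dst[1], dst[2], dst[3] };
    dst[0] = old[imm & 3];
    dst[1] = old[(imm >> 2) & 3];
    dst[2] = src[(imm >> 4) & 3];
    dst[3] = src[(imm >> 6) & 3];
}

/* Cross-check against the intrinsic for imm = 57, the first immediate in
   the listing. _mm_shuffle_ps(a, b, imm) reads lanes 0-1 from a (the dst
   role) and lanes 2-3 from b (the src role). Returns 1 if both agree. */
static int model_matches_intrinsic(const float a[4], const float b[4])
{
    float got[4];
    float want[4] = { a[0], a[1], a[2], a[3] };

    _mm_storeu_ps(got, _mm_shuffle_ps(_mm_loadu_ps(a), _mm_loadu_ps(b), 57));
    shufps_model(57, b, want);

    return got[0] == want[0] && got[1] == want[1] &&
           got[2] == want[2] && got[3] == want[3];
}
```

Because dst is both an input and the output, producing three different shuffles of the same mulps result with the two-operand encoding requires the result to be present in three separate registers first.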

Answer 1: score 5, 2013-06-08 22:44:53

Answer 2: score 4, 2013-06-08 17:50:12

Answer 3: score 4

Answer 4: score 1, accepted, 2014-09-08 14:52:30
