C intrinsics, SSE2 dot product and gcc -O3 generated assembly

I need to write a dot product using SSE2 (no _mm_dp_ps nor _mm_hadd_ps):

#include <xmmintrin.h>

inline __m128 sse_dot4(__m128 a, __m128 b)
{
    const __m128 mult = _mm_mul_ps(a, b);
    const __m128 shuf1 = _mm_shuffle_ps(mult, mult, _MM_SHUFFLE(0, 3, 2, 1));
    const __m128 shuf2 = _mm_shuffle_ps(mult,mult, _MM_SHUFFLE(1, 0, 3, 2));
    const __m128 shuf3 = _mm_shuffle_ps(mult,mult, _MM_SHUFFLE(2, 1, 0, 3));

    return _mm_add_ss(_mm_add_ss(_mm_add_ss(mult, shuf1), shuf2), shuf3);
}

But when I looked at the assembly generated by gcc 4.9 (experimental) with -O3, I got:

    mulps   %xmm1, %xmm0
    movaps  %xmm0, %xmm3         //These lines
    movaps  %xmm0, %xmm2         //have no use
    movaps  %xmm0, %xmm1         //isn't it ?
    shufps  $57, %xmm0, %xmm3
    shufps  $78, %xmm0, %xmm2
    shufps  $147, %xmm0, %xmm1
    addss   %xmm3, %xmm0
    addss   %xmm2, %xmm0
    addss   %xmm1, %xmm0
    ret

I am wondering why gcc copies xmm0 into xmm1, xmm2 and xmm3... Here is the code I get when I add the flag -march=native (it looks better):

    vmulps  %xmm1, %xmm0, %xmm1
    vshufps $78, %xmm1, %xmm1, %xmm2
    vshufps $57, %xmm1, %xmm1, %xmm3
    vshufps $147, %xmm1, %xmm1, %xmm0
    vaddss  %xmm3, %xmm1, %xmm1
    vaddss  %xmm2, %xmm1, %xmm1
    vaddss  %xmm0, %xmm1, %xmm0
    ret

Here's a dot product using only original SSE instructions, which also broadcasts the result to every element:

inline __m128 sse_dot4(__m128 v0, __m128 v1)
{
    v0 = _mm_mul_ps(v0, v1);

    v1 = _mm_shuffle_ps(v0, v0, _MM_SHUFFLE(2, 3, 0, 1));
    v0 = _mm_add_ps(v0, v1);
    v1 = _mm_shuffle_ps(v0, v0, _MM_SHUFFLE(0, 1, 2, 3));
    v0 = _mm_add_ps(v0, v1);

    return v0;
}

It's 5 SIMD instructions (as opposed to 7), though with no real opportunity to hide latencies. Any element will hold the result, e.g., float f = _mm_cvtss_f32(sse_dot4(a, b));
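As a quick sanity check of the claim that every element holds the result, here is a minimal sketch that stores all four lanes and compares them with a scalar reference. The helpers scalar_dot4 and dot4_lanes are hypothetical names introduced only for this illustration, and `static` is added to sse_dot4 purely so the sketch builds as a single C file.

```c
#include <assert.h>
#include <xmmintrin.h>

/* Same shuffle/add reduction as the answer's sse_dot4; `static` is an
   addition for this sketch so it links cleanly as one C file. */
static inline __m128 sse_dot4(__m128 v0, __m128 v1)
{
    v0 = _mm_mul_ps(v0, v1);                               /* (m0, m1, m2, m3) */
    v1 = _mm_shuffle_ps(v0, v0, _MM_SHUFFLE(2, 3, 0, 1));  /* (m1, m0, m3, m2) */
    v0 = _mm_add_ps(v0, v1);                               /* pairwise sums */
    v1 = _mm_shuffle_ps(v0, v0, _MM_SHUFFLE(0, 1, 2, 3));  /* reversed lanes */
    v0 = _mm_add_ps(v0, v1);                               /* total in every lane */
    return v0;
}

/* Hypothetical helper: plain scalar dot product used as a reference. */
static float scalar_dot4(const float *a, const float *b)
{
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
}

/* Hypothetical helper: writes all four lanes of sse_dot4(a, b) to lanes[0..3].
   Unaligned loads/stores are used so callers need no special alignment. */
static void dot4_lanes(const float *a, const float *b, float *lanes)
{
    assert(a && b && lanes);
    _mm_storeu_ps(lanes, sse_dot4(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}
```

For a = (1, 2, 3, 4) and b = (5, 6, 7, 8), each of the four lanes should equal the scalar result, 70. For arbitrary inputs the SIMD and scalar sums may differ in the last bits because the additions happen in a different order.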

The haddps instruction has pretty awful latency. With SSE3:

#include <pmmintrin.h>  /* SSE3 intrinsics: _mm_hadd_ps */

inline __m128 sse_dot4(__m128 v0, __m128 v1)
{
    v0 = _mm_mul_ps(v0, v1);

    v0 = _mm_hadd_ps(v0, v0);
    v0 = _mm_hadd_ps(v0, v0);

    return v0;
}

This is possibly slower, though it's only 3 SIMD instructions. If you can do more than one dot product at a time, you could interleave instructions in the first case. Shuffle is very fast on more recent micro-architectures.
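To illustrate the interleaving idea, here is a rough sketch that computes two independent dot products with the shuffle/add version, alternating the steps of each so that one chain's latency can overlap with the other's. The function name sse_dot4_x2 and its output-pointer signature are assumptions made for this example; an optimizing compiler may well reorder the instructions on its own, so treat the source ordering as a hint rather than a guarantee.

```c
#include <assert.h>
#include <xmmintrin.h>

/* Hypothetical helper: two independent 4-component dot products.
   Each result is broadcast to all four lanes of *r0 / *r1. */
static inline void sse_dot4_x2(__m128 a0, __m128 b0,
                               __m128 a1, __m128 b1,
                               __m128 *r0, __m128 *r1)
{
    assert(r0 && r1);

    /* The two dependency chains (p0/s0 and p1/s1) never touch each
       other, so their instructions can execute back to back. */
    __m128 p0 = _mm_mul_ps(a0, b0);
    __m128 p1 = _mm_mul_ps(a1, b1);

    __m128 s0 = _mm_shuffle_ps(p0, p0, _MM_SHUFFLE(2, 3, 0, 1));
    __m128 s1 = _mm_shuffle_ps(p1, p1, _MM_SHUFFLE(2, 3, 0, 1));
    p0 = _mm_add_ps(p0, s0);
    p1 = _mm_add_ps(p1, s1);

    s0 = _mm_shuffle_ps(p0, p0, _MM_SHUFFLE(0, 1, 2, 3));
    s1 = _mm_shuffle_ps(p1, p1, _MM_SHUFFLE(0, 1, 2, 3));
    *r0 = _mm_add_ps(p0, s0);
    *r1 = _mm_add_ps(p1, s1);
}
```

A caller would read either result with _mm_cvtss_f32(*r0) or _mm_cvtss_f32(*r1), exactly as with the single-vector version.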

The first listing you pasted targets plain SSE only. Most SSE instructions support only the two-operand syntax: instructions are of the form a = a OP b.

In your code, a is mult. So if no copy is made and mult (xmm0 in your example) is passed directly, its value will be overwritten and then lost for the remaining _mm_shuffle_ps instructions.

By passing -march=native in the second listing, you enabled AVX instructions. AVX lets SSE instructions use the three-operand syntax: c = a OP b. In this case, none of the source operands has to be overwritten, so you do not need the additional copies.

Let me suggest that if you're going to use SIMD to compute dot products, you try to find a way to operate on multiple vectors at once. For example, with SSE, if you have four vectors and you want to take the dot product of each with a fixed vector, arrange the data as (xxxx), (yyyy), (zzzz), (wwww), multiply each SSE vector by the matching broadcast component of the fixed vector, and add the products to get the result of four dot products at once. That gets you 100% efficiency (a four-times speedup), and it is not limited to 4-component vectors: it is 100% efficient for n-component vectors as well. Here is an example which uses only SSE.

#include <xmmintrin.h>
#include <stdio.h>

void dot4x4(float *aosoa, float *b, float *out) {   
    __m128 vx = _mm_load_ps(&aosoa[0]);
    __m128 vy = _mm_load_ps(&aosoa[4]);
    __m128 vz = _mm_load_ps(&aosoa[8]);
    __m128 vw = _mm_load_ps(&aosoa[12]);
    __m128 brod1 = _mm_set1_ps(b[0]);
    __m128 brod2 = _mm_set1_ps(b[1]);
    __m128 brod3 = _mm_set1_ps(b[2]);
    __m128 brod4 = _mm_set1_ps(b[3]);
    __m128 dot4 = _mm_add_ps(
        _mm_add_ps(_mm_mul_ps(brod1, vx), _mm_mul_ps(brod2, vy)),
        _mm_add_ps(_mm_mul_ps(brod3, vz), _mm_mul_ps(brod4, vw)));
    _mm_store_ps(out, dot4);

}

int main() {
    float *aosoa = (float*)_mm_malloc(sizeof(float)*16, 16);
    /* initialize array to AoSoA vectors v1 = (0,1,2,3), v2 = (4,5,6,7), v3 = (8,9,10,11), v4 = (12,13,14,15) */
    float a[] = {
        0,4,8,12,
        1,5,9,13,
        2,6,10,14,
        3,7,11,15,
    };
    for (int i=0; i<16; i++) aosoa[i] = a[i];

    float *out = (float*)_mm_malloc(sizeof(float)*4, 16);
    float b[] = {1,1,1,1};
    dot4x4(aosoa, b, out);
    printf("%f %f %f %f\n", out[0], out[1], out[2], out[3]);

    _mm_free(aosoa);
    _mm_free(out);
}
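The remark about n-component vectors can be sketched as a loop over components. This is only an illustration under stated assumptions: the name dot4xn, its signature, and the layout (component k of the four vectors stored at soa[4*k] through soa[4*k+3]) are choices made for this example, and unaligned loads are used so that the caller does not need _mm_malloc.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>

/* Hypothetical helper: four n-component dot products against one fixed
   vector b. soa[4*k .. 4*k+3] holds component k of the four vectors,
   b[k] is component k of the fixed vector, out receives 4 results. */
static void dot4xn(const float *soa, const float *b, size_t n, float *out)
{
    assert(soa && b && out);

    __m128 acc = _mm_setzero_ps();
    for (size_t k = 0; k < n; k++) {
        __m128 comp = _mm_loadu_ps(&soa[4 * k]);   /* component k of all 4 vectors */
        __m128 bk   = _mm_set1_ps(b[k]);           /* broadcast b[k] to every lane */
        acc = _mm_add_ps(acc, _mm_mul_ps(comp, bk));
    }
    _mm_storeu_ps(out, acc);                       /* lane i = dot(vector i, b) */
}
```

With the same 16 values as in the example above and b = {1, 1, 1, 1}, calling dot4xn with n = 4 yields the four component sums 6, 22, 38 and 54. There is no horizontal reduction anywhere, which is where the efficiency comes from.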

(In fact, and despite all the up-votes, the answers that were given at the time this question was posted did not fulfill the expectations I had. Here is the answer I was waiting for.)

The SSE instruction

shufps $IMM, xmmA, xmmB

does not work as

xmmB = f($IMM, xmmA) 
//set xmmB with xmmA's words shuffled according to $IMM

but as

xmmB = f($IMM, xmmA, xmmB) 
//set xmmB with 2 words of xmmA and 2 words of xmmB according to $IMM

This is why the copy of the mulps result from xmm0 to xmm1..3 is needed.
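The two-source behaviour is visible from C as well. In this small sketch (the helper name shuffle_two_sources is made up for illustration), _mm_shuffle_ps takes the low two result lanes from its first argument and the high two from its second. When both arguments are the same value, as in the question, the destination register still has to start out as a copy of that value in the two-operand SSE encoding.

```c
#include <assert.h>
#include <xmmintrin.h>

/* Hypothetical helper: out = _mm_shuffle_ps(a, b, _MM_SHUFFLE(3, 2, 1, 0)).
   With this immediate the result is (a[0], a[1], b[2], b[3]):
   lanes 0-1 are selected from the first source, lanes 2-3 from the second. */
static void shuffle_two_sources(const float *a, const float *b, float *out)
{
    assert(a && b && out);

    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    __m128 r  = _mm_shuffle_ps(va, vb, _MM_SHUFFLE(3, 2, 1, 0));
    _mm_storeu_ps(out, r);
}
```

For a = (0, 1, 2, 3) and b = (10, 11, 12, 13), the result is (0, 1, 12, 13).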
