SSE much slower than regular function

I am making a Julia set visualisation using SSE. Here is my code, the class and its operators:

#include <xmmintrin.h> // SSE intrinsics (__m128, _mm_* functions)

class vec4 {
    public:
        inline vec4(void) {}
        inline vec4(__m128 val) :v(val) {}

        __m128 v;

        inline void operator=(float *a) {v=_mm_load_ps(a);}
        inline vec4(float *a) {(*this)=a;} 
        inline vec4(float a) {(*this)=a;}

        inline void operator=(float a) {v=_mm_load1_ps(&a);}

};

inline vec4 operator+(const vec4 &a,const vec4 &b) { return _mm_add_ps(a.v,b.v); }
inline vec4 operator-(const vec4 &a,const vec4 &b) { return _mm_sub_ps(a.v,b.v); }
inline vec4 operator*(const vec4 &a,const vec4 &b) { return _mm_mul_ps(a.v,b.v); }
inline vec4 operator/(const vec4 &a,const vec4 &b) { return _mm_div_ps(a.v,b.v); }
inline vec4 operator++(const vec4 &a)
{
    __declspec(align(16)) float b[4]={1.0f,1.0f,1.0f,1.0f};
    vec4 B(b);
    return _mm_add_ps(a.v,B.v); 
}

The function itself:

vec4 TWO(2.0f);
vec4 FOUR(4.0f);
vec4 ZER(0.0f);

vec4 CR(cR);
vec4 CI(cI);

for (int i=0; i<320; i++) //H
{
    float *pr = (float*) _aligned_malloc(4 * sizeof(float), 16); //dynamic

    __declspec(align(16)) float pi=i*ratioY + startY;

    for (int j=0; j<420; j+=4) //W
    {

        pr[0]=j*ratioX + startX;
        for(int x=1;x<4;x++)
        {
            pr[x]=pr[x-1]+ratioX;
        }

        vec4 ZR(pr);
        vec4 ZI(pi);

        __declspec(align(16)) float color[4]={0.0f,0.0f,0.0f,0.0f};

        vec4 COLOR(color);
        vec4 COUNT(0.0f);

        __m128 MASK=ZER.v;

        int _count;
        enum {max_count=100};
        for (_count=0;_count<=max_count;_count++) 
        {

            vec4 tZR=ZR*ZR-ZI*ZI+CR;
            vec4 tZI=TWO*ZR*ZI+CI;
            vec4 LEN=tZR*tZR+tZI*tZI;

            __m128 MASKOLD=MASK;
            MASK=_mm_cmplt_ps(LEN.v,FOUR.v);

            ZR=_mm_or_ps(_mm_and_ps(MASK,tZR.v),_mm_andnot_ps(MASK,ZR.v));
            ZI=_mm_or_ps(_mm_and_ps(MASK,tZI.v),_mm_andnot_ps(MASK,ZI.v));

            __m128 CHECKNOTEQL=_mm_cmpneq_ps(MASK,MASKOLD);    
            COLOR=_mm_or_ps(_mm_and_ps(CHECKNOTEQL,COUNT.v),_mm_andnot_ps(CHECKNOTEQL,COLOR.v));

            COUNT=COUNT++;
            operations+=17;

            if (_mm_movemask_ps((LEN-FOUR).v)==0) break; 
        }
        _mm_store_ps(color,COLOR.v);

The SSE version needs 553k operations (mul, add, if) and takes ~320ms to finish the task; on the other hand the regular function takes 1428k operations but needs only ~90ms to compute. I used the VS2010 performance analyser and it seems that all the maths operators are running really slowly. What am I doing wrong?

The problem you are having is that the SSE intrinsics are doing far more memory operations than the non-SSE version. Using your vector class I wrote this:

#include <iostream>
using namespace std;

int main (int argc, char *argv [])
{
  vec4 a (static_cast <float> (argc));
  cout << "argc = " << argc << endl;
  a=++a;
  cout << "a = (" << a.v.m128_f32 [0] << ", " << a.v.m128_f32 [1] << ", " << a.v.m128_f32 [2] << ", " << a.v.m128_f32 [3] << ", " << ")" << endl;
}

which produced the following operations in a release build (I've edited this to show only the SSE parts):

fild        dword ptr [ebp+8] // load argc into FPU
fstp        dword ptr [esp+10h] // save argc as a float

movss       xmm0,dword ptr [esp+10h] // load argc into SSE
shufps      xmm0,xmm0,0 // copy argc to all values in SSE register
movaps      xmmword ptr [esp+20h],xmm0 // save to stack frame

fld1 // load 1 into FPU
fst         dword ptr [esp+20h] 
fst         dword ptr [esp+28h] 
fst         dword ptr [esp+30h] 
fstp        dword ptr [esp+38h] // create a (1,1,1,1) vector
movaps      xmm0,xmmword ptr [esp+2Ch] // load above vector into SSE
addps       xmm0,xmmword ptr [esp+1Ch] // add to vector a
movaps      xmmword ptr [esp+38h],xmm0 // save back to a

Note: the addresses are relative to ESP and there are a few pushes, which explains the odd changes of offset for the same value.

Now, compare that code to this version:

#include <iostream>
using namespace std;

int main (int argc, char *argv [])
{
  float a[4];
  for (int i = 0 ; i < 4 ; ++i)
  {
    a [i] = static_cast <float> (argc + i);
  }
  cout << "argc = " << argc << endl;
  for (int i = 0 ; i < 4 ; ++i)
  {
    a [i] += 1.0f;
  }
  cout << "a = (" << a [0] << ", " << a [1] << ", " << a [2] << ", " << a [3] << ", " << ")" << endl;
}

The compiler created this code for the above (again, edited and with odd offsets):

fild        dword ptr [argc] // converting argc to floating point values
fstp        dword ptr [esp+8] 
fild        dword ptr [esp+4] // the argc+i is done in the integer unit
fstp        dword ptr [esp+0Ch] 
fild        dword ptr [esp+8] 
fstp        dword ptr [esp+18h]
fild        dword ptr [esp+10h]
fstp        dword ptr [esp+24h] // array a now initialised

fld         dword ptr [esp+8] // load a[0]
fld1 // load 1 into FPU
fadd        st(1),st // increment a[0]
fxch        st(1)
fstp        dword ptr [esp+14h] // save a[0]
fld         dword ptr [esp+1Ch] // load a[1]
fadd        st,st(1) // increment a[1]
fstp        dword ptr [esp+24h] // save a[1]
fld         dword ptr [esp+28h] // load a[2]
fadd        st,st(1) // increment a[2]
fstp        dword ptr [esp+28h]  // save a[2]
fadd        dword ptr [esp+2Ch] // increment a[3]
fstp        dword ptr [esp+2Ch] // save a[3]

In terms of memory access, the increment requires:

SSE                  FPU
4xfloat write        1xfloat read
1xsse read           1xfloat write
1xsse read+add       1xfloat read
1xsse write          1xfloat write
                     1xfloat read
                     1xfloat write
                     1xfloat read
                     1xfloat write

total
8 float reads        4 float reads
8 float writes       4 float writes

This shows the SSE version is using twice the memory bandwidth of the FPU version, and memory bandwidth is a major bottleneck.
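A large part of that traffic comes from operator++ building its (1,1,1,1) vector through an aligned stack array on every call. As a minimal sketch (an illustrative rewrite using your vec4 class, not code from the question), the same increment can keep the constant in a register:

#include <xmmintrin.h>

// Minimal sketch: the all-ones constant is created with _mm_set1_ps so the
// compiler can keep it in an XMM register instead of writing four floats to
// the stack and reloading them on every call.
inline vec4 increment(const vec4 &a)
{
    return vec4(_mm_add_ps(a.v, _mm_set1_ps(1.0f)));
}

The add itself is unchanged; only the store/reload of the constant disappears.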

If you want to seriously maximise the SSE then you need to write the whole algorithm in a single SSE assembler function so that you can eliminate the memory reads/writes as much as possible. Using the intrinsics is not an ideal solution for optimisation.
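Whether that is done in assembler or with carefully written intrinsics, the principle is the same: keep zr, zi, the constants and the mask in __m128 variables for the whole inner loop, and store to memory only once per 4-pixel block. A rough sketch of such a register-resident inner loop (the variable names and the masked-count bookkeeping are illustrative, not taken from the question's code):

#include <xmmintrin.h>

// Sketch only: the Julia iteration for four pixels kept entirely in XMM
// registers. zr, zi hold the starting points for four pixels, cr and ci the
// Julia constant broadcast to all lanes; the per-lane escape count is
// returned and should be stored with _mm_store_ps once, after the loop.
static __m128 julia_counts(__m128 zr, __m128 zi, __m128 cr, __m128 ci)
{
    const __m128 four = _mm_set1_ps(4.0f);
    const __m128 two  = _mm_set1_ps(2.0f);
    const __m128 one  = _mm_set1_ps(1.0f);
    __m128 count = _mm_setzero_ps();

    for (int i = 0; i < 100; ++i)
    {
        __m128 zr2  = _mm_mul_ps(zr, zr);
        __m128 zi2  = _mm_mul_ps(zi, zi);
        __m128 mask = _mm_cmplt_ps(_mm_add_ps(zr2, zi2), four); // lanes with |z|^2 < 4

        if (_mm_movemask_ps(mask) == 0)   // every lane has escaped
            break;

        count = _mm_add_ps(count, _mm_and_ps(mask, one)); // count only active lanes

        __m128 tzi = _mm_add_ps(_mm_mul_ps(two, _mm_mul_ps(zr, zi)), ci);
        zr = _mm_add_ps(_mm_sub_ps(zr2, zi2), cr);
        zi = tzi;
    }
    return count;
}

Nothing inside that loop touches memory; whether the compiler manages the same with the overloaded-operator version depends on how well it keeps the vec4 temporaries in registers.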

Here is another example (Mandelbrot sets) which is almost the same as my implementation of the Julia set algorithm: http://pastebin.com/J90paPVC , based on http://www.iquilezles.org/www/articles/sse/sse.htm . Same story, FPU > SSE, even though I skipped some irrelevant operations. Any ideas how to do it right?
