SSE much slower than regular function

I am making a Julia set visualisation using SSE. Here is my code, the class and its operators:

#include <xmmintrin.h> // SSE intrinsics (__m128, _mm_* functions)

class vec4 {
    public:
        inline vec4(void) {}
        inline vec4(__m128 val) :v(val) {}

        __m128 v;

        inline void operator=(float *a) {v=_mm_load_ps(a);}
        inline vec4(float *a) {(*this)=a;} 
        inline vec4(float a) {(*this)=a;}

        inline void operator=(float a) {v=_mm_load1_ps(&a);}

};

inline vec4 operator+(const vec4 &a,const vec4 &b) { return _mm_add_ps(a.v,b.v); }
inline vec4 operator-(const vec4 &a,const vec4 &b) { return _mm_sub_ps(a.v,b.v); }
inline vec4 operator*(const vec4 &a,const vec4 &b) { return _mm_mul_ps(a.v,b.v); }
inline vec4 operator/(const vec4 &a,const vec4 &b) { return _mm_div_ps(a.v,b.v); }
inline vec4 operator++(const vec4 &a)
{
    __declspec(align(16)) float b[4]={1.0f,1.0f,1.0f,1.0f};
    vec4 B(b);
    return _mm_add_ps(a.v,B.v); 
}

The function itself:

vec4 TWO(2.0f);
vec4 FOUR(4.0f);
vec4 ZER(0.0f);

vec4 CR(cR);
vec4 CI(cI);

for (int i=0; i<320; i++) //H
{
    float *pr = (float*) _aligned_malloc(4 * sizeof(float), 16); //dynamic

    __declspec(align(16)) float pi=i*ratioY + startY;

    for (int j=0; j<420; j+=4) //W
    {

        pr[0]=j*ratioX + startX;
        for(int x=1;x<4;x++)
        {
            pr[x]=pr[x-1]+ratioX;
        }

        vec4 ZR(pr);
        vec4 ZI(pi);

        __declspec(align(16)) float color[4]={0.0f,0.0f,0.0f,0.0f};

        vec4 COLOR(color);
        vec4 COUNT(0.0f);

        __m128 MASK=ZER.v;

        int _count;
        enum {max_count=100};
        for (_count=0;_count<=max_count;_count++) 
        {

            vec4 tZR=ZR*ZR-ZI*ZI+CR;
            vec4 tZI=TWO*ZR*ZI+CI;
            vec4 LEN=tZR*tZR+tZI*tZI;

            __m128 MASKOLD=MASK;
            MASK=_mm_cmplt_ps(LEN.v,FOUR.v);

            ZR=_mm_or_ps(_mm_and_ps(MASK,tZR.v),_mm_andnot_ps(MASK,ZR.v));
            ZI=_mm_or_ps(_mm_and_ps(MASK,tZI.v),_mm_andnot_ps(MASK,ZI.v));

            __m128 CHECKNOTEQL=_mm_cmpneq_ps(MASK,MASKOLD);    
            COLOR=_mm_or_ps(_mm_and_ps(CHECKNOTEQL,COUNT.v),_mm_andnot_ps(CHECKNOTEQL,COLOR.v));

            COUNT=COUNT++;
            operations+=17;

            if (_mm_movemask_ps((LEN-FOUR).v)==0) break; 
        }
        _mm_store_ps(color,COLOR.v);

The SSE version needs 553k operations (mul, add, if) and takes ~320ms to finish the task; on the other hand the regular function takes 1428k operations but needs only ~90ms to compute. I used the VS2010 performance analyser and it seems that all the maths operators are running really slowly. What am I doing wrong?

The problem you are having is that the SSE intrinsics are doing far more memory operations than the non-SSE version. Using your vector class I wrote this:

#include <iostream>
using namespace std;

int main (int argc, char *argv [])
{
  vec4 a (static_cast <float> (argc));
  cout << "argc = " << argc << endl;
  a=++a;
  cout << "a = (" << a.v.m128_f32 [0] << ", " << a.v.m128_f32 [1] << ", " << a.v.m128_f32 [2] << ", " << a.v.m128_f32 [3] << ", " << ")" << endl;
}

which produced the following operations in a release build (I've edited this to show only the SSE parts):

fild        dword ptr [ebp+8] // load argc into FPU
fstp        dword ptr [esp+10h] // save argc as a float

movss       xmm0,dword ptr [esp+10h] // load argc into SSE
shufps      xmm0,xmm0,0 // copy argc to all values in SSE register
movaps      xmmword ptr [esp+20h],xmm0 // save to stack frame

fld1 // load 1 into FPU
fst         dword ptr [esp+20h] 
fst         dword ptr [esp+28h] 
fst         dword ptr [esp+30h] 
fstp        dword ptr [esp+38h] // create a (1,1,1,1) vector
movaps      xmm0,xmmword ptr [esp+2Ch] // load above vector into SSE
addps       xmm0,xmmword ptr [esp+1Ch] // add to vector a
movaps      xmmword ptr [esp+38h],xmm0 // save back to a

Note: the addresses are relative to ESP and there are a few pushes, which explains the odd changes of offset for the same value.

Now, compare that code to this version:

#include <iostream>
using namespace std;

int main (int argc, char *argv [])
{
  float a[4];
  for (int i = 0 ; i < 4 ; ++i)
  {
    a [i] = static_cast <float> (argc + i);
  }
  cout << "argc = " << argc << endl;
  for (int i = 0 ; i < 4 ; ++i)
  {
    a [i] += 1.0f;
  }
  cout << "a = (" << a [0] << ", " << a [1] << ", " << a [2] << ", " << a [3] << ", " << ")" << endl;
}

The compiler created this code for the above (again, edited and with odd offsets):

fild        dword ptr [argc] // converting argc to floating point values
fstp        dword ptr [esp+8] 
fild        dword ptr [esp+4] // the argc+i is done in the integer unit
fstp        dword ptr [esp+0Ch] 
fild        dword ptr [esp+8] 
fstp        dword ptr [esp+18h]
fild        dword ptr [esp+10h]
fstp        dword ptr [esp+24h] // array a now initialised

fld         dword ptr [esp+8] // load a[0]
fld1 // load 1 into FPU
fadd        st(1),st // increment a[0]
fxch        st(1)
fstp        dword ptr [esp+14h] // save a[0]
fld         dword ptr [esp+1Ch] // load a[1]
fadd        st,st(1) // increment a[1]
fstp        dword ptr [esp+24h] // save a[1]
fld         dword ptr [esp+28h] // load a[2]
fadd        st,st(1) // increment a[2]
fstp        dword ptr [esp+28h]  // save a[2]
fadd        dword ptr [esp+2Ch] // increment a[3]
fstp        dword ptr [esp+2Ch] // save a[3]

In terms of memory access, the increment requires:

SSE                  FPU
4xfloat write        1xfloat read
1xsse read           1xfloat write
1xsse read+add       1xfloat read
1xsse write          1xfloat write
                     1xfloat read
                     1xfloat write
                     1xfloat read
                     1xfloat write

total
8 float reads        4 float reads
8 float writes       4 float writes

This shows the SSE version is using twice the memory bandwidth of the FPU version, and memory bandwidth is a major bottleneck.
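A large part of that traffic comes from operator++ building its (1,1,1,1) vector through an aligned stack array on every call. As a minimal sketch (an illustrative rewrite using your vec4 class, not code from the question), the same increment can keep the constant in a register:

#include <xmmintrin.h>

// Minimal sketch: the all-ones constant is created with _mm_set1_ps so the
// compiler can keep it in an XMM register instead of writing four floats to
// the stack and reloading them on every call.
inline vec4 increment(const vec4 &a)
{
    return vec4(_mm_add_ps(a.v, _mm_set1_ps(1.0f)));
}

The add itself is unchanged; only the store/reload of the constant disappears.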

If you want to seriously maximise the SSE then you need to write the whole algorithm in a single SSE assembler function so that you can eliminate the memory reads/writes as much as possible. Using the intrinsics is not an ideal solution for optimisation.
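Whether that is done in assembler or with carefully written intrinsics, the principle is the same: keep zr, zi, the constants and the mask in __m128 variables for the whole inner loop, and store to memory only once per 4-pixel block. A rough sketch of such a register-resident inner loop (the variable names and the masked-count bookkeeping are illustrative, not taken from the question's code):

#include <xmmintrin.h>

// Sketch only: the Julia iteration for four pixels kept entirely in XMM
// registers. zr, zi hold the starting points for four pixels, cr and ci the
// Julia constant broadcast to all lanes; the per-lane escape count is
// returned and should be stored with _mm_store_ps once, after the loop.
static __m128 julia_counts(__m128 zr, __m128 zi, __m128 cr, __m128 ci)
{
    const __m128 four = _mm_set1_ps(4.0f);
    const __m128 two  = _mm_set1_ps(2.0f);
    const __m128 one  = _mm_set1_ps(1.0f);
    __m128 count = _mm_setzero_ps();

    for (int i = 0; i < 100; ++i)
    {
        __m128 zr2  = _mm_mul_ps(zr, zr);
        __m128 zi2  = _mm_mul_ps(zi, zi);
        __m128 mask = _mm_cmplt_ps(_mm_add_ps(zr2, zi2), four); // lanes with |z|^2 < 4

        if (_mm_movemask_ps(mask) == 0)   // every lane has escaped
            break;

        count = _mm_add_ps(count, _mm_and_ps(mask, one)); // count only active lanes

        __m128 tzi = _mm_add_ps(_mm_mul_ps(two, _mm_mul_ps(zr, zi)), ci);
        zr = _mm_add_ps(_mm_sub_ps(zr2, zi2), cr);
        zi = tzi;
    }
    return count;
}

Nothing inside that loop touches memory; whether the compiler manages the same with the overloaded-operator version depends on how well it keeps the vec4 temporaries in registers.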

Here is another example (Mandelbrot sets) which is almost the same as my implementation of the Julia set algorithm: http://pastebin.com/J90paPVC , based on http://www.iquilezles.org/www/articles/sse/sse.htm . Same story, FPU > SSE, even though I skipped some irrelevant operations. Any ideas how to do it right?
