
Multiplication using SSE: (x*x*x)+(y*y*y)

I'm trying to optimize this function using SIMD but I don't know where to start.

long cubeSum(int x, int y)
{
    return x*x*x + y*y*y;
}

The disassembled function looks like this:

  4007a0:   48 89 f2                mov    %rsi,%rdx
  4007a3:   48 89 f8                mov    %rdi,%rax
  4007a6:   48 0f af d6             imul   %rsi,%rdx
  4007aa:   48 0f af c7             imul   %rdi,%rax
  4007ae:   48 0f af d6             imul   %rsi,%rdx
  4007b2:   48 0f af c7             imul   %rdi,%rax
  4007b6:   48 8d 04 02             lea    (%rdx,%rax,1),%rax
  4007ba:   c3                      retq   
  4007bb:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)

The calling code looks like this:

do {
    for (i = 0; i < maxi; i++) {
        j = nextj[i];
        long sum = cubeSum(i, j);
        while (sum <= p) {
            long x = sum & (psize - 1);
            int flag = table[x];
            if (flag <= guard) {
                table[x] = guard+1;
            } else if (flag == guard+1) {
                table[x] = guard+2;
                count++;
            }
            j++;
            sum = cubeSum(i, j);
        }
        nextj[i] = j;
    }
    p += psize;
    guard += 3;
} while (p <= n);

  • Fill one SSE register with (x|y|0|0) (each SSE register holds four 32-bit elements). Let's call it r1.
  • Then copy that register into another register, r2.
  • Compute r2 * r1, storing the result in, say, r2.
  • Compute r2 * r1 again, storing the result in r2.
  • Now r2 holds (x*x*x|y*y*y|0|0).
  • Unpack the lower two elements of r2 into separate registers and add them (SSE3 has horizontal add instructions, but only for floats and doubles).
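
The steps above can be sketched with SSE2 intrinsics roughly as follows. This is only an illustration under some assumptions: the helper name `cube_sum_sse2` is made up, the inputs are treated as unsigned 32-bit values, and the result wraps modulo 2^32. Because the packed 32-bit multiply (`pmulld`) only arrived with SSE4.1, the sketch uses SSE2's `_mm_mul_epu32`, which multiplies lanes 0 and 2 only, so the register layout is (x|0|y|0) rather than (x|y|0|0).

```c
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: returns x*x*x + y*y*y computed modulo 2^32. */
static uint32_t cube_sum_sse2(uint32_t x, uint32_t y)
{
    /* r1 = (x | 0 | y | 0): _mm_mul_epu32 only reads lanes 0 and 2. */
    __m128i r1 = _mm_set_epi32(0, (int)y, 0, (int)x);
    __m128i r2 = r1;                     /* copy of r1 */

    r2 = _mm_mul_epu32(r2, r1);          /* 64-bit products x*x and y*y */
    r2 = _mm_mul_epu32(r2, r1);          /* low32(x*x)*x and low32(y*y)*y */

    /* Swap the two 64-bit halves so y's cube lands in lane 0, then add. */
    __m128i swapped = _mm_shuffle_epi32(r2, _MM_SHUFFLE(1, 0, 3, 2));
    __m128i total   = _mm_add_epi32(r2, swapped);

    return (uint32_t)_mm_cvtsi128_si32(total);  /* extract lane 0 */
}
```

For example, `cube_sum_sse2(2, 3)` yields 35. The set-up, shuffle and extract around the two multiplies already cost about as much as the scalar version, so this is not expected to be a speed-up.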

In the end, I'd actually be surprised if this turned out to be any faster than the simple code the compiler has already generated for you. SIMD is more useful if you have arrays of data you want to operate on.

This particular case is not a good fit for SIMD (SSE or otherwise). SIMD really only works well when you have contiguous arrays that you can access sequentially and process homogeneously, i.e. apply the same operation to every element.
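
For contrast, here is a minimal sketch of a workload where SIMD does fit: cubing every element of a contiguous array, four at a time. The function names `mullo32_sse2` and `cube_array_sse2` are hypothetical, the code assumes an x86 target with SSE2, and results wrap modulo 2^32. The packed multiply is emulated with `_mm_mul_epu32` because the direct instruction needs SSE4.1.

```c
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Packed 32-bit multiply (low 32 bits of each product), built from
   _mm_mul_epu32, which only multiplies lanes 0 and 2. */
static __m128i mullo32_sse2(__m128i a, __m128i b)
{
    __m128i even = _mm_mul_epu32(a, b);                   /* lanes 0, 2 */
    __m128i odd  = _mm_mul_epu32(_mm_srli_si128(a, 4),
                                 _mm_srli_si128(b, 4));   /* lanes 1, 3 */
    /* Gather the low 32 bits of each product and interleave them. */
    even = _mm_shuffle_epi32(even, _MM_SHUFFLE(0, 0, 2, 0));
    odd  = _mm_shuffle_epi32(odd,  _MM_SHUFFLE(0, 0, 2, 0));
    return _mm_unpacklo_epi32(even, odd);
}

/* Hypothetical helper: out[k] = in[k]^3 (mod 2^32) for k in [0, n). */
static void cube_array_sse2(const uint32_t *in, uint32_t *out, size_t n)
{
    size_t k = 0;
    for (; k + 4 <= n; k += 4) {
        __m128i v  = _mm_loadu_si128((const __m128i *)(in + k));
        __m128i v3 = mullo32_sse2(mullo32_sse2(v, v), v);
        _mm_storeu_si128((__m128i *)(out + k), v3);
    }
    for (; k < n; k++)                  /* scalar tail for leftovers */
        out[k] = in[k] * in[k] * in[k];
}
```

Here the loads and stores are sequential and every element gets the same treatment. The loop in the question, by contrast, stops on a data-dependent condition and scatters writes into `table`.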

However, you can at least get rid of some of the redundant operations in the scalar code, e.g. repeatedly calculating i * i * i when i is invariant in the inner loop:

do {
    for (i = 0; i < maxi; i++) {
        /* i is invariant in the inner loop, so cube it once.
           Cast to long so the products are not computed in int. */
        long i3 = (long)i * i * i;
        int j = nextj[i];
        long j3 = (long)j * j * j;
        long sum = i3 + j3;
        while (sum <= p) {
            long x = sum & (psize - 1);
            int flag = table[x];
            if (flag <= guard) {
                table[x] = guard+1;
            } else if (flag == guard+1) {
                table[x] = guard+2;
                count++;
            }
            j++;
            j3 = (long)j * j * j;   /* only the j term changes */
            sum = i3 + j3;
        }
        nextj[i] = j;
    }
    p += psize;
    guard += 3;
} while (p <= n);
