
How to absolute 2 double or 4 floats using SSE instruction set? (Up to SSE4)

Here's the sample C code that I am trying to accelerate using SSE. The two arrays are 3072 elements long, holding doubles; I may drop down to float if I don't need the precision of doubles.

double sum = 0.0;

for(k = 0; k < 3072; k++) {
    sum += fabs(sima[k] - simb[k]);
}

double fp = (1.0 - (sum / (255.0 * 1024.0 * 3.0)));

Anyway, my current problem is how to do the fabs step in an SSE register for doubles or floats, so that I can keep the whole calculation in SSE registers. That keeps it fast and lets me parallelize all of the steps by partly unrolling this loop.

Here are some resources I've found: fabs() in asm, or possibly this approach of flipping the sign bit (on SO); however, the weakness of the second one is that it needs a conditional check.

I suggest using bitwise AND with a mask. Positive and negative values have the same representation; only the most significant bit differs: it is 0 for positive values and 1 for negative values (see the double-precision number format). You can use one of these:

inline __m128 abs_ps(__m128 x) {
    const __m128 sign_mask = _mm_set1_ps(-0.f); // -0.f = 1 << 31 (sign bit only)
    return _mm_andnot_ps(sign_mask, x);         // ~sign_mask & x clears the sign bit
}

inline __m128d abs_pd(__m128d x) {
    const __m128d sign_mask = _mm_set1_pd(-0.); // -0. = 1 << 63 (sign bit only)
    return _mm_andnot_pd(sign_mask, x);         // ~sign_mask & x clears the sign bit
}

Also, it might be a good idea to unroll the loop to break the loop-carried dependency chain. Since this is a sum of nonnegative values, the order of summation is not important:

double norm(const double* sima, const double* simb) {
    const __m128d* sima_pd = (const __m128d*) sima;
    const __m128d* simb_pd = (const __m128d*) simb;

    __m128d sum1 = _mm_setzero_pd();
    __m128d sum2 = _mm_setzero_pd();
    for (int k = 0; k < 3072/2; k += 2) {
        sum1 = _mm_add_pd(sum1, abs_pd(_mm_sub_pd(sima_pd[k],   simb_pd[k])));
        sum2 = _mm_add_pd(sum2, abs_pd(_mm_sub_pd(sima_pd[k+1], simb_pd[k+1])));
    }

    __m128d sum  = _mm_add_pd(sum1, sum2);
    __m128d hsum = _mm_hadd_pd(sum, sum); // horizontal add, SSE3
    return _mm_cvtsd_f64(hsum);           // extract the low lane
}

By unrolling and breaking the dependency (sum1 and sum2 are now independent), you let the processor execute the additions out of order. Since the instruction is pipelined on a modern CPU, the CPU can start working on a new addition before the previous one is finished. Also, bitwise operations are executed on a separate execution unit, so the CPU can actually perform one in the same cycle as an addition/subtraction. I suggest Agner Fog's optimization manuals.

Finally, I don't recommend using OpenMP. The loop is too small, and the overhead of distributing the job among multiple threads might be bigger than any potential benefit.

The maximum of -x and x should be abs(x). Here it is in code:

x = _mm_max_ps(_mm_sub_ps(_mm_setzero_ps(), x), x)

Probably the easiest way is as follows:

__m128d vsum = _mm_set1_pd(0.0);        // init partial sums
for (k = 0; k < 3072; k += 2)
{
    __m128d va = _mm_load_pd(&sima[k]); // load 2 doubles from sima, simb
    __m128d vb = _mm_load_pd(&simb[k]);
    __m128d vdiff = _mm_sub_pd(va, vb);                     // calc diff = sima - simb
    __m128d vnegdiff = _mm_sub_pd(_mm_set1_pd(0.0), vdiff); // calc neg diff = 0.0 - diff
    __m128d vabsdiff = _mm_max_pd(vdiff, vnegdiff);         // calc abs diff = max(diff, -diff)
    vsum = _mm_add_pd(vsum, vabsdiff);  // accumulate two partial sums
}

Note that this may not be any faster than scalar code on modern x86 CPUs, which typically have two FPUs anyway. However, if you can drop down to single precision then you may well get a 2x throughput improvement.

Note also that you will need to combine the two partial sums in vsum into a scalar value after the loop, but this is fairly trivial to do and is not performance-critical.
