
Vectorizing a loop over float x,y,z arrays calculating length and differences using SSE Intrinsics

I am trying to convert a loop I have into SSE intrinsics. I seem to have made fairly good progress, by which I mean it's headed in the right direction, but I appear to have gotten some of the translation wrong somewhere, as I am not getting the same "correct" answer that the non-SSE code produces.

My original loop, which I unrolled by a factor of 4, looks like this:

int unroll_n = (N/4)*4;

for (int j = 0; j < unroll_n; j++) {
        for (int i = 0; i < unroll_n; i+=4) {
            float rx = x[j] - x[i];
            float ry = y[j] - y[i];
            float rz = z[j] - z[i];
            float r2 = rx*rx + ry*ry + rz*rz + eps;
            float r2inv = 1.0f / sqrt(r2);
            float r6inv = r2inv * r2inv * r2inv;
            float s = m[j] * r6inv;
            ax[i] += s * rx;
            ay[i] += s * ry;
            az[i] += s * rz;
            //unroll i 2
            rx = x[j] - x[i+1];
            ry = y[j] - y[i+1];
            rz = z[j] - z[i+1];
            r2 = rx*rx + ry*ry + rz*rz + eps;
            r2inv = 1.0f / sqrt(r2);
            r6inv = r2inv * r2inv * r2inv;
            s = m[j] * r6inv;
            ax[i+1] += s * rx;
            ay[i+1] += s * ry;
            az[i+1] += s * rz;
            //unroll i 3
            rx = x[j] - x[i+2];
            ry = y[j] - y[i+2];
            rz = z[j] - z[i+2];
            r2 = rx*rx + ry*ry + rz*rz + eps;
            r2inv = 1.0f / sqrt(r2);
            r6inv = r2inv * r2inv * r2inv;
            s = m[j] * r6inv;
            ax[i+2] += s * rx;
            ay[i+2] += s * ry;
            az[i+2] += s * rz;
            //unroll i 4
            rx = x[j] - x[i+3];
            ry = y[j] - y[i+3];
            rz = z[j] - z[i+3];
            r2 = rx*rx + ry*ry + rz*rz + eps;
            r2inv = 1.0f / sqrt(r2);
            r6inv = r2inv * r2inv * r2inv;
            s = m[j] * r6inv;
            ax[i+3] += s * rx;
            ay[i+3] += s * ry;
            az[i+3] += s * rz;
        }
}

I essentially then went line by line through the top section and converted it into SSE intrinsics. The code is below. I'm not totally sure whether the top three lines are needed; however, I understand that my data needs to be 16-byte aligned for this to work correctly and optimally.

float *x = malloc(sizeof(float) * N);
float *y = malloc(sizeof(float) * N);
float *z = malloc(sizeof(float) * N); 

for (int j = 0; j < N; j++) {
    for (int i = 0; i < N; i+=4) {
        __m128 xj_v = _mm_set1_ps(x[j]);
        __m128 xi_v = _mm_load_ps(&x[i]);
        __m128 rx_v = _mm_sub_ps(xj_v, xi_v);

        __m128 yj_v = _mm_set1_ps(y[j]);
        __m128 yi_v = _mm_load_ps(&y[i]);
        __m128 ry_v = _mm_sub_ps(yj_v, yi_v);

        __m128 zj_v = _mm_set1_ps(z[j]);
        __m128 zi_v = _mm_load_ps(&z[i]);
        __m128 rz_v = _mm_sub_ps(zj_v, zi_v);

        __m128 r2_v = _mm_mul_ps(rx_v, rx_v) + _mm_mul_ps(ry_v, ry_v) + _mm_mul_ps(rz_v, rz_v) + _mm_set1_ps(eps);

        __m128 r2inv_v = _mm_div_ps(_mm_set1_ps(1.0f), _mm_sqrt_ps(r2_v));

        __m128 r6inv_1v = _mm_mul_ps(r2inv_v, r2inv_v);
        __m128 r6inv_v = _mm_mul_ps(r6inv_1v, r2inv_v);

        __m128 mj_v = _mm_set1_ps(m[j]);
        __m128 s_v = _mm_mul_ps(mj_v, r6inv_v);

        __m128 axi_v = _mm_load_ps(&ax[i]);
        __m128 ayi_v = _mm_load_ps(&ay[i]);
        __m128 azi_v = _mm_load_ps(&az[i]);

        __m128 srx_v = _mm_mul_ps(s_v, rx_v);
        __m128 sry_v = _mm_mul_ps(s_v, ry_v);
        __m128 srz_v = _mm_mul_ps(s_v, rz_v);

        axi_v = _mm_add_ps(axi_v, srx_v);
        ayi_v = _mm_add_ps(ayi_v, srx_v);
        azi_v = _mm_add_ps(azi_v, srx_v);

        _mm_store_ps(ax, axi_v);
        _mm_store_ps(ay, ayi_v);
        _mm_store_ps(az, azi_v);
    }
}
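
On the alignment point: plain malloc is not guaranteed to return 16-byte-aligned memory on every platform, and _mm_load_ps / _mm_store_ps fault on unaligned addresses. A minimal sketch of one way to guarantee alignment (using _mm_malloc / _mm_free from <xmmintrin.h>; this is not part of the original code) would be:

#include <xmmintrin.h>

// request 16-byte alignment explicitly so the aligned SSE loads/stores are safe
float *x = _mm_malloc(sizeof(float) * N, 16);
float *y = _mm_malloc(sizeof(float) * N, 16);
float *z = _mm_malloc(sizeof(float) * N, 16);

// ... use the arrays ...

_mm_free(x);
_mm_free(y);
_mm_free(z);

(C11 aligned_alloc(16, sizeof(float) * N) also works when the size is a multiple of 16, paired with plain free.)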

I feel the main idea is correct, but there is an error (or errors) somewhere, as the resulting answer is incorrect.

I think your only bugs are simple typos, not logic errors; see below.

Can't you just use clang's auto-vectorization? Or do you need to use gcc for this code? Auto-vectorization would let you make SSE, AVX, and (in future) AVX512 versions from the same source with no modifications. Intrinsics aren't scalable to different vector sizes, unfortunately.

Based on your start at vectorizing, I made an optimized version. You should try it out; I'm curious to hear if it's faster than your version with bugfixes, or clang's auto-vectorized version. :)


This looks wrong:

_mm_store_ps(ax, axi_v);
_mm_store_ps(ay, ayi_v);
_mm_store_ps(az, azi_v);

You loaded from ax[i], but now you're storing to ax[0].
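
The fix (confirmed by the full version at the end of this answer) is to store back to the same elements you loaded:

_mm_store_ps(&ax[i], axi_v);
_mm_store_ps(&ay[i], ayi_v);
_mm_store_ps(&az[i], azi_v);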


Also, clang's unused-variable warning found this bug:

axi_v = _mm_add_ps(axi_v, srx_v);
ayi_v = _mm_add_ps(ayi_v, srx_v);  // should be sry_v
azi_v = _mm_add_ps(azi_v, srx_v);  // should be srz_v

Like I said in my answer on your previous question, you should probably interchange the loops, so the same ax[i+0..3], ay[i+0..3], and az[i+0..3] are used, avoiding that load/store.

Also, if you're not going to use rsqrtps + Newton-Raphson, you should use the transformation I pointed out in my last answer: divide m[j] by sqrt(r2)^3. There's no point dividing 1.0f by something using a divps, and then later multiplying only once.
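
In scalar terms, the transformation looks like this (a sketch of the idea, not code from either version; it needs <math.h>):

/* before: explicit reciprocal square root, cubed, then one multiply */
float r2inv = 1.0f / sqrtf(r2);
float s_old = m[j] * r2inv * r2inv * r2inv;

/* after: one divide, no reciprocal; sqrt(r2)^3 == r2 * sqrtf(r2) */
float s_new = m[j] / (r2 * sqrtf(r2));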

rsqrt might not actually be a win, because total uop throughput might be more of a bottleneck than div / sqrt throughput or latency. Three multiplies + FMA + rsqrtps is significantly more uops than sqrtps + divps. rsqrt is more helpful with AVX 256b vectors, because the divide / sqrt unit isn't full-width on SnB through Broadwell. Skylake has 12c latency sqrtps ymm, same as for xmm, but throughput is still better for xmm (one per 3c instead of one per 6c).

clang and gcc were both using rsqrtps / rsqrtss when compiling your code with -ffast-math. (Only clang using the packed version, of course.)


If you don't interchange the loops, you should manually hoist everything that only depends on j out of the inner loop. Compilers are generally good at this, but it still seems like a good idea to make the source reflect what you expect the compiler to be able to do. This helps with "seeing" what the loop is actually doing.
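
As a concrete illustration, here is a sketch of the hoisted (but not interchanged) variant, keeping the original ax[i] += load/store pattern. The function name ffunc_hoisted is made up for this example, and it assumes N is a multiple of 4 and 16-byte-aligned arrays:

#include <immintrin.h>

void ffunc_hoisted(float *restrict ax, float *restrict ay, float *restrict az,
                   const float *x, const float *y, const float *z,
                   int N, float eps, const float *restrict m)
{
  __m128 eps_v = _mm_set1_ps(eps);          // fully loop-invariant
  for (int j = 0; j < N; j++) {
    // broadcasts that depend only on j, hoisted out of the inner i loop
    __m128 xj_v = _mm_set1_ps(x[j]);
    __m128 yj_v = _mm_set1_ps(y[j]);
    __m128 zj_v = _mm_set1_ps(z[j]);
    __m128 mj_v = _mm_set1_ps(m[j]);
    for (int i = 0; i < N; i += 4) {
      __m128 rx_v = _mm_sub_ps(xj_v, _mm_load_ps(&x[i]));
      __m128 ry_v = _mm_sub_ps(yj_v, _mm_load_ps(&y[i]));
      __m128 rz_v = _mm_sub_ps(zj_v, _mm_load_ps(&z[i]));
      __m128 r2_v = _mm_add_ps(_mm_mul_ps(rx_v, rx_v),
                    _mm_add_ps(_mm_mul_ps(ry_v, ry_v),
                    _mm_add_ps(_mm_mul_ps(rz_v, rz_v), eps_v)));
      // s = m[j] / r2^(3/2), using the sqrt(v)*v trick discussed above
      __m128 s_v = _mm_div_ps(mj_v, _mm_mul_ps(r2_v, _mm_sqrt_ps(r2_v)));
      _mm_store_ps(&ax[i], _mm_add_ps(_mm_load_ps(&ax[i]), _mm_mul_ps(s_v, rx_v)));
      _mm_store_ps(&ay[i], _mm_add_ps(_mm_load_ps(&ay[i]), _mm_mul_ps(s_v, ry_v)));
      _mm_store_ps(&az[i], _mm_add_ps(_mm_load_ps(&az[i]), _mm_mul_ps(s_v, rz_v)));
    }
  }
}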


Here's a version with some optimizations over your original:

  • To get gcc/clang to fuse the mul/adds into FMA, I used -ffp-contract=fast. This gets FMA instructions for high throughput without using -ffast-math. (There is a lot of parallelism with the three separate accumulators, so the increased latency of FMA compared to addps shouldn't hurt at all. I expect port0/1 throughput is the limiting factor here.) I thought gcc did this automatically, but it seems it doesn't here without -ffast-math.

  • Notice that v^(3/2) = sqrt(v)^3 = sqrt(v)*v. This has lower latency and fewer instructions.

  • Interchanged the loops, and used broadcast-loads in the inner loop to improve locality (cutting the bandwidth requirement by 4, or 8 with AVX). Each iteration of the inner loop only reads 4B of new data from each of the four source arrays (x, y, z, and m). So it makes a lot of use of each cache line while it's hot.

    Using broadcast-loads in the inner loop also means we accumulate ax[i + 0..3] in parallel, avoiding the need for a horizontal sum, which takes extra code (a sketch of such a horizontal sum follows this list). (See a previous version of this answer for code with the loops interchanged, but that used vector loads in the inner loop, with stride = 16B.)
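
For reference, this is the kind of shuffle-based horizontal sum that the broadcast-load formulation avoids (a sketch only, not needed in the code below; hsum_ps is a made-up helper name, and _mm_movehdup_ps requires SSE3):

#include <immintrin.h>

static inline float hsum_ps(__m128 v) {
    __m128 shuf = _mm_movehdup_ps(v);        // [v1, v1, v3, v3]
    __m128 sums = _mm_add_ps(v, shuf);       // [v0+v1, ., v2+v3, .]
    sums = _mm_add_ss(sums, _mm_movehl_ps(shuf, sums)); // (v0+v1)+(v2+v3)
    return _mm_cvtss_f32(sums);              // extract the low scalar
}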

The version below compiles nicely for Haswell with gcc, using FMA. (Still only 128b vector size though, unlike clang's auto-vectorized 256b version.) The inner loop is only 20 instructions, and only 13 of those are FPU ALU instructions that need port 0/1 on Intel SnB-family. It makes decent code even with baseline SSE2: no FMA, and it needs shufps for the broadcast-loads, but those don't compete with add/mul for execution units.

#include <immintrin.h>

void ffunc_sse128(float *restrict ax, float *restrict ay, float *restrict az,
           const float *x, const float *y, const float *z,
           int N, float eps, const float *restrict m)
{
  for (int i = 0; i < N; i+=4) {
    __m128 xi_v = _mm_load_ps(&x[i]);
    __m128 yi_v = _mm_load_ps(&y[i]);
    __m128 zi_v = _mm_load_ps(&z[i]);

    // vector accumulators for ax[i + 0..3] etc.
    __m128 axi_v = _mm_setzero_ps();
    __m128 ayi_v = _mm_setzero_ps();
    __m128 azi_v = _mm_setzero_ps();
    
    // AVX broadcast-loads are as cheap as normal loads,
    // and data-reuse meant that stand-alone load instructions were used anyway.
    // so we're not even losing out on folding loads into other insns
    // An inner-loop stride of only 4B is a huge win if memory / cache bandwidth is a bottleneck
    // even without AVX, the shufps instructions are cheap,
    // and don't compete with add/mul for execution units on Intel
    
    for (int j = 0; j < N; j++) {
      __m128 xj_v = _mm_set1_ps(x[j]);
      __m128 rx_v = _mm_sub_ps(xj_v, xi_v);

      __m128 yj_v = _mm_set1_ps(y[j]);
      __m128 ry_v = _mm_sub_ps(yj_v, yi_v);

      __m128 zj_v = _mm_set1_ps(z[j]);
      __m128 rz_v = _mm_sub_ps(zj_v, zi_v);

      __m128 mj_v = _mm_set1_ps(m[j]);   // do the load early

      // sum of squared differences
      __m128 r2_v = _mm_set1_ps(eps) + rx_v*rx_v + ry_v*ry_v + rz_v*rz_v;   // GNU extension
      /* __m128 r2_v = _mm_add_ps(_mm_set1_ps(eps), _mm_mul_ps(rx_v, rx_v));
      r2_v = _mm_add_ps(r2_v, _mm_mul_ps(ry_v, ry_v));
      r2_v = _mm_add_ps(r2_v, _mm_mul_ps(rz_v, rz_v));
      */

      // rsqrt and a Newton-Raphson iteration might have lower latency
      // but there's enough surrounding instructions and cross-iteration parallelism
      // that the single-uop sqrtps and divps instructions probably aren't a bottleneck
      __m128 r2sqrt = _mm_sqrt_ps(r2_v);
      __m128 r6sqrt = _mm_mul_ps(r2_v, r2sqrt);  // v^(3/2) = sqrt(v)^3 = sqrt(v)*v
      __m128 s_v = _mm_div_ps(mj_v, r6sqrt);

      __m128 srx_v = _mm_mul_ps(s_v, rx_v);
      __m128 sry_v = _mm_mul_ps(s_v, ry_v);
      __m128 srz_v = _mm_mul_ps(s_v, rz_v);

      axi_v = _mm_add_ps(axi_v, srx_v);
      ayi_v = _mm_add_ps(ayi_v, sry_v);
      azi_v = _mm_add_ps(azi_v, srz_v);
    }
    _mm_store_ps(&ax[i], axi_v);
    _mm_store_ps(&ay[i], ayi_v);
    _mm_store_ps(&az[i], azi_v);
  }
}

I also tried a version with rcpps, but IDK if it will be faster. Note that with -ffast-math, gcc and clang will convert the division into rcpps + a Newton iteration. (For some reason they don't convert 1.0f/sqrtf(x) into rsqrt + Newton, even in a stand-alone function.) clang does a better job, using FMA for the iteration step.

#define USE_RSQRT
#ifndef USE_RSQRT
      // even with -mrecip=vec-sqrt after -ffast-math, this still does sqrt(v)*v, then rcpps
      __m128 r2sqrt = _mm_sqrt_ps(r2_v);
      __m128 r6sqrt = _mm_mul_ps(r2_v, r2sqrt);  // v^(3/2) = sqrt(v)^3 = sqrt(v)*v
      __m128 s_v = _mm_div_ps(mj_v, r6sqrt);
#else
      __m128 r2isqrt = rsqrt_float4_single(r2_v);
      // can't use the sqrt(v)*v trick, unless we either do normal sqrt first then rcpps
      // or rsqrtps and rcpps.  Maybe it's possible to do a Newton-Raphson iteration on that product
      // instead of refining them both separately?
      __m128 r6isqrt = r2isqrt * r2isqrt * r2isqrt;
      __m128 s_v = _mm_mul_ps(mj_v, r6isqrt);
#endif
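
rsqrt_float4_single isn't defined in the snippet above; a typical implementation (an assumption here, not code from the original answer) refines _mm_rsqrt_ps with one Newton-Raphson step, y1 = y0 * (1.5 - 0.5 * x * y0^2):

static inline __m128 rsqrt_float4_single(__m128 x) {
    __m128 y = _mm_rsqrt_ps(x);                         // ~12-bit approximation
    __m128 neg_half_x = _mm_mul_ps(x, _mm_set1_ps(-0.5f));
    __m128 y2 = _mm_mul_ps(y, y);
    // y * (1.5 + (-0.5 * x) * y^2) roughly doubles the precision
    return _mm_mul_ps(y, _mm_add_ps(_mm_set1_ps(1.5f),
                                    _mm_mul_ps(neg_half_x, y2)));
}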
