Performance AVX/SSE assembly vs. intrinsics

I'm trying to work out the best approach to optimizing some basic routines. In this case I tried a very simple example: multiplying two float vectors together:

void Mul(float *src1, float *src2, float *dst)
{
    // cnt is the element count, defined elsewhere
    for (int i = 0; i < cnt; i++) dst[i] = src1[i] * src2[i];
}

The plain C implementation is very slow. I wrote some external ASM using AVX and also tried using intrinsics. These are the test results (time, smaller is better):

ASM: 0.110
IPP: 0.125
Intrinsics: 0.18
Plain C++: 4.0

(Compiled using MSVC 2013 with SSE2; I also tried the Intel Compiler and the results were pretty much the same.)

As you can see, my ASM code beat even Intel Performance Primitives (probably because I added lots of branches to make sure I could use the aligned AVX instructions). But I'd personally prefer the intrinsics approach: it's simply easier to manage, and I was thinking the compiler should do the best job of optimizing all the branches and so on (my ASM code is admittedly crude in that regard, yet it is faster). So here's the code using intrinsics:

    int i;
    // Scalar prologue: advance until dst is 32-byte aligned
    // (MINTEGER is a pointer-sized integer type, e.g. uintptr_t)
    for (i = 0; (MINTEGER)(dst + i) % 32 != 0 && i < cnt; i++) dst[i] = src1[i] * src2[i];

    if ((MINTEGER)(src1 + i) % 32 == 0)
    {
        if ((MINTEGER)(src2 + i) % 32 == 0)
        {
            // Both sources aligned
            for (; i < cnt - 8; i += 8)
            {
                __m256 x = _mm256_load_ps(src1 + i);
                __m256 y = _mm256_load_ps(src2 + i);
                __m256 z = _mm256_mul_ps(x, y);
                _mm256_store_ps(dst + i, z);
            }
        }
        else
        {
            // Only src1 aligned
            for (; i < cnt - 8; i += 8)
            {
                __m256 x = _mm256_load_ps(src1 + i);
                __m256 y = _mm256_loadu_ps(src2 + i);
                __m256 z = _mm256_mul_ps(x, y);
                _mm256_store_ps(dst + i, z);
            }
        }
    }
    else
    {
        // Neither source aligned
        for (; i < cnt - 8; i += 8)
        {
            __m256 x = _mm256_loadu_ps(src1 + i);
            __m256 y = _mm256_loadu_ps(src2 + i);
            __m256 z = _mm256_mul_ps(x, y);
            _mm256_store_ps(dst + i, z);
        }
    }

    // Scalar epilogue for the remaining elements
    for (; i < cnt; i++) dst[i] = src1[i] * src2[i];

Simple: first advance until dst is aligned to 32 bytes, then branch to check which sources are aligned.

One problem is that the C++ loops at the beginning and at the end do not use AVX unless I enable AVX in the compiler, which I do NOT want to do, because this should be just an AVX specialization and the software should still work on platforms where AVX is not available. And sadly there seem to be no intrinsics for instructions such as vmovss, so there's probably a penalty for mixing AVX code with the SSE code the compiler generates. However, even when I enabled AVX in the compiler, it still didn't get below 0.14.
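For what it's worth, the usual mitigation I know of for that mixing penalty is to clear the upper YMM state with _mm256_zeroupper() once the 256-bit part is done; a rough sketch of the idea, using a simplified unaligned-only variant of the routine above (names and the cnt parameter are made up for the example):

#include <immintrin.h>

void MulAVX(const float *src1, const float *src2, float *dst, int cnt)
{
    int i;
    for (i = 0; i <= cnt - 8; i += 8)
    {
        __m256 x = _mm256_loadu_ps(src1 + i);
        __m256 y = _mm256_loadu_ps(src2 + i);
        _mm256_storeu_ps(dst + i, _mm256_mul_ps(x, y));
    }
    _mm256_zeroupper();   // clear upper YMM halves before legacy-SSE/scalar code runs
    for (; i < cnt; i++) dst[i] = src1[i] * src2[i];
}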

Any ideas how to optimize this so the intrinsics reach the speed of the ASM code?

Your implementation with intrinsics is not the same function as your implementation in straight C: e.g. what if your function were called with the arguments Mul(p, p, p+1)? You'd get different results. The pure C version is slow because the compiler is ensuring that the code does exactly what you said.

If you want the compiler to make optimizations based on the assumption that the three arrays do not overlap, you need to make that explicit:

void Mul(float *src1, float *src2, float *__restrict__ dst)

or, even better:

void Mul(const float *src1, const float *src2, float *__restrict__ dst)

(I think it's enough to have __restrict__ just on the output pointer, although it wouldn't hurt to add it to the input pointers too.)
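As a sketch, the no-aliasing version of the original routine might look like this (cnt is taken as a parameter here, which is an assumption; note that MSVC spells the keyword __restrict while GCC/Clang use __restrict__):

void Mul(const float *src1, const float *src2, float * __restrict__ dst, int cnt)
{
    // With the qualifiers above the compiler is free to auto-vectorize this loop
    for (int i = 0; i < cnt; i++)
        dst[i] = src1[i] * src2[i];
}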

On CPUs with AVX there is very little penalty for using misaligned loads, so I would suggest trading that small penalty against all the extra logic you're using to check for alignment, and just having a single loop plus scalar code to handle any residual elements:

   for (i = 0; i <= cnt - 8; i += 8)
   {
        __m256 x = _mm256_loadu_ps(src1 + i); 
        __m256 y = _mm256_loadu_ps(src2 + i); 
        __m256 z = _mm256_mul_ps(x, y); 
        _mm256_storeu_ps(dst + i, z);
   }
   for ( ; i < cnt; i++)
   {
       dst[i] = src1[i] * src2[i];
   }

Better still, make sure that your buffers are all 32-byte aligned in the first place and then just use aligned loads/stores.
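For example, a minimal sketch of one way to get such buffers (this assumes the _mm_malloc/_mm_free helpers from the intrinsics headers; C11 aligned_alloc or MSVC's _aligned_malloc would work equally well):

#include <immintrin.h>

// Allocate 32-byte aligned buffers so the AVX loop can use aligned loads/stores
float *src1 = (float *)_mm_malloc(cnt * sizeof(float), 32);
float *src2 = (float *)_mm_malloc(cnt * sizeof(float), 32);
float *dst  = (float *)_mm_malloc(cnt * sizeof(float), 32);

// ... fill src1/src2, call Mul, consume dst ...

_mm_free(src1);
_mm_free(src2);
_mm_free(dst);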

Note that performing a single arithmetic operation in a loop like this is generally a bad approach with SIMD: execution time will be largely dominated by loads and stores, so you should try to combine this multiplication with other SIMD operations to mitigate the load/store cost.
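As an illustration, a hypothetical fused routine that also needs an addition could do both operations in one pass, so the extra arithmetic comes almost for free relative to the memory traffic (the name, signature, and third input are made up for the example):

// Hypothetical: dst[i] = src1[i] * src2[i] + src3[i] in a single pass,
// so the multiply and the add share the same loads and the single store.
void MulAdd(const float *src1, const float *src2, const float *src3,
            float * __restrict__ dst, int cnt)
{
    int i;
    for (i = 0; i <= cnt - 8; i += 8)
    {
        __m256 prod = _mm256_mul_ps(_mm256_loadu_ps(src1 + i), _mm256_loadu_ps(src2 + i));
        __m256 sum  = _mm256_add_ps(prod, _mm256_loadu_ps(src3 + i));
        _mm256_storeu_ps(dst + i, sum);
    }
    for (; i < cnt; i++)
        dst[i] = src1[i] * src2[i] + src3[i];
}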
