
Apply a given function on a 256-bit vector using SIMD paradigm

Is there a way to evaluate a function along a __m256d/s vector? Like this:

#include <immintrin.h>

inline __m256d func(__m256d *a, __m256d *b)
{
    return 1 / ((*a + *b) * (*a + *b));
}

int main()
{
    __m256d a = _mm256_set_pd(1.0, 2.0, 3.0, 4.0);
    __m256d b = _mm256_set_pd(1.0, 2.0, 3.0, 4.0);
    __m256d c = func(&a, &b);

    return 0;
}
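
For reference, the same computation spelled out with explicit AVX intrinsics (what I would like to avoid writing by hand for every formula) might look roughly like this; func_intrin is just an illustrative name:

#include <immintrin.h>

// Sketch: c = 1 / ((a + b) * (a + b)) written with explicit intrinsics,
// evaluated on 4 doubles at a time.
inline __m256d func_intrin(__m256d a, __m256d b)
{
    __m256d s = _mm256_add_pd(a, b);               // a + b
    __m256d p = _mm256_mul_pd(s, s);               // (a + b)^2
    return _mm256_div_pd(_mm256_set1_pd(1.0), p);  // 1 / (a + b)^2
}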

I would like to evaluate any given mathematical function using the SIMD paradigm. If this isn't possible, wouldn't it be the biggest limitation of SIMD programming vs. GPGPU? I mean, I've realized that the compute power of CPUs in terms of FLOPS is getting closer to that of GPUs; some comparisons:

  • Nvidia Quadro K6000 ~ 5196 GFLOPS
  • Nvidia Quadro K5000 ~ 2169 GFLOPS
  • Intel Xeon E5-2699 v3 ~ 1728 GFLOPS (18 cores * 32 FLOP/cycle * 3 GHz)

Future guesses:

  • AVX-512 and a probable 20-core Xeon CPU: 3840 GFLOPS (20 cores * 64 FLOP/cycle * 3 GHz)

  • Knights Landing: 5907 GFLOPS (71 cores * 64 FLOP/cycle * 1.3 GHz)

Your question is very interesting. What you are describing cannot be done using existing compilers. But if you overload the basic operators to handle 256-bit vectors, you might be able to get close to the functionality you want.
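
A minimal sketch of that approach, assuming a compiler with AVX enabled; the Vec4d wrapper name and its operator set are illustrative for this sketch (libraries such as Agner Fog's Vector Class Library follow the same idea):

#include <immintrin.h>

// Illustrative wrapper around __m256d; the name Vec4d and the set of
// operators are assumptions for this sketch, not an existing API.
struct Vec4d
{
    __m256d v;
    Vec4d(__m256d x) : v(x) {}
    Vec4d(double x) : v(_mm256_set1_pd(x)) {}  // broadcast a scalar to all 4 lanes
};

// Element-wise operators built from AVX intrinsics.
inline Vec4d operator+(Vec4d a, Vec4d b) { return _mm256_add_pd(a.v, b.v); }
inline Vec4d operator*(Vec4d a, Vec4d b) { return _mm256_mul_pd(a.v, b.v); }
inline Vec4d operator/(Vec4d a, Vec4d b) { return _mm256_div_pd(a.v, b.v); }

// With the operators in place, the function from the question can be written
// once and evaluates 4 doubles per call.
inline Vec4d func(Vec4d a, Vec4d b)
{
    return 1.0 / ((a + b) * (a + b));
}

The scalar constructor is what lets 1.0 / x broadcast the constant to all four lanes, so the formula reads the same as its scalar version.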

However, I would not say that this is the biggest limitation of SIMD programming vs. GPGPU. The main advantage of GPGPU is the FLOPS count, but it comes at a cost: GPGPUs don't handle branches very well, they don't do well with threads that work on large amounts of local data, and so on. Another limitation is that the GPGPU programming model is rather complex compared with traditional coding.

On a CPU you can run more general code, and the compiler will vectorize it most of the time without asking the programmer to write specific intrinsics.
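
As a sketch of what that looks like in practice (func_array is a hypothetical name; whether the compiler actually vectorizes the loop depends on optimization level, target flags and aliasing information):

// A plain scalar loop; compiled with e.g. -O3 -mavx, current compilers will
// typically turn the body into packed AVX instructions on their own.
// __restrict tells the compiler the arrays don't overlap.
void func_array(double* __restrict c, const double* __restrict a,
                const double* __restrict b, int n)
{
    for (int i = 0; i < n; ++i)
        c[i] = 1.0 / ((a[i] + b[i]) * (a[i] + b[i]));
}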

So I'd go further and say that simple code is actually an advantage for CPUs. Consider the amount of effort necessary to port 20-year-old FORTRAN software to a GPGPU, whereas with a good compiler and a good CPU (with a good FLOP count) you might get the expected performance as it is.

