
Apply a given function on a 256-bit vector using SIMD paradigm

Is there a way to evaluate a function along a __m256d/s vector? Like this:

#include <immintrin.h>

inline __m256d func(__m256d *a, __m256d *b)
{
    return 1 / ((*a + *b) * (*a + *b));
}

int main()
{
    __m256d a = _mm256_set_pd(1.0, 2.0, 3.0, 4.0);
    __m256d b = _mm256_set_pd(1.0, 2.0, 3.0, 4.0);
    __m256d c = func(&a, &b);

    return 0;
}
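
For reference, the same computation spelled out with explicit AVX intrinsics (what I would like to avoid writing by hand for every formula) might look roughly like this; func_intrin is just an illustrative name:

#include <immintrin.h>

// Sketch: c = 1 / ((a + b) * (a + b)) written with explicit intrinsics,
// evaluated on 4 doubles at a time.
inline __m256d func_intrin(__m256d a, __m256d b)
{
    __m256d s = _mm256_add_pd(a, b);               // a + b
    __m256d p = _mm256_mul_pd(s, s);               // (a + b)^2
    return _mm256_div_pd(_mm256_set1_pd(1.0), p);  // 1 / (a + b)^2
}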

I would like to evaluate any given mathematical function using the SIMD paradigm. If this isn't possible, wouldn't it be the biggest limitation of SIMD programming vs. GPGPU? I mean, I've realized that the compute power of CPUs in terms of FLOPS is getting closer to that of GPUs; some comparisons:

  • Nvidia Quadro K6000 ~ 5196 GFLOPS
  • Nvidia Quadro K5000 ~ 2169 GFLOPS
  • Intel Xeon E5-2699 v3 ~ 1728 GFLOPS (18 cores * 32 FLOP/cycle * 3 GHz)

Future guesses:

  • AVX-512 and a probable 20-core Xeon CPU: 3840 GFLOPS (20 cores * 64 FLOP/cycle * 3 GHz)

  • Knights Landing: 5907 GFLOPS (71 cores * 64 FLOP/cycle * 1.3 GHz)

Your question is very interesting. What you are describing cannot be done using existing compilers. But if you overload the basic operators to handle 256-bit vectors, you might be able to get close to the functionality you want.
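
A minimal sketch of that approach, assuming a compiler with AVX enabled; the Vec4d wrapper name and its operator set are illustrative for this sketch (libraries such as Agner Fog's Vector Class Library follow the same idea):

#include <immintrin.h>

// Illustrative wrapper around __m256d; the name Vec4d and the set of
// operators are assumptions for this sketch, not an existing API.
struct Vec4d
{
    __m256d v;
    Vec4d(__m256d x) : v(x) {}
    Vec4d(double x) : v(_mm256_set1_pd(x)) {}  // broadcast a scalar to all 4 lanes
};

// Element-wise operators built from AVX intrinsics.
inline Vec4d operator+(Vec4d a, Vec4d b) { return _mm256_add_pd(a.v, b.v); }
inline Vec4d operator*(Vec4d a, Vec4d b) { return _mm256_mul_pd(a.v, b.v); }
inline Vec4d operator/(Vec4d a, Vec4d b) { return _mm256_div_pd(a.v, b.v); }

// With the operators in place, the function from the question can be written
// once and evaluates 4 doubles per call.
inline Vec4d func(Vec4d a, Vec4d b)
{
    return 1.0 / ((a + b) * (a + b));
}

The scalar constructor is what lets 1.0 / x broadcast the constant to all four lanes, so the formula reads the same as its scalar version.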

However, I would not say that this is the biggest limitation of SIMD programming vs. GPGPU. The main advantage of GPGPU is the FLOPS count, but it comes at a cost: GPGPUs don't handle branches very well, they don't do well with threads that work on large amounts of local data, and so on. Another limitation is that the GPGPU programming model is rather complex compared with traditional coding.

On a CPU you can run more general code, and the compiler will vectorize it most of the time without asking the programmer to write specific intrinsics.
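
As a sketch of what that looks like in practice (func_array is a hypothetical name; whether the compiler actually vectorizes the loop depends on optimization level, target flags and aliasing information):

// A plain scalar loop; compiled with e.g. -O3 -mavx, current compilers will
// typically turn the body into packed AVX instructions on their own.
// __restrict tells the compiler the arrays don't overlap.
void func_array(double* __restrict c, const double* __restrict a,
                const double* __restrict b, int n)
{
    for (int i = 0; i < n; ++i)
        c[i] = 1.0 / ((a[i] + b[i]) * (a[i] + b[i]));
}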

So I'd go further and say that simple code is actually an advantage for CPUs. Consider the amount of effort necessary to port 20-year-old FORTRAN software to a GPGPU, whereas with a good compiler and a good CPU (with a good FLOP count) you might get the expected performance as it is.

