
LAPACK/BLAS versus simple “for” loops

I want to migrate a piece of code that involves a number of vector and matrix calculations to C or C++, the objective being to speed up the code as much as possible.

Are linear algebra calculations with for loops in C code as fast as calculations using LAPACK/BLAS, or is there some unique speedup from using those libraries?

In other words, could simple C code (using for loops and the like) perform linear algebra calculations as fast as code that utilizes LAPACK/BLAS?

Vendor-provided LAPACK/BLAS libraries (Intel's IPP/MKL have been mentioned, but there's also AMD's ACML, and other CPU vendors like IBM/Power or Oracle/SPARC provide equivalents as well) are often highly optimized for specific CPU abilities that'll significantly boost performance on large datasets.

Often, though, you've got very specific small data to operate on (say, 4x4 matrices or 4D dot products, i.e. operations used in 3D geometry processing), and for those sorts of things BLAS/LAPACK are overkill, because of the initial tests these subroutines perform to decide which code paths to choose, depending on properties of the data set. In those situations, simple C/C++ source code, maybe using SSE2...4 intrinsics and/or compiler-generated vectorization, may beat BLAS/LAPACK.
That's why, for example, Intel has two libraries - MKL for large linear algebra datasets, and IPP for small (graphics vectors) data sets.

In that sense,

  • what exactly is your data set?
  • What matrix/vector sizes?
  • What linear algebra operations?

Also, regarding "simple for loops": give the compiler the chance to vectorize for you. I.e. something like:

for (i = 0; i < DIM_OF_MY_VECTOR; i += 4) {  /* assumes DIM_OF_MY_VECTOR is a multiple of 4 */
    vecmul[i] = src1[i] * src2[i];
    vecmul[i+1] = src1[i+1] * src2[i+1];
    vecmul[i+2] = src1[i+2] * src2[i+2];
    vecmul[i+3] = src1[i+3] * src2[i+3];
}
for (i = 0; i < DIM_OF_MY_VECTOR; i += 4)
    dotprod += vecmul[i] + vecmul[i+1] + vecmul[i+2] + vecmul[i+3];

might be a better feed to a vectorizing compiler than the plain

for (i = 0; i < DIM_OF_MY_VECTOR; i++) dotprod += src1[i]*src2[i];

expression. In some ways, what you mean by calculations with for loops will have a significant impact.
If your vector dimensions are large enough though, the BLAS version,

dotprod = cblas_ddot(DIM_OF_MY_VECTOR, src1, 1, src2, 1);

will be cleaner code and likely faster.


Probably not. People put quite a bit of work into ensuring that LAPACK/BLAS routines are optimized and numerically stable. While the code is often somewhat on the complex side, it's usually that way for a reason.

Depending on your intended target(s), you might want to look at the Intel Math Kernel Library. At least if you're targeting Intel processors, it's probably the fastest you're going to find.

Numerical analysis is hard. At the very least, you need to be intimately aware of the limitations of floating point arithmetic, and know how to sequence operations so that you balance speed with numerical stability. This is non-trivial.

You need to actually have some clue about the balance between speed and stability you actually need. In more general software development, premature optimization is the root of all evil. In numerical analysis, it is the name of the game. If you don't get the balance right the first time, you will have to re-write more-or-less all of it.

And it gets harder when you try to adapt linear algebra proofs into algorithms. You need to actually understand the algebra, so that you can refactor it into a stable (or stable enough) algorithm.

If I were you, I'd target the LAPACK/BLAS API and shop around for the library that works for your data set.

You have plenty of options: LAPACK/BLAS, GSL and other self-optimizing libraries, vendor libraries.

I don't know these libraries very well. But you should consider that library routines usually run a number of checks on their parameters, have an error-reporting mechanism, and even assign new variables when you call a function... If the calculations are trivial, maybe you can try doing it yourself, adapting it to your needs...
