
iOS 4 Accelerate Cblas with 4x4 matrices

I've been looking into the Accelerate framework that became available in iOS 4. Specifically, I made some attempts to use the Cblas routines in my C linear algebra library. However, I can't get these functions to give me any performance gain over very basic routines, specifically in the case of 4x4 matrix multiplication. Wherever I couldn't make use of affine or homogeneous properties of the matrices, I've been using this routine (abridged):

float *mat4SetMat4Mult(const float *m0, const float *m1, float *target) {
    target[0] = m0[0] * m1[0] + m0[4] * m1[1] + m0[8] * m1[2] + m0[12] * m1[3];
    target[1] = ...etc...
    ...
    target[15] = m0[3] * m1[12] + m0[7] * m1[13] + m0[11] * m1[14] + m0[15] * m1[15];
    return target;
}

The equivalent Cblas function call is:

cblas_sgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
   4, 4, 4, 1.f, m0, 4, m1, 4, 0.f, target, 4);

Comparing the two by running them through a large number of pre-computed matrices filled with random numbers (each function gets exactly the same input every time), the Cblas routine performs about 4x slower when timed with the C clock() function.
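For reference, a benchmark of this shape can be sketched as below. This is a minimal, hypothetical harness (the names `mat4_mult_ref` and `bench_mat4` are illustrative, not from the original library); the cblas_sgemm variant would be timed through the same callback.

```c
#include <time.h>

/* Function-pointer type matching the question's multiply signature. */
typedef float *(*mat4_mult_fn)(const float *, const float *, float *);

/* Loop-based reference 4x4 multiply, column-major: element (i,j) of the
   result is the dot product of row i of m0 with column j of m1. */
static float *mat4_mult_ref(const float *m0, const float *m1, float *target) {
    for (int j = 0; j < 4; ++j)
        for (int i = 0; i < 4; ++i) {
            float s = 0.f;
            for (int k = 0; k < 4; ++k)
                s += m0[k * 4 + i] * m1[j * 4 + k];
            target[j * 4 + i] = s;
        }
    return target;
}

/* Times n multiplies of the same matrix pair, in seconds via clock(). */
static double bench_mat4(mat4_mult_fn fn, const float *m0, const float *m1,
                         float *target, long n) {
    clock_t start = clock();
    for (long i = 0; i < n; ++i)
        fn(m0, m1, target);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Note that clock() measures CPU time at fairly coarse resolution, so the iteration count needs to be large enough for the elapsed time to be meaningful.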

This doesn't seem right to me, and I'm left with the feeling that I'm doing something wrong somewhere. Do I have to enable the device's NEON unit and SIMD functionality somehow? Or shouldn't I hope for better performance with such small matrices?

Very much appreciated,

Bastiaan

The Apple WWDC 2010 presentations say that Accelerate should still give a speedup even for a 3x3 matrix operation, so I would have assumed you should see a slight improvement for 4x4. But something you need to consider is that Accelerate & NEON are designed to greatly speed up integer operations, but not necessarily floating-point operations. You didn't mention your CPU, and it seems that Accelerate will use either NEON or VFP for floating-point operations depending on your CPU. If it uses NEON instructions for 32-bit float operations then it should run fast, but if it uses VFP for 32-bit float or 64-bit double operations, then it will run very slowly (since VFP is not actually SIMD). So you should make sure that you are using 32-bit float operations with Accelerate, and make sure it will use NEON instead of VFP.
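For your own C code (as opposed to the Accelerate library itself), NEON code generation typically has to be requested explicitly. A plausible set of GCC flags for an armv7 build of that era is sketched below; note that NEON floating point is not fully IEEE 754 compliant, so compilers generally only auto-vectorize float math when the rules are relaxed with something like -ffast-math:

```shell
# Ask for NEON, the soft-float calling convention used on armv7 iOS,
# and relaxed float semantics so the vectorizer may use NEON for floats.
gcc -O3 -mfpu=neon -mfloat-abi=softfp -ffast-math -c mat4.c
```

Exact flags depend on your toolchain and target; within Xcode the equivalent settings are exposed through the build configuration rather than passed by hand.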

And another issue is that even if it does use NEON, there is no guarantee that your C compiler will generate faster NEON code than your simple C function achieves without NEON instructions, because C compilers such as GCC often generate terrible SIMD code, potentially running slower than standard code. That's why it's always important to test the speed of the generated code, and possibly to manually inspect the generated assembly to see whether your compiler produced bad code.
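Inspecting the generated code is straightforward with standard tools; one way to do it (filenames are illustrative):

```shell
# Emit the compiler's assembly for review. In the output, NEON work shows
# up as q/d-register instructions such as vmul.f32 and vmla.f32, while
# scalar VFP code uses s-registers.
gcc -O3 -mfpu=neon -mfloat-abi=softfp -S mat4.c -o mat4.s

# Alternatively, disassemble an already-built Mach-O object on a Mac:
otool -tv mat4.o
```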

The BLAS and LAPACK libraries are designed for use with what I would consider "medium to large" matrices (from tens to tens of thousands on a side). They will deliver correct results for smaller matrices, but the performance will not be as good as it could be.

There are several reasons for this:

  • In order to deliver top performance, 3x3 and 4x4 matrix operations must be inlined, not hidden behind a library call; the overhead of making a function call is simply too large to overcome when there is so little work to be done.
  • An entirely different set of interfaces would be necessary to deliver top performance. The BLAS interface for matrix multiply takes variables to specify the sizes and leading dimensions of the matrices involved in the computation, not to mention whether to transpose the matrices and the storage layout. All those parameters make the library powerful, and they don't hurt performance for large matrices. However, by the time it has finished determining that you are doing a 4x4 computation, a function dedicated to 4x4 matrix operations and nothing else would already be finished.
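To make the inlining point concrete, here is a sketch of what such a dedicated routine might look like (the name and exact shape are illustrative, not an Apple API). Declared static inline in a header, it gives the compiler the fixed 4x4 sizes at compile time and lets it eliminate the call overhead entirely; restrict additionally promises that the output does not alias the inputs:

```c
/* Dedicated column-major 4x4 multiply, intended to live in a header so
   every call site can be inlined. Hoisting the column of m1 into locals
   mirrors the fully unrolled routine from the question. */
static inline void mat4_mult_inline(const float *restrict m0,
                                    const float *restrict m1,
                                    float *restrict target) {
    for (int j = 0; j < 4; ++j) {
        const float b0 = m1[4*j + 0], b1 = m1[4*j + 1],
                    b2 = m1[4*j + 2], b3 = m1[4*j + 3];
        for (int i = 0; i < 4; ++i)
            target[4*j + i] = m0[i]      * b0 + m0[4 + i]  * b1
                            + m0[8 + i]  * b2 + m0[12 + i] * b3;
    }
}
```

With the sizes fixed and the loops this small, an optimizing compiler can fully unroll the body, which is exactly the work a generic sgemm has to re-derive from its parameters on every call.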

What this means for you: if you would like to have dedicated small-matrix operations provided, please go to bugreport.apple.com and file a bug requesting this feature.
