
iOS BLAS - Accelerate framework poor matrix multiplication performance

I'm implementing a tangent distance based OCR solution for the iPhone, which heavily relies on fast multiplication of floating-point matrices of size 253x7. For the proof of concept, I've implemented my own naive matrix routines like this:

Matrix operator*(const Matrix& matrix) const {
    if(cols != matrix.rows) throw "can't multiply!";

    Matrix result(rows, matrix.cols);
    for(int i = 0; i < result.rows; i++){
        for(int j = 0; j < result.cols; j++){
            T tmp = 0;
            for(int k = 0; k < cols; k++){
                tmp += at(i,k) * matrix.at(k,j);
            }
            result.at(i,j) = tmp;
        }
    }

    return result;
}

As you can see, it's pretty basic. After the PoC performed well, I decided to push the performance limits further by incorporating the Accelerate framework's matrix multiplication (which presumably uses SIMD, and other fancy stuff, to do the heavy lifting...):

Matrix operator*(const Matrix& m) const {
    if(cols != m.rows) throw "can't multiply!";

    Matrix result(rows, m.cols);

    // C = alpha*A*B + beta*C; with beta == 1 the routine accumulates into
    // `result`, so result.matrix must be zero-initialized beforehand
    // (beta == 0 would make C write-only).
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                rows, m.cols, cols,        // M, N, K
                1, matrix, cols,           // alpha, A, lda
                m.matrix, m.cols,          // B, ldb
                1, result.matrix, result.cols); // beta, C, ldc

    return result;
}

Shockingly (at least for me), the above code took double the time to multiply the matrices! I tried using single precision instead of double, because I suspected it was something related to the CPU's word size (32-bit float vs. 64-bit double on a 32-bit ARM), but saw no performance gain...

What am I doing wrong? Are my 253x7 matrices too small for a noticeable performance boost over the naive implementation?

A couple of questions:

  1. 253 x 7 multiplied by what size matrix? If you're doing, say, 253x7 * 7x1, then a general-purpose multiply routine is going to spend most of its time in edging code, and there's very little that a tuned library can do that will make it faster than a naive implementation.

  2. What hardware are you timing on, and what iOS version? Especially for double precision, older hardware and older iOS versions are more limited performance-wise. On a Cortex-A8, for example, double-precision arithmetic is completely unpipelined, so there's almost nothing a library can do to beat a naive implementation.

If the other matrix isn't ridiculously small, and the hardware is recent, please file a bug (unexpectedly low performance is absolutely a bug). Small matrices with high aspect ratio are quite difficult to make fast in a general-purpose matrix-multiply, but it's still a good bug to file.

If the hardware/iOS version is old, you may want to use Accelerate anyway, as it should perform significantly better on newer hardware/software.

If the other matrix is just tiny, then there may not be much to be done. There is no double-precision SIMD on ARM, the matrices are too small to benefit from cache blocking, and the dimensions of the matrices would also be too small to benefit much from loop unrolling.


If you know a priori that your matrices will be exactly 253x7 * 7x???, you should be able to do much better than both a naive implementation and any general-purpose library by completely unrolling the inner dimension of the matrix multiplication.

Basically, yes. The "x7" portion is likely too small to make the overhead of CBLAS worth it. The cost of making a function call, plus all the flexibility that the CBLAS functions give you, takes a while to earn back. Every time you pass an option like CblasNoTrans, remember that there's an if() in there to manage that option. cblas_dgemm in particular accumulates into C, so it has to read the previous result element, apply a multiply, and then add before storing. That's a lot of extra work.
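For reference, the operation the BLAS gemm routines implement is a scaled multiply-accumulate rather than a plain product:

```latex
C \leftarrow \alpha A B + \beta C
```

With beta = 1, as the call in the question passes, every element of C is read back, scaled, and added to before being stored; passing beta = 0 lets the routine treat C as write-only and skip the read-back.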

You may want to try the vDSP functions rather than CBLAS. vDSP_mmul is a bit simpler and doesn't accumulate into the result. I've had good luck with vDSP_* on small data sets (a few thousand elements).

That said, my experience with this is that naïve C implementations can often be quite fast on small data sets. Avoiding a function call is a huge benefit. Speaking of which, make sure that your at() call is inlined. Otherwise you're wasting a lot of time in your loop. You can likely speed up the C implementation by using pointer additions to move serially through your matrices rather than multiplies (which are required for random access through []). On a matrix this small, it may or may not be worth it; you'd have to profile a bit. Looking at the assembler output is highly instructive.

Keep in mind that you absolutely must profile this stuff on the device. Performance in the simulator is irrelevant. It's not just that the simulator is faster; it's completely different. Things that are wildly faster on the simulator can be much slower on device.
