Fast 4x4 Matrix Multiplication in C

I am trying to find an optimized C or assembler implementation of a function that multiplies two 4x4 matrices. The platform is an ARMv6- or ARMv7-based iPhone or iPod.

Currently, I am using a fairly standard approach, just a little loop-unrolled.

#define O(y,x) (y + (x<<2))  /* offset of element (y, x) */

static inline void Matrix4x4MultiplyBy4x4 (float *src1, float *src2, float *dest)
{
    *(dest+O(0,0)) = (*(src1+O(0,0)) * *(src2+O(0,0))) + (*(src1+O(0,1)) * *(src2+O(1,0))) + (*(src1+O(0,2)) * *(src2+O(2,0))) + (*(src1+O(0,3)) * *(src2+O(3,0))); 
    *(dest+O(0,1)) = (*(src1+O(0,0)) * *(src2+O(0,1))) + (*(src1+O(0,1)) * *(src2+O(1,1))) + (*(src1+O(0,2)) * *(src2+O(2,1))) + (*(src1+O(0,3)) * *(src2+O(3,1))); 
    *(dest+O(0,2)) = (*(src1+O(0,0)) * *(src2+O(0,2))) + (*(src1+O(0,1)) * *(src2+O(1,2))) + (*(src1+O(0,2)) * *(src2+O(2,2))) + (*(src1+O(0,3)) * *(src2+O(3,2))); 
    *(dest+O(0,3)) = (*(src1+O(0,0)) * *(src2+O(0,3))) + (*(src1+O(0,1)) * *(src2+O(1,3))) + (*(src1+O(0,2)) * *(src2+O(2,3))) + (*(src1+O(0,3)) * *(src2+O(3,3))); 
    *(dest+O(1,0)) = (*(src1+O(1,0)) * *(src2+O(0,0))) + (*(src1+O(1,1)) * *(src2+O(1,0))) + (*(src1+O(1,2)) * *(src2+O(2,0))) + (*(src1+O(1,3)) * *(src2+O(3,0))); 
    *(dest+O(1,1)) = (*(src1+O(1,0)) * *(src2+O(0,1))) + (*(src1+O(1,1)) * *(src2+O(1,1))) + (*(src1+O(1,2)) * *(src2+O(2,1))) + (*(src1+O(1,3)) * *(src2+O(3,1))); 
    *(dest+O(1,2)) = (*(src1+O(1,0)) * *(src2+O(0,2))) + (*(src1+O(1,1)) * *(src2+O(1,2))) + (*(src1+O(1,2)) * *(src2+O(2,2))) + (*(src1+O(1,3)) * *(src2+O(3,2))); 
    *(dest+O(1,3)) = (*(src1+O(1,0)) * *(src2+O(0,3))) + (*(src1+O(1,1)) * *(src2+O(1,3))) + (*(src1+O(1,2)) * *(src2+O(2,3))) + (*(src1+O(1,3)) * *(src2+O(3,3))); 
    *(dest+O(2,0)) = (*(src1+O(2,0)) * *(src2+O(0,0))) + (*(src1+O(2,1)) * *(src2+O(1,0))) + (*(src1+O(2,2)) * *(src2+O(2,0))) + (*(src1+O(2,3)) * *(src2+O(3,0))); 
    *(dest+O(2,1)) = (*(src1+O(2,0)) * *(src2+O(0,1))) + (*(src1+O(2,1)) * *(src2+O(1,1))) + (*(src1+O(2,2)) * *(src2+O(2,1))) + (*(src1+O(2,3)) * *(src2+O(3,1))); 
    *(dest+O(2,2)) = (*(src1+O(2,0)) * *(src2+O(0,2))) + (*(src1+O(2,1)) * *(src2+O(1,2))) + (*(src1+O(2,2)) * *(src2+O(2,2))) + (*(src1+O(2,3)) * *(src2+O(3,2))); 
    *(dest+O(2,3)) = (*(src1+O(2,0)) * *(src2+O(0,3))) + (*(src1+O(2,1)) * *(src2+O(1,3))) + (*(src1+O(2,2)) * *(src2+O(2,3))) + (*(src1+O(2,3)) * *(src2+O(3,3))); 
    *(dest+O(3,0)) = (*(src1+O(3,0)) * *(src2+O(0,0))) + (*(src1+O(3,1)) * *(src2+O(1,0))) + (*(src1+O(3,2)) * *(src2+O(2,0))) + (*(src1+O(3,3)) * *(src2+O(3,0))); 
    *(dest+O(3,1)) = (*(src1+O(3,0)) * *(src2+O(0,1))) + (*(src1+O(3,1)) * *(src2+O(1,1))) + (*(src1+O(3,2)) * *(src2+O(2,1))) + (*(src1+O(3,3)) * *(src2+O(3,1))); 
    *(dest+O(3,2)) = (*(src1+O(3,0)) * *(src2+O(0,2))) + (*(src1+O(3,1)) * *(src2+O(1,2))) + (*(src1+O(3,2)) * *(src2+O(2,2))) + (*(src1+O(3,3)) * *(src2+O(3,2))); 
    *(dest+O(3,3)) = (*(src1+O(3,0)) * *(src2+O(0,3))) + (*(src1+O(3,1)) * *(src2+O(1,3))) + (*(src1+O(3,2)) * *(src2+O(2,3))) + (*(src1+O(3,3)) * *(src2+O(3,3))); 
}

Would I benefit from using the Strassen or Coppersmith–Winograd algorithm?

No, the Strassen or Coppersmith–Winograd algorithm wouldn't make much difference here. Those algorithms trade multiplications for extra additions and bookkeeping, so they only start to pay off for much larger matrices; for a 4x4 product the constant overhead dominates.

If your matrix multiplication really is a bottleneck, you could rewrite the algorithm using NEON SIMD instructions. That would only help on ARMv7, though, as ARMv6 does not have this extension.

I'd expect roughly a 3x speedup over the compiled C code for your case.

EDIT: You can find a nice ARM NEON implementation here: http://code.google.com/p/math-neon/

For your C code, there are two things you could do to speed it up:

  1. Don't inline the function. Your matrix multiplication generates quite a bit of code as it's unrolled, and the ARM has only a very tiny instruction cache. Excessive inlining can make your code slower because the CPU will be busy loading code into the cache instead of executing it.

  2. Use the restrict keyword to tell the compiler that the source and destination pointers don't overlap in memory. Currently, the compiler is forced to reload every source value from memory whenever a result is written, because it has to assume that source and destination may overlap or even point to the same memory.

Just nitpicking: I wonder why people still obfuscate their code voluntarily? C is already difficult to read; no need to add to it.

static inline void Matrix4x4MultiplyBy4x4 (float src1[4][4], float src2[4][4], float dest[4][4])
{
    dest[0][0] = src1[0][0] * src2[0][0] + src1[0][1] * src2[1][0] + src1[0][2] * src2[2][0] + src1[0][3] * src2[3][0];
    dest[0][1] = src1[0][0] * src2[0][1] + src1[0][1] * src2[1][1] + src1[0][2] * src2[2][1] + src1[0][3] * src2[3][1];
    dest[0][2] = src1[0][0] * src2[0][2] + src1[0][1] * src2[1][2] + src1[0][2] * src2[2][2] + src1[0][3] * src2[3][2];
    dest[0][3] = src1[0][0] * src2[0][3] + src1[0][1] * src2[1][3] + src1[0][2] * src2[2][3] + src1[0][3] * src2[3][3];
    dest[1][0] = src1[1][0] * src2[0][0] + src1[1][1] * src2[1][0] + src1[1][2] * src2[2][0] + src1[1][3] * src2[3][0];
    dest[1][1] = src1[1][0] * src2[0][1] + src1[1][1] * src2[1][1] + src1[1][2] * src2[2][1] + src1[1][3] * src2[3][1];
    dest[1][2] = src1[1][0] * src2[0][2] + src1[1][1] * src2[1][2] + src1[1][2] * src2[2][2] + src1[1][3] * src2[3][2];
    dest[1][3] = src1[1][0] * src2[0][3] + src1[1][1] * src2[1][3] + src1[1][2] * src2[2][3] + src1[1][3] * src2[3][3];
    dest[2][0] = src1[2][0] * src2[0][0] + src1[2][1] * src2[1][0] + src1[2][2] * src2[2][0] + src1[2][3] * src2[3][0];
    dest[2][1] = src1[2][0] * src2[0][1] + src1[2][1] * src2[1][1] + src1[2][2] * src2[2][1] + src1[2][3] * src2[3][1];
    dest[2][2] = src1[2][0] * src2[0][2] + src1[2][1] * src2[1][2] + src1[2][2] * src2[2][2] + src1[2][3] * src2[3][2];
    dest[2][3] = src1[2][0] * src2[0][3] + src1[2][1] * src2[1][3] + src1[2][2] * src2[2][3] + src1[2][3] * src2[3][3];
    dest[3][0] = src1[3][0] * src2[0][0] + src1[3][1] * src2[1][0] + src1[3][2] * src2[2][0] + src1[3][3] * src2[3][0];
    dest[3][1] = src1[3][0] * src2[0][1] + src1[3][1] * src2[1][1] + src1[3][2] * src2[2][1] + src1[3][3] * src2[3][1];
    dest[3][2] = src1[3][0] * src2[0][2] + src1[3][1] * src2[1][2] + src1[3][2] * src2[2][2] + src1[3][3] * src2[3][2];
    dest[3][3] = src1[3][0] * src2[0][3] + src1[3][1] * src2[1][3] + src1[3][2] * src2[2][3] + src1[3][3] * src2[3][3];
}

Are you sure that your unrolled code is faster than an explicit loop-based approach? Mind that compilers are usually better at performing optimizations than humans!

In fact, I'd bet a compiler has a better chance of automatically emitting SIMD instructions from a well-written loop than from a series of "unrelated" statements...

You could also specify the matrix sizes in the argument declarations. Then you could use normal bracket syntax to access the elements, and it is also a good hint for the compiler's optimisations.

Your completely unrolled traditional product is likely pretty fast.

Your matrix is too small to overcome the overhead of managing a Strassen multiplication in its traditional form, with its explicit indexing and partitioning code; you'd likely lose any optimization gains to that overhead.

But if you want fast, I'd use SIMD instructions if they are available. I'd be surprised if today's ARM chips don't have them. If they do, you can compute all the products of a row/column pair in a single instruction; if the SIMD unit is 8 wide, you might manage two row/column multiplies per instruction. Setting up the operands for that instruction might require some dancing around: SIMD instructions will easily pick up your rows (adjacent values), but not your columns (non-contiguous). And it may take some effort to compute the sum of the products of a row/column pair.

Are these arbitrary matrices, or do they have any symmetries? If so, those symmetries can often be exploited for improved performance (for example, in rotation matrices).

Also, I agree with fortran above, and would run some timing tests to verify that your hand-unrolled code is faster than what an optimizing compiler can create. At the least, you may be able to simplify your code.

Paul
