Efficient Embedded Fixed-Point 2x2 Matrix Multiplication in ARM Cortex-M4 C Code

Question

I am trying to implement a VERY efficient 2x2 matrix multiplication in C code for operation in an ARM Cortex-M4. The function accepts 3 pointers to 2x2 arrays: 2 for the inputs to be multiplied and an output buffer passed by the calling function. Here is what I have so far...

static inline void multiply_2x2_2x2(int16_t a[2][2], int16_t b[2][2], int32_t c[2][2])
{
  int32_t a00a01, a10a11, b00b10, b01b11;

  a00a01 = a[0][0] | a[0][1]<<16;
  b00b10 = b[0][0] | b[1][0]<<16;
  b01b11 = b[0][1] | b[1][1]<<16;
  c[0][0] = __SMUAD(a00a01, b00b10);
  c[0][1] = __SMUAD(a00a01, b01b11);

  a10a11 = a[1][0] | a[1][1]<<16;
  c[1][0] = __SMUAD(a10a11, b00b10);
  c[1][1] = __SMUAD(a10a11, b01b11);
}

Basically, my strategy is to use the ARM Cortex-M4 __SMUAD() function to do the actual multiply-accumulates. But this requires me to build the inputs a00a01, a10a11, b00b10, and b01b11 ahead of time. My question is: given that a C array should be contiguous in memory, is there a more efficient way to pass the data into the function directly, without the intermediate variables? Secondary question: am I overthinking this, and should I just let the compiler do its job, as it is smarter than I am? I tend to do that a lot.

Thanks!

Answer 1

You could break the strict aliasing rules and load the matrix row directly into a 32-bit register, using an int16_t* to int32_t* typecast. An expression such as a00a01 = a[0][0] | a[0][1]<<16 just takes some consecutive bits from RAM and arranges them into other consecutive bits in registers. Consult your compiler manual for the flag that disables its strict aliasing assumptions and makes the cast safe to use.
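As a hedged alternative to the pointer cast, the same 32-bit load can be written with memcpy, which does not depend on any aliasing flag. This is a different technique from the cast described above, not a restatement of it. The helper name load_row is hypothetical, and whether the copy collapses to a single load instruction is something to confirm in the disassembly for your compiler and optimization level.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: read two adjacent int16_t values (one matrix row)
   as a single 32-bit word. memcpy replaces the int16_t* to int32_t* cast,
   so the strict aliasing rules are not involved. */
static inline uint32_t load_row(const int16_t row[2])
{
    uint32_t packed;
    memcpy(&packed, row, sizeof packed);  /* fixed-size 4-byte copy */
    return packed;
}
```

A call such as load_row(a[0]) would then yield a word that can be handed to __SMUAD() directly. Which element lands in which half of the word depends on the byte order of the target.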

You could also perhaps avoid transposing matrix columns into registers, by generating b in transposed format in the first place.
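A minimal sketch of the transposed-storage idea, under stated assumptions: the caller supplies bt with bt[j][k] == b[k][j], so every column of b is already a contiguous row. The names dual_mac16 and multiply_2x2_bt are hypothetical, and dual_mac16 is a portable stand-in so the sketch also runs off-target; on a Cortex-M4 build the __SMUAD() intrinsic from the question would take its place.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical portable stand-in for the SMUAD operation: the signed
   product of one pair of 16-bit halves plus the signed product of the
   other pair. */
static int32_t dual_mac16(uint32_t x, uint32_t y)
{
    int16_t xs[2], ys[2];
    memcpy(xs, &x, sizeof xs);   /* split each word into its two halves */
    memcpy(ys, &y, sizeof ys);
    /* The sum fits in int32_t unless all four halves are -32768. */
    int64_t sum = (int64_t)xs[0] * ys[0] + (int64_t)xs[1] * ys[1];
    return (int32_t)sum;
}

/* Assumption: bt holds b transposed, i.e. bt[j][k] == b[k][j]. */
static void multiply_2x2_bt(int16_t a[2][2], int16_t bt[2][2], int32_t c[2][2])
{
    uint32_t a_rows[2], bt_rows[2];
    memcpy(a_rows, a, sizeof a_rows);     /* a_rows[i] holds a[i][0], a[i][1] */
    memcpy(bt_rows, bt, sizeof bt_rows);  /* bt_rows[j] holds b[0][j], b[1][j] */
    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 2; j++)
            c[i][j] = dual_mac16(a_rows[i], bt_rows[j]);
}
```

With this layout no shifting or OR-ing is needed for either operand: c[i][j] is the dot product of row i of a and row j of bt.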

The best way to learn about the compiler, and get a sense of the cases for which it's smarter than you, is to disassemble its results and compare the instruction sequence to your intentions.

Answer 2

The first main concern is that some_signed_int << 16 invokes undefined behavior for negative numbers. So you have bugs all over. And the bitwise OR of two int16_t values, where either is negative, does not necessarily form a valid int32_t either: after integer promotion, a negative low half is sign-extended, so its set upper bits overwrite the high half. Do you actually need the sign, or can you drop it?
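One way to pack the two halves without shifting a signed value is to go through unsigned types first. This is a minimal sketch; the helper name pack16 is hypothetical. Converting int16_t to uint16_t is well defined in C (the value wraps modulo 65536), so each half keeps its raw two's complement bit pattern and nothing is sign-extended into the other half.

```c
#include <stdint.h>

/* Hypothetical helper: place lo in bits 15..0 and hi in bits 31..16.
   Both values are converted to unsigned before the shift and the OR,
   so no signed shift and no sign extension takes place. */
static inline uint32_t pack16(int16_t lo, int16_t hi)
{
    return (uint32_t)(uint16_t)lo | ((uint32_t)(uint16_t)hi << 16);
}
```

For example, pack16(-1, 1) gives 0x0001FFFF, whereas the expression -1 | 1<<16 from the question evaluates to -1 (all bits set) and loses the high half.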

ARM's examples use unsigned int, which in turn is supposed to contain 2x int16_t in raw binary form. This is what you actually want too.

Also, it would seem that it shouldn't matter to SMUAD which 16-bit word you place where, as long as both operands are packed the same way. So the a[0][0] | a[0][1]<<16; just serves to needlessly swap data around in memory. It will confuse the compiler, which can't optimize such code well. Sure, shifts etc. are always very fast, but this is pointless overhead.

(As someone noted, this whole thing is likely much easier to write in pure assembler, without concern for all the C type rules and undefined behavior.)

To avoid all these issues you could define your own union type:

typedef union
{
  int16_t  i16 [2][2];
  uint32_t u32 [2];
} mat2x2_t;

u32[0] corresponds to i16[0][0] and i16[0][1]
u32[1] corresponds to i16[1][0] and i16[1][1]

C actually lets you "type pun" between these types pretty wildly (unlike C++). Unions also dodge the brittle strict aliasing rules.

The function can then become something along the lines of this pseudo code:

static uint32_t mat_mul16 (mat2x2_t a, mat2x2_t b)
{
   uint32_t c0 = __SMUAD(a.u32[0], b.u32[0]);
   ...
}

Supposedly, each such line should give 2x signed 16-bit multiplications, as per the SMUAD instruction.
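To make the union idea concrete, here is one possible completion, offered as a sketch rather than as the function the pseudo code elides. Two things are assumptions: the second operand is stored transposed (bt.i16[j][k] == b[k][j]), because pairing a.u32[i] with a row word only yields a matrix product in that layout, and smuad_ref is a hypothetical portable stand-in for __SMUAD() so the sketch can also run off-target.

```c
#include <stdint.h>

typedef union
{
    int16_t  i16[2][2];
    uint32_t u32[2];
} mat2x2_t;

/* Sign-extend a 16-bit field held in the low bits of h (0..0xFFFF). */
static int32_t sext16(uint32_t h)
{
    return (int32_t)(h ^ 0x8000u) - 0x8000;
}

/* Hypothetical stand-in for __SMUAD: signed low*low + high*high. */
static int32_t smuad_ref(uint32_t x, uint32_t y)
{
    int64_t lo = (int64_t)sext16(x & 0xFFFFu) * sext16(y & 0xFFFFu);
    int64_t hi = (int64_t)sext16(x >> 16) * sext16(y >> 16);
    /* The sum fits in int32_t unless all four halves are -32768. */
    return (int32_t)(lo + hi);
}

/* Assumption: bt holds the second matrix transposed. */
static void mat_mul16(const mat2x2_t *a, const mat2x2_t *bt, int32_t c[2][2])
{
    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 2; j++)
            c[i][j] = smuad_ref(a->u32[i], bt->u32[j]);
}
```

The matrices are filled through the i16 member and read back through u32, which is the type punning the union permits in C. Because the two products are summed, the result does not depend on which element ends up in which half of the word.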

As for whether this actually gives some revolutionary performance increase compared to some default MUL, I kind of doubt it. Disassemble and count CPU ticks.
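For that comparison a plain baseline is useful. The sketch below is the straightforward version with ordinary multiplications and no intrinsics; the name multiply_2x2_plain is hypothetical. Which instructions the compiler emits for it on a Cortex-M4 is not something to assume, so compare its disassembly and cycle count against the hand-packed version.

```c
#include <stdint.h>

/* Baseline: ordinary C arithmetic, no packing and no intrinsics.
   Each product of two int16_t values fits in int32_t; the sum of two
   products overflows only if all four factors are -32768. */
static void multiply_2x2_plain(int16_t a[2][2], int16_t b[2][2], int32_t c[2][2])
{
    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 2; j++)
            c[i][j] = (int32_t)a[i][0] * b[0][j]
                    + (int32_t)a[i][1] * b[1][j];
}
```

It takes b in its normal, untransposed layout, so it can also serve as a reference for checking the results of any optimized variant.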

"am I overthinking this and I should just let the compiler do its job as it is smarter than I am?"

Most likely :) The old rule of thumb: benchmark, and then only manually optimize at the point when you've actually found a performance bottleneck.
