How can I optimize a looped 4D matrix-vector-multiplication with ARM NEON?

I'm working on optimizing a 4D (128-bit) matrix-vector multiplication using ARM NEON assembler.

If I load the matrix and the vector into NEON registers and transform them, I don't get a big performance boost, because switching to the NEON registers costs about 20 cycles. Furthermore, I reload the matrix for each multiplication, even though it has not changed.

There is enough register space to perform the transformation on several vectors at a time, and this does increase performance.

But.. 但..

I'm wondering how fast this operation would be if I did the loop over all vertices (incrementing the pointers) inside the assembler code. But I'm at the very beginning with NEON assembler and don't know how to do this. Can someone give me a hand with that?

What I want to achieve (a rough sketch follows the list):

  1. load the matrix and the first vector
  2. store the loop count "count" and..
  3. -- LOOP_START --
  4. perform the multiply-adds (do the transformation)
  5. write q0 to vOut
  6. increase the pointers vIn and vOut by 4 floats (128 bit)
  7. load vIn into q5
  8. -- LOOP_END --
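
Roughly, this is what I imagine, written as an untested sketch (register choices just follow the single-vector version further down; I'm not sure the label/branch syntax is right):

void TransformVerticesNeon(const float* m, float* vInOut, int count)
{
    asm volatile
    (
    "vldmia %2, {q1-q4}       \n\t"   // load the matrix once
    "1:                       \n\t"   // -- LOOP_START --
    "vldmia %0, {q5}          \n\t"   // load the current vertex
    "vmul.f32 q0, q1, d10[0]  \n\t"
    "vmla.f32 q0, q2, d10[1]  \n\t"
    "vmla.f32 q0, q3, d11[0]  \n\t"
    "vmla.f32 q0, q4, d11[1]  \n\t"
    "vstmia %0!, {q0}         \n\t"   // write q0 and advance the pointer
    "subs %1, %1, #1          \n\t"   // count--
    "bne 1b                   \n\t"   // -- LOOP_END --
    : "+r" (vInOut), "+r" (count)
    : "r" (m)
    : "memory", "cc", "q0", "q1", "q2", "q3", "q4", "q5"
    );
}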

Existing C version of the loop:

void TransformVertices(ESMatrix* m, GLfloat* vertices, GLfloat* normals, int count)
{
    GLfloat* pVertex = vertices;
    int i;  

    // iterate through the vertices one at a time
    for (i = 0; i < count ; i ++)
    {
        Matrix4Vector4Mul( (float *)m, (float *)pVertex, (float *)pVertex);
        pVertex += 4;
    }

    //LoadMatrix( (const float*) m);

    //// two at a time (Matrix4Vector4Mul2 is sketched below)
    //for (i = 0; i < count ; i += 2)
    //{
    //    Matrix4Vector4Mul2( (float *)m, (float *)pVertex, (float *)(pVertex + 4));
    //      pVertex += 8;
    //}
}
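
For reference, the Matrix4Vector4Mul2 used in the commented-out path looks roughly like this (a sketch only, rewritten from memory; it still reloads the matrix, and the register choices are illustrative):

void Matrix4Vector4Mul2(const float* m, float* vA, float* vB)
{
    asm volatile
    (
    "vldmia %2, {q1-q4}      \n\t"   // matrix
    "vldmia %0, {q5}         \n\t"   // first vector
    "vldmia %1, {q6}         \n\t"   // second vector

    "vmul.f32 q0, q1, d10[0] \n\t"   // first result, independent chain
    "vmul.f32 q7, q1, d12[0] \n\t"   // second result, independent chain
    "vmla.f32 q0, q2, d10[1] \n\t"
    "vmla.f32 q7, q2, d12[1] \n\t"
    "vmla.f32 q0, q3, d11[0] \n\t"
    "vmla.f32 q7, q3, d13[0] \n\t"
    "vmla.f32 q0, q4, d11[1] \n\t"
    "vmla.f32 q7, q4, d13[1] \n\t"

    "vstmia %0, {q0}         \n\t"
    "vstmia %1, {q7}"
    : // no output
    : "r" (vA), "r" (vB), "r" (m)
    : "memory", "q0", "q1", "q2", "q3", "q4", "q5", "q6", "q7"
    );
}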

The following is the NEON version that does only one transformation:

void Matrix4Vector4Mul (const float* m, const float* vIn, float* vOut)
{    
    asm volatile
    (

    "vldmia %1, {q1-q4}      \n\t"   // load the whole matrix into q1-q4
    "vldmia %2, {q5}         \n\t"   // load the input vector into q5

    "vmul.f32 q0, q1, d10[0] \n\t"   // q0  = q1 * vIn.x
    "vmla.f32 q0, q2, d10[1] \n\t"   // q0 += q2 * vIn.y
    "vmla.f32 q0, q3, d11[0] \n\t"   // q0 += q3 * vIn.z
    "vmla.f32 q0, q4, d11[1] \n\t"   // q0 += q4 * vIn.w

    "vstmia %0, {q0}"                // store the result to vOut

    : // no output
    : "r" (vOut), "r" (m), "r" (vIn)       
    : "memory", "q0", "q1", "q2", "q3", "q4", "q5" 
    );

}

C version of the transformation:

void Matrix4Vector4Mul (const float* m, const float* vIn, float* vOut)
{
    Vertex4D* v1 =    (Vertex4D*)vIn;
    Vertex4D vOut1;
    Vertex4D* l0;
    Vertex4D* l1;
    Vertex4D* l2;
    Vertex4D* l3;

    // 4x4 Matrix with members m00 - m33 
    ESMatrix* m1 = (ESMatrix*)m;

    l0 = (Vertex4D*)&m1->m00;
    vOut1.x = l0->x * v1->x;
    vOut1.y = l0->y * v1->x;
    vOut1.z = l0->z * v1->x;
    vOut1.w = l0->w * v1->x;

    l1 = (Vertex4D*)&m1->m10;
    vOut1.x += l1->x * v1->y;
    vOut1.y += l1->y * v1->y;
    vOut1.z += l1->z * v1->y;
    vOut1.w += l1->w * v1->y;

    l2 = (Vertex4D*)&m1->m20;
    vOut1.x += l2->x * v1->z;
    vOut1.y += l2->y * v1->z;
    vOut1.z += l2->z * v1->z;
    vOut1.w += l2->w * v1->z;

    l3 = (Vertex4D*)&m1->m30;
    vOut1.x += l3->x * v1->w;
    vOut1.y += l3->y * v1->w;
    vOut1.z += l3->z * v1->w;
    vOut1.w += l3->w * v1->w;

    *(vOut) = vOut1.x;
    *(vOut + 1) = vOut1.y;
    *(vOut + 2) = vOut1.z;
    *(vOut + 3) = vOut1.w;
}

Performance (transforming > 90,000 vertices | Android 4.0.4, SGS II):

C-Version:    190 FPS 
NEON-Version: 162 FPS ( .. slower -.- )

--- LOAD Matrix only ONCE (separate ASM) and then perform two V's at a time ---

NEON-Version: 217 FPS ( + 33 % NEON | + 14 % C-Code )

Did you try playing with compiler flags?

-mcpu=cortex-a9 -mtune=cortex-a9 -mfloat-abi=softfp -mfpu=neon -O3

does a pretty good job for me in this case (gcc 4.4.3, distributed with Android NDK 8b). Try to keep the source code tight: define internal functions static and inline, move the matrix (the m[X][0] stuff) into static global variables, or just merge Matrix4Vector4Mul into the loop and keep the matrix in local variables instead of passing it into a function on every call - gcc doesn't get smart about that on its own.
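
For example, a minimal sketch of that merged loop, assuming the ESMatrix members are named m00..m33 as in the question (untested, transforms in place like the original):

void TransformVertices(ESMatrix* m, GLfloat* vertices, GLfloat* normals, int count)
{
    // pull the matrix into locals once so gcc can keep it in registers
    const GLfloat m00 = m->m00, m01 = m->m01, m02 = m->m02, m03 = m->m03;
    const GLfloat m10 = m->m10, m11 = m->m11, m12 = m->m12, m13 = m->m13;
    const GLfloat m20 = m->m20, m21 = m->m21, m22 = m->m22, m23 = m->m23;
    const GLfloat m30 = m->m30, m31 = m->m31, m32 = m->m32, m33 = m->m33;

    GLfloat* v = vertices;   // normals unused here, as in the question's loop
    int i;

    for (i = 0; i < count; i++, v += 4)
    {
        const GLfloat x = v[0], y = v[1], z = v[2], w = v[3];
        v[0] = m00 * x + m10 * y + m20 * z + m30 * w;
        v[1] = m01 * x + m11 * y + m21 * z + m31 * w;
        v[2] = m02 * x + m12 * y + m22 * z + m32 * w;
        v[3] = m03 * x + m13 * y + m23 * z + m33 * w;
    }
}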

When I do this, I get the following for the main loop.

  a4:   ed567a03    vldr    s15, [r6, #-12]
  a8:   ee276aa0    vmul.f32    s12, s15, s1
  ac:   ee676aa8    vmul.f32    s13, s15, s17
  b0:   ed564a04    vldr    s9, [r6, #-16]
  b4:   ee277a88    vmul.f32    s14, s15, s16
  b8:   ed165a02    vldr    s10, [r6, #-8]
  bc:   ee677a80    vmul.f32    s15, s15, s0
  c0:   ed565a01    vldr    s11, [r6, #-4]
  c4:   e2833001    add r3, r3, #1
  c8:   ee046a89    vmla.f32    s12, s9, s18
  cc:   e1530004    cmp r3, r4
  d0:   ee446aaa    vmla.f32    s13, s9, s21
  d4:   ee047a8a    vmla.f32    s14, s9, s20
  d8:   ee447aa9    vmla.f32    s15, s9, s19
  dc:   ee056a22    vmla.f32    s12, s10, s5
  e0:   ee456a01    vmla.f32    s13, s10, s2
  e4:   ee057a21    vmla.f32    s14, s10, s3
  e8:   ee457a02    vmla.f32    s15, s10, s4
  ec:   ee056a8b    vmla.f32    s12, s11, s22
  f0:   ee456a83    vmla.f32    s13, s11, s6
  f4:   ee057aa3    vmla.f32    s14, s11, s7
  f8:   ee457a84    vmla.f32    s15, s11, s8
  fc:   ed066a01    vstr    s12, [r6, #-4]
 100:   ed466a04    vstr    s13, [r6, #-16]
 104:   ed067a03    vstr    s14, [r6, #-12]
 108:   ed467a02    vstr    s15, [r6, #-8]
 10c:   e2866010    add r6, r6, #16
 110:   1affffe3    bne a4 <TransformVertices+0xa4>

That is 4 loads, 4 multiplies, 12 multiply-accumulates and 4 stores, which matches what you are doing in Matrix4Vector4Mul.

If you are still not satisfied with the compiler-generated code, pass '-S' to the compiler to get the assembly output and use it as a starting point for further improvement instead of starting from scratch.

You should also check that vertices is aligned to the cache line size (32 bytes on Cortex-A9) to get a nice data flow.

For vectorization there are gcc options such as -ftree-vectorizer-verbose=9 that print information about what was vectorized. Also search the gcc documentation to see how you can direct gcc, or what you need to modify, to get your multiplications vectorized. This might sound like a lot to dig into, but in the long run it will be more fruitful for you than hand vectorizing.

The hand-tuned NEON version suffers from dependencies between all of the operations, while gcc is able to do out-of-order scheduling for the C version. You should be able to improve the NEON version by calculating two or more independent threads of computation in parallel:

Pointer post-increment in NEON is done with an exclamation mark. Those registers should then be included in the output operand list, "=r" (vOut):

vld1.32 {d0,d1}, [%2]!   ; // next round %2=%2 + 16 
vst1.32 {d0},    [%3]!   ; // next round %3=%3 + 8

Another addressing mode allows a post-increment by a "stride" held in another ARM register. The option is available only on some load instructions (as there are a variety of interleaving options as well as loads to chosen elements of, say, d1[1] (the upper part)).

vld1.16 d0, [%2], %3    ; // increment by register %3

The loop counter is handled with the sequence

1: subs %3, %3, #1      ; // with "=r" (count) as fourth argument
bne 1b                  ; // create a local label

A local label is used, since two "bne loop" statements in the same file would cause an error.

One should be able to increase parallelism by a factor of four by calculating the fused multiply-adds on whole vectors instead of single elements.

In this case it's worthwhile to transpose the matrix in advance (either before calling the routine or with a special addressing mode).
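
If you go the "before calling the routine" route, a plain C transpose is enough; a small sketch (the in-register lane loads in the code below achieve the same effect, so it is one or the other):

void TransposeMatrix4(const float* m, float* mT)
{
    int r, c;
    // mT[r][c] = m[c][r], both matrices stored as 16 consecutive floats
    for (r = 0; r < 4; r++)
        for (c = 0; c < 4; c++)
            mT[r * 4 + c] = m[c * 4 + r];
}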

asm(
   "vld1.32 {d0[0],d2[0],d4[0],d6[0]}, [%0]! \n\t"
   "vld1.32 {d0[1],d2[1],d4[1],d6[1]}, [%0]! \n\t"
   "vld1.32 {d1[0],d3[0],d5[0],d7[0]}, [%0]! \n\t"
   "vld1.32 {d1[1],d3[1],d5[1],d7[1]}, [%0]! \n\t"

   "vld1.32 {q8}, [%2:128]! \n\t"
   "vld1.32 {q9}, [%2:128]! \n\t"
   "vld1.32 {q10}, [%2:128]! \n\t"
   "vld1.32 {q11}, [%2:128]! \n\t"

   "subs %0, %0, %0 \n\t"   // set zero flag

   "1: \n\t"
   "vst1.32 {q4}, [%1:128]! \n\t"
   "vmul.f32 q4, q8, q0 \n\t"
   "vst1.32 {q5}, [%1:128]! \n\t"
   "vmul.f32 q5, q9, q0 \n\t"
   "vst1.32 {q6}, [%1:128]! \n\t"
   "vmul.f32 q6, q10, q0 \n\t"
   "vst1.32 {q7}, [%1:128]!  \n\t"
   "vmul.f32 q7, q11, q0 \n\t"

   "subne %1, %1, #64    \n\t"    // revert writing pointer in 1st iteration 

   "vmla.f32 q4, q8, q1 \n\t"
   "vmla.f32 q5, q9, q1 \n\t"
   "vmla.f32 q6, q10, q1 \n\t"
   "vmla.f32 q7, q11, q1 \n\t"
   "subs %3, %3, #1 \n\t"         // decrement the loop count (%3 is N)
   "vmla.f32 q4, q8, q2 \n\t"
   "vmla.f32 q5, q9, q2 \n\t"
   "vmla.f32 q6, q10, q2 \n\t"
   "vmla.f32 q7, q11, q2 \n\t"

   "vmla.f32 q4, q8, q3 \n\t"
   "vld1.32 {q8}, [%2:128]! \n\t"  // start loading vectors immediately
   "vmla.f32 q5, q9, q3 \n\t"
   "vld1.32 {q9}, [%2:128]! \n\t"  // when all arithmetic is done
   "vmla.f32 q6, q10, q3 \n\t"
   "vld1.32 {q10}, [%2:128]! \n\t"
   "vmla.f32 q7, q11, q3 \n\t"
   "vld1.32 {q11}, [%2:128]! \n\t"
   "bne 1b \n\t"
   "vst1.32 {q4,q5}, [%1:128]! \n\t"  // write after first loop
   "vst1.32 {q6,q7}, [%1:128]! \n\t"
 : "=r" (m), "=r" (vOut), "=r" (vIn), "=r" (N)
 :
 : "d0","d1","q0", ... ); // marking q0 isn't enough for some gcc version 

Read and write 128-bit aligned blocks (make sure the data pointer is aligned too); there is an aligned malloc (memalign/posix_memalign), or just adjust the pointer manually: ptr = ((int)ptr + 15) & ~15.
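
A small sketch of both options (the helper names here are just illustrative):

#include <stdlib.h>
#include <stdint.h>

/* option 1: allocate a 16-byte aligned buffer for `count` 4-float vertices */
float* AllocVerticesAligned(size_t count)
{
    void* p = NULL;
    if (posix_memalign(&p, 16, count * 4 * sizeof(float)) != 0)
        return NULL;               /* allocation failed */
    return (float*)p;
}

/* option 2: round an existing pointer up to the next 16-byte boundary
   (the underlying buffer needs at least 15 bytes of slack) */
float* Align16(float* ptr)
{
    return (float*)(((uintptr_t)ptr + 15) & ~(uintptr_t)15);
}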

Just as there is a post-loop block writing the results, one can write a similar pre-loop block that skips the first write of nonsense to vOut (that could also be handled with a conditional store). Unfortunately, only 64-bit registers can be stored conditionally.

This topic is almost a full year old by now, but I think it's important to give you the "correct" answer, since something is very fishy here and no one has pointed it out so far:

  1. You should avoid using q4-q7 if possible, since they are callee-saved and have to be preserved prior to use.

  2. Correct me if I'm wrong on this, but if my memory isn't failing me, only d0~d3 (or d0~d7) can hold scalars. I'm really wondering why gcc tolerates d10 and d11 as scalar operands. Since it's physically impossible that way, I guess gcc is again doing something crazy with your inline assembly. Check the disassembly of your inline assembly code. (See the sketch after this list.)
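
A sketch of how the single-vector routine could be rewritten with both points in mind - the matrix goes into q8-q11 (no callee-saved registers) and the input vector into q0, so the scalar operands sit in d0/d1 (untested; assumes the same column-major layout as the question's code):

void Matrix4Vector4Mul(const float* m, const float* vIn, float* vOut)
{
    asm volatile
    (
    "vldmia %2, {q0}         \n\t"   // input vector in q0 (d0, d1)
    "vldmia %1, {q8-q11}     \n\t"   // matrix in q8-q11, no callee-saved registers

    "vmul.f32 q1, q8,  d0[0] \n\t"   // q1  = col0 * vIn.x
    "vmla.f32 q1, q9,  d0[1] \n\t"   // q1 += col1 * vIn.y
    "vmla.f32 q1, q10, d1[0] \n\t"   // q1 += col2 * vIn.z
    "vmla.f32 q1, q11, d1[1] \n\t"   // q1 += col3 * vIn.w

    "vstmia %0, {q1}"
    : // no output
    : "r" (vOut), "r" (m), "r" (vIn)
    : "memory", "q0", "q1", "q8", "q9", "q10", "q11"
    );
}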

True, your inline assembly code suffers from two interlocks (2 cycles after the load and 9 cycles before the store), but it's unimaginable to me that the NEON code runs slower than the C code.

It's a really strong guess on my part that gcc does some heavy transferring of registers back and forth instead of spitting out an error message, and that isn't exactly doing you a favor in this case.
