ARM Assembler NEON - Increasing performance

I have converted part of an algorithm from C to ARM assembler (using NEON instructions), but it is now 2x slower than the original C code. How can I improve performance?

The target is an ARM Cortex-A9.

The algorithm reads 64-bit values from an array. One byte is extracted from each value and used as the index into another table. This is done about 10 times; the resulting table values are XORed together and the final result is written into another array.

Something like this:

result[i] = T0[ GetByte0( a[i1] ) ] ^ T1[ GetByte1( a[i2] ) ] ^ ... ^ T10[ (...) ];
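For reference, the expression above can be written as a small plain-C routine. This is only a sketch: the question does not say which element of a[] and which byte feed each table, so the lookup_step struct and the names get_byte and xor_lookups are hypothetical, and each table is assumed to have 256 entries of 64 bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Extract byte number n (0 = least significant) from a 64-bit value. */
static inline uint8_t get_byte(uint64_t v, unsigned n)
{
    assert(n < 8);
    return (uint8_t)(v >> (8 * n));
}

/* Hypothetical description of one lookup: which table to use, which
 * element of a[] to read, and which byte of that element is the index. */
typedef struct {
    const uint64_t *table;   /* assumed to hold 256 entries */
    size_t          src;     /* index into a[] */
    unsigned        byte;    /* byte position, 0..7 */
} lookup_step;

/* XOR together the table entries selected by every step. */
static uint64_t xor_lookups(const uint64_t *a,
                            const lookup_step *steps, size_t n_steps)
{
    uint64_t acc = 0;
    for (size_t k = 0; k < n_steps; ++k) {
        uint8_t idx = get_byte(a[steps[k].src], steps[k].byte);
        acc ^= steps[k].table[idx];
    }
    return acc;
}
```

A version like this is a useful baseline: it is what the compiler already optimizes well, and any assembler rewrite can be checked against it.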

In my approach I load the whole array "a" into NEON registers, then move the relevant byte into an ARM register, calculate the offset, and load the value from the table:

vldm.64 r0, {d0-d7}         //Load 8x64Bit from the input array

vmov.u8 r12, d0[0]          //Mov the first Byte from d0 into r12
add r12, r2, r12, asl #3    // r12 = base_address + (r12 << 3)
vldr.64 d8, [r12]           // d8 = mem[r12]
.
.
.
veor d8, d8, d9             // d8 = d8 ^ d9
veor d8, d8, d10            // d8 = d8 ^ d10, etc.

Here r2 holds the base address of the lookup table.

address = Table_address + (8 * value_fromByte);
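In C terms, the address calculation above is ordinary indexing into an array of 64-bit entries. A minimal sketch, where entry_address is a hypothetical helper name and the table is assumed to consist of uint64_t entries:

```c
#include <assert.h>
#include <stdint.h>

/* The shift by 3 below relies on each table entry being 8 bytes wide. */
static_assert(sizeof(uint64_t) == 8, "table entries must be 8 bytes");

/* Same arithmetic as "add r12, r2, r12, asl #3":
 * address = table base + (byte value << 3). */
static const uint64_t *entry_address(const uint64_t *table_base,
                                     uint8_t byte_value)
{
    uintptr_t addr = (uintptr_t)table_base + ((uintptr_t)byte_value << 3);
    return (const uint64_t *)addr;
}
```

The result is the same pointer as &table_base[byte_value], so a C compiler can fold the shift into the addressing of a normal array access.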

This step (except for the load at the beginning) is done about 100 times. Why is this so slow?

Also, what are the differences between "vld", "vldr" and "vldm", and which one is the fastest? How can I perform the offset calculation entirely within NEON registers? Thank you.

NEON isn't very capable of dealing with lookups larger than the VTBL instruction's limit (32 bytes, if I remember correctly).
How is the lookup table created in the first place? If it is just the result of calculations, let NEON do the math instead of resorting to lookups. It will be much faster that way.
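To make the VTBL limit concrete, here is a scalar C model of what the instruction does with a table of up to 32 bytes (four D registers): each of the 8 index bytes selects one table byte, and an out-of-range index yields 0. The name vtbl_model is hypothetical and this is a model of the semantics, not NEON code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { VTBL_MAX_TABLE_BYTES = 32, D_REG_BYTES = 8 };

/* Scalar model of VTBL: a byte-wise lookup into a table of at most
 * 32 bytes. An index at or beyond the table length produces 0. */
static void vtbl_model(uint8_t out[D_REG_BYTES],
                       const uint8_t *table, size_t table_len,
                       const uint8_t idx[D_REG_BYTES])
{
    assert(table_len <= VTBL_MAX_TABLE_BYTES);
    for (size_t i = 0; i < D_REG_BYTES; ++i)
        out[i] = (idx[i] < table_len) ? table[idx[i]] : 0;
}
```

If the index byte in the question spans 0..255, each table of 8-byte entries occupies 256 * 8 = 2048 bytes, far more than a single VTBL can address.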

Don't use

vmov.u8 r12, d0[0]

Moving data from a NEON register to an ARM register is the worst thing you can do.

Maybe you should look at the VTBL instruction! What is your byte range, 0..255?

Maybe you can try

ldrb     r12, [r0], #1
add      r3, r2, r12, asl #3
vld1.64  {d0}, [r3]

ldrb     r12, [r0], #1
add      r3, r2, r12, asl #3
vld1.64  {d1}, [r3]
veor     d0, d0, d1         // d0 = d0 ^ d1

ldrb     r12, [r0], #1
add      r3, r2, r12, asl #3
vld1.64  {d1}, [r3]
veor     d0, d0, d1         // d0 = d0 ^ d1

...

That will not be the best solution. After that, you can increase performance by reordering the instructions.
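The same idea in C, as a rough sketch: the index bytes are read straight from memory (as ldrb does), so no NEON-to-ARM transfer is needed. Like the snippet above, it assumes consecutive index bytes and a single table of 256 64-bit entries. The function name is hypothetical, and the two accumulators are just one way to give the compiler or CPU independent XOR chains to interleave.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* XOR the table entries selected by n consecutive index bytes.
 * Two accumulators keep the XOR chains independent of each other,
 * which leaves room to overlap one lookup's load with another's XOR. */
static uint64_t xor_lookups_from_bytes(const uint8_t *bytes, size_t n,
                                       const uint64_t *table)
{
    assert(bytes != NULL || n == 0);
    uint64_t acc0 = 0, acc1 = 0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        acc0 ^= table[bytes[i]];
        acc1 ^= table[bytes[i + 1]];
    }
    if (i < n)                  /* odd count: one entry left over */
        acc0 ^= table[bytes[i]];
    return acc0 ^ acc1;
}
```

Whether this beats the hand-written version depends on the compiler and on how the tables sit in the cache, so it is worth measuring on the actual target.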

Try it with NEON "intrinsics". Basically they're C functions that compile down to NEON instructions. The compiler still gets to do all the instruction scheduling, and you get the other boring stuff (moving data about) for free.

It doesn't always work perfectly, but it might be better than trying to hand-code it.

Look for arm_neon.h.
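A minimal sketch of what the intrinsics version might look like, assuming a single table of 256 64-bit entries and index bytes that are already in memory. The function name xor_lookups_store and the HAVE_NEON macro are hypothetical; vdup_n_u64, vld1_u64, veor_u64 and vst1_u64 are from arm_neon.h. A plain-C fallback is included so the code also builds where NEON is not available.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* XOR the table entries selected by n index bytes and store the
 * 64-bit result. The table is assumed to hold 256 entries. */
static void xor_lookups_store(uint64_t *result,
                              const uint8_t *bytes, size_t n,
                              const uint64_t *table)
{
    assert(result != NULL);
#if HAVE_NEON
    uint64x1_t acc = vdup_n_u64(0);              /* accumulator in a D register */
    for (size_t i = 0; i < n; ++i)
        acc = veor_u64(acc, vld1_u64(&table[bytes[i]]));
    vst1_u64(result, acc);
#else
    uint64_t acc = 0;                            /* scalar fallback */
    for (size_t i = 0; i < n; ++i)
        acc ^= table[bytes[i]];
    *result = acc;
#endif
}
```

The index stays in an ARM register the whole time; only the loaded table entry and the running XOR live in NEON registers, and the compiler handles register allocation and scheduling.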
