popcount in ARM assembly without NEON
I have read this as well as the wiki, and I understand that the following code should result in 12 instructions in asm.
i = i - ((i >> 1) & 0x55555555); // add pairs of bits
i = (i & 0x33333333) + ((i >> 2) & 0x33333333); // quads
i = (i + (i >> 4)) & 0x0F0F0F0F; // groups of 8
return (i * 0x01010101) >> 24; // horizontal sum of bytes
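For reference, the four lines above can be wrapped into a self-contained C function. This is only a sketch: the name `popcount32` and the choice of `uint32_t` are assumptions made for illustration and are not part of the original snippet.

```c
#include <stdint.h>

/* Hypothetical wrapper around the four steps quoted above. */
uint32_t popcount32(uint32_t i)
{
    i = i - ((i >> 1) & 0x55555555u);                  /* count per 2-bit field */
    i = (i & 0x33333333u) + ((i >> 2) & 0x33333333u);  /* count per 4-bit field */
    i = (i + (i >> 4)) & 0x0F0F0F0Fu;                  /* count per byte */
    return (i * 0x01010101u) >> 24;                    /* sum of the four bytes */
}
```

Counting the operations gives 3 + 4 + 3 + 2 = 12 (shift, and, subtract; and, shift, and, add; shift, add, and; multiply, shift). That matches the figure of 12 only if each constant can be used without a separate load instruction.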
I am currently solving this problem, and my best correct solution so far executes 190 instructions in total over 10 tests (19 per test).
// A test case to test your function with
.global _start
_start:
mov r0, #5
bl popcount
1: b 1b // Done
// Only your function (starting at popcount) is judged. The test code above is not executed.
popcount:
PUSH {r1,r2, r3, r4, r5, r6, r7, r8, r9, r10}
ldr r2, =0x55555555 // 2bits
ldr r3, =0x33333333 // 4bits
ldr r4, =0x0F0F0F0F // 8bits
ldr r6, =0x01010101 // 8bits
//x -= (x >> 1) & m1; //put count of each 2 bits into those 2 bits
lsr r1, r0, #0x1 // one shift
and r1, r1, r2
sub r0, r0, r1
//x = (x & m2) + ((x >> 2) & m2); //put count of each 4 bits into those 4 bits
lsr r1, r0, #0x2 // two shift
and r1, r1, r3
and r5, r0, r3
add r0, r5, r1
//x = (x + (x >> 4)) & m4; //put count of each 8 bits into those 8 bits
lsr r1, r0, #0x4 // shift by four
add r1, r1, r0
and r0, r1, r4
//return (i * 0x01010101) >> 24;
mul r0, r0, r6
lsr r0, r0, #0x18
POP {r1,r2, r3, r4, r5, r6, r7, r8, r9, r10}
bx lr
The best running score so far is 120 instructions. But here the problems are:
These are precisely the 7 extra instructions per test.
How would you proceed to get only 12 instructions executed?
movw r1, #0x5555
movw r2, #0x3333
movt r1, #0x5555
movt r2, #0x3333
and r3, r0, r1
bic r12, r0, r1
add r0, r3, r12, lsr #1
movw r1, #0x0f0f
and r3, r0, r2
bic r12, r0, r2
add r0, r3, r12, lsr #2
movt r1, #0x0f0f
add r0, r0, r0, lsr #4
mov r2, #0
and r0, r0, r1
usad8 r0, r0, r2
bx lr
`and` and `bic` with the same mask split each group into its low and high halves, and the shift is folded into the `add` operand, so no separate shift instruction is needed before adding. `usad8` (unsigned sum of absolute differences, 8 bits) with zero does the trick of adding the 4 bytes. The routine above takes 12 cycles on Cortex-A8 and 15 cycles on the weaker Cortex-A7. Scheduling is the key, so that as many instructions as possible get dual-issued.
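The data flow of the routine above can be modelled in C to check that it computes the same result as the multiply-based version. This is a sketch under stated assumptions: `byte_sum` and `popcount_model` are hypothetical names, and `byte_sum` only imitates what `usad8` yields when its second operand is zero. It says nothing about cycle counts or dual issue.

```c
#include <stdint.h>

/* Models `usad8 rd, rn, rm` with rm = 0: |byte - 0| summed over 4 bytes. */
static uint32_t byte_sum(uint32_t x)
{
    return (x & 0xFFu) + ((x >> 8) & 0xFFu) + ((x >> 16) & 0xFFu) + (x >> 24);
}

/* Hypothetical C model of the and/bic/add sequence followed by usad8. */
uint32_t popcount_model(uint32_t x)
{
    const uint32_t m1 = 0x55555555u;
    const uint32_t m2 = 0x33333333u;
    const uint32_t m4 = 0x0F0F0F0Fu;

    x = (x & m1) + ((x & ~m1) >> 1);  /* and, bic, add ..., lsr #1 */
    x = (x & m2) + ((x & ~m2) >> 2);  /* and, bic, add ..., lsr #2 */
    x = (x + (x >> 4)) & m4;          /* add ..., lsr #4, then and */
    return byte_sum(x);               /* usad8 against a zero register */
}
```

Because `bic` clears exactly the bits that `and` keeps, one mask register serves both halves of each step.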
It can be rewritten in 15 instructions like this:
popcount:
PUSH {r1,r2, r3, r4, r5, r6, r7, r8, r9, r10}
//x -= (x >> 1) & m1; //put count of each 2 bits into those 2 bits
lsr r1, r0, #0x1 // one shift
and r1, r1, #0x55555555
sub r0, r0, r1
//x = (x & m2) + ((x >> 2) & m2); //put count of each 4 bits into those 4 bits
lsr r1, r0, #0x2 // two shift
and r1, r1, #0x33333333
and r5, r0, #0x33333333
add r0, r5, r1
//x = (x + (x >> 4)) & m4; //put count of each 8 bits into those 8 bits
lsr r1, r0, #0x4 // shift by four
add r1, r1, r0
and r0, r1, #0x0F0F0F0F
//return (i * 0x01010101) >> 24;
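// mul has no immediate form, so the multiplier is loaded into r1 first (r1 is free here);
// this costs one extra instruction. Like the masks above, 0x01010101 needs a Thumb-2 modified immediate.
mov r1, #0x01010101
mul r0, r0, r1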
lsr r0, r0, #0x18
POP {r1,r2, r3, r4, r5, r6, r7, r8, r9, r10}
bx lr