popcount in ARM assembly without NEON

I have read this as well as the wiki, and I understand that the following code should compile to 12 instructions in asm.

 i = i - ((i >> 1) & 0x55555555);        // add pairs of bits
 i = (i & 0x33333333) + ((i >> 2) & 0x33333333);  // quads
 i = (i + (i >> 4)) & 0x0F0F0F0F;        // groups of 8
 return (i * 0x01010101) >> 24;          // horizontal sum of bytes
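For reference, here is a minimal sketch of the same four lines wrapped in a C function and checked against a bit-by-bit loop. The names `popcount_swar`, `popcount_naive` and `check_popcount_swar` are illustrative only and are not part of the problem statement.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The four lines from the question, wrapped in a function. */
uint32_t popcount_swar(uint32_t i)
{
    i = i - ((i >> 1) & 0x55555555u);                  /* 2-bit counts */
    i = (i & 0x33333333u) + ((i >> 2) & 0x33333333u);  /* 4-bit counts */
    i = (i + (i >> 4)) & 0x0F0F0F0Fu;                  /* 8-bit counts */
    return (i * 0x01010101u) >> 24;                    /* sum of the four bytes */
}

/* Reference implementation: test one bit at a time. */
uint32_t popcount_naive(uint32_t i)
{
    uint32_t n = 0;
    while (i != 0) {
        n += i & 1u;
        i >>= 1;
    }
    return n;
}

/* Compare both implementations on a few hand-picked inputs. */
void check_popcount_swar(void)
{
    const uint32_t samples[] = {
        0u, 1u, 5u, 0xFFu, 0x80000000u, 0xDEADBEEFu, 0xFFFFFFFFu
    };
    for (size_t k = 0; k < sizeof samples / sizeof samples[0]; ++k) {
        assert(popcount_swar(samples[k]) == popcount_naive(samples[k]));
    }
}
```

Calling `check_popcount_swar()` from a `main` is enough to exercise it; the input 5 used in the test harness further down has two bits set.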

I am currently solving this problem, and my best correct solution so far executes 190 instructions in total across 10 tests (19 per test).

// A test case to test your function with
.global _start
_start:
    mov r0, #5
    bl popcount
    1: b 1b    // Done

// Only your function (starting at popcount) is judged. The test code above is not executed.
    
popcount:

    PUSH    {r1,r2, r3, r4, r5, r6, r7, r8, r9, r10}
    
    ldr r2, =0x55555555 // 2bits
    ldr r3, =0x33333333 // 4bits
    ldr r4, =0x0F0F0F0F // 8bits
    ldr r6, =0x01010101 // 8bits
    
    
    //x -= (x >> 1) & m1;             //put count of each 2 bits into those 2 bits
    lsr r1, r0, #0x1 // one shift
    and r1, r1, r2
    sub r0, r0, r1

    //x = (x & m2) + ((x >> 2) & m2); //put count of each 4 bits into those 4 bits 
    lsr r1, r0, #0x2 // two shift
    and r1, r1, r3
    and r5, r0, r3  
    add r0, r5, r1
    
    //x = (x + (x >> 4)) & m4;        //put count of each 8 bits into those 8 bits 
    lsr r1, r0, #0x4 // shift by four
    add r1, r1, r0
    and r0, r1, r4

    //return (i * 0x01010101) >> 24;
    mul r0, r0, r6
    lsr r0, r0, #0x18

    POP    {r1,r2, r3, r4, r5, r6, r7, r8, r9, r10}
    
    bx lr
    

The best score so far is 120 instructions. The problems with my solution are:

  1. I need to push/pop so that I do not clobber registers (2 instructions)
  2. I need the LDR instructions because the constants are too big for an immediate (4 instructions), and no rotated immediate can encode them either.
  3. I need to return from the function (1 instruction)
  4. NEON SIMD instructions are not available (I tried)

These account for exactly the 7 extra instructions per test.
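To make point 2 concrete: an ARM-mode (A32) data-processing immediate is an 8-bit value rotated right by an even amount. The sketch below tests a constant against that rule. The function names are made up for illustration, and the check models only the A32 encoding, not Thumb-2.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Rotate left by r bits, 0 <= r <= 31. */
static uint32_t rol32(uint32_t v, unsigned r)
{
    return r ? (v << r) | (v >> (32u - r)) : v;
}

/* True if v equals an 8-bit value rotated right by an even amount,
 * i.e. it fits an A32 data-processing immediate. */
bool is_a32_immediate(uint32_t v)
{
    for (unsigned rot = 0; rot < 32; rot += 2) {
        /* Undo a rotate-right by rot and see whether 8 bits remain. */
        if (rol32(v, rot) <= 0xFFu) {
            return true;
        }
    }
    return false;
}

/* The four masks from the question all fail the test. */
void check_question_constants(void)
{
    assert(is_a32_immediate(0xFFu));
    assert(is_a32_immediate(0xFF000000u));
    assert(!is_a32_immediate(0x55555555u));
    assert(!is_a32_immediate(0x33333333u));
    assert(!is_a32_immediate(0x0F0F0F0Fu));
    assert(!is_a32_immediate(0x01010101u));
}
```

Each mask has set bits spread over all four bytes, so no single 8-bit window can cover it, which is why the listing falls back to literal-pool loads.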



How would you proceed to get only 12 instructions executed?

movw    r1, #0x5555
movw    r2, #0x3333
movt    r1, #0x5555
movt    r2, #0x3333

and     r3, r0, r1
bic     r12, r0, r1

add     r0, r3, r12, lsr #1
movw    r1, #0x0f0f

and     r3, r0, r2
bic     r12, r0, r2
add     r0, r3, r12, lsr #2
movt    r1, #0x0f0f

add     r0, r0, r0, lsr #4
mov     r2, #0
and     r0, r0, r1
usad8   r0, r0, r2

bx      lr  
  • Each 4-bit field holds a count of at most 4, so adding two of them gives at most 8, which still fits in 4 bits. That is why no and operation is needed before the add.
  • usad8 (unsigned sum of absolute differences, 8-bit) against zero does the trick of summing the 4 bytes.
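The two points above can be checked with a small C model of the routine. This is only a sketch: `usad8_zero` and `popcount_and_bic` are hypothetical names, and `usad8_zero` models just the special case where the second operand of `usad8` is zero.

```c
#include <assert.h>
#include <stdint.h>

/* Models "usad8 rd, rn, rm" with rm == 0: |byte - 0| summed over
 * the four bytes, which is simply the sum of the bytes. */
uint32_t usad8_zero(uint32_t x)
{
    return (x & 0xFFu) + ((x >> 8) & 0xFFu) + ((x >> 16) & 0xFFu) + (x >> 24);
}

/* C model of the and/bic/add sequence from the listing above. */
uint32_t popcount_and_bic(uint32_t x)
{
    const uint32_t m1 = 0x55555555u;
    const uint32_t m2 = 0x33333333u;
    const uint32_t m4 = 0x0F0F0F0Fu;

    x = (x & m1) + ((x & ~m1) >> 1);  /* and, bic, add ..., lsr #1 */
    x = (x & m2) + ((x & ~m2) >> 2);  /* and, bic, add ..., lsr #2 */
    x = x + (x >> 4);                 /* each field <= 4, so sums <= 8 fit in 4 bits */
    x &= m4;                          /* drop the high nibble of every byte */
    return usad8_zero(x);             /* usad8 against a zero register */
}

/* Spot checks, including the all-ones input where every field is at its maximum. */
void check_popcount_and_bic(void)
{
    assert(usad8_zero(0x01020304u) == 10u);
    assert(popcount_and_bic(0u) == 0u);
    assert(popcount_and_bic(5u) == 2u);
    assert(popcount_and_bic(0xFFFFFFFFu) == 32u);
}
```

The all-ones input is the worst case for the unmasked add: every nibble holds 4 before the add and the low nibble of every byte holds 8 after the mask, with no carry into the neighbouring byte.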

The routine above takes 12 cycles on Cortex-A8 and 15 cycles on the weaker Cortex-A7. Scheduling is the key: the instructions are ordered so that as many as possible get dual-issued.

When assembled as Thumb-2, where repeating-byte constants such as 0x55555555 are valid immediates, the literal loads can be dropped. mul has no immediate form, so its constant still needs one mov, which gives 16 instructions:

popcount:

PUSH    {r1,r2, r3, r4, r5, r6, r7, r8, r9, r10}     

//x -= (x >> 1) & m1;             //put count of each 2 bits into those 2 bits
lsr r1, r0, #0x1 // one shift
and r1, r1, #0x55555555
sub r0, r0, r1

//x = (x & m2) + ((x >> 2) & m2); //put count of each 4 bits into those 4 bits 
lsr r1, r0, #0x2 // two shift
and r1, r1, #0x33333333
and r5, r0, #0x33333333
add r0, r5, r1

//x = (x + (x >> 4)) & m4;        //put count of each 8 bits into those 8 bits 
lsr r1, r0, #0x4 // shift by four
add r1, r1, r0
and r0, r1, #0x0F0F0F0F

//return (i * 0x01010101) >> 24;
mov r1, #0x01010101 // mul takes only register operands
mul r0, r0, r1
lsr r0, r0, #0x18

POP    {r1,r2, r3, r4, r5, r6, r7, r8, r9, r10}

bx lr
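Whether the mask immediates in this listing assemble depends on the instruction set. ARM mode (A32) cannot encode them, while Thumb-2 "modified immediate" constants additionally allow the repeating patterns 0x00XY00XY, 0xXY00XY00 and 0xXYXYXYXY. The sketch below models that Thumb-2 rule as I understand it from the architecture manual; the function name is made up for illustration, and it says nothing about mul, which takes register operands only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Rotate left by r bits, 0 <= r <= 31. */
static uint32_t rol32(uint32_t v, unsigned r)
{
    return r ? (v << r) | (v >> (32u - r)) : v;
}

/* True if v fits a Thumb-2 modified immediate constant. */
bool is_thumb2_modified_immediate(uint32_t v)
{
    uint32_t lo = v & 0xFFu;         /* candidate byte XY, low position */
    uint32_t hi = (v >> 8) & 0xFFu;  /* candidate byte XY, second position */

    if (v == lo) return true;                /* 0x000000XY */
    if (v == lo * 0x00010001u) return true;  /* 0x00XY00XY */
    if (v == hi * 0x01000100u) return true;  /* 0xXY00XY00 */
    if (v == lo * 0x01010101u) return true;  /* 0xXYXYXYXY */

    /* Otherwise: an 8-bit value with its top bit set, rotated right by 8..31. */
    for (unsigned rot = 8; rot < 32; ++rot) {
        uint32_t u = rol32(v, rot);
        if (u >= 0x80u && u <= 0xFFu) {
            return true;
        }
    }
    return false;
}

/* All four popcount constants match the repeating-byte pattern. */
void check_popcount_masks(void)
{
    assert(is_thumb2_modified_immediate(0x55555555u));
    assert(is_thumb2_modified_immediate(0x33333333u));
    assert(is_thumb2_modified_immediate(0x0F0F0F0Fu));
    assert(is_thumb2_modified_immediate(0x01010101u));
    assert(!is_thumb2_modified_immediate(0x12345678u));
}
```

If the judge runs the code in ARM mode, these immediates are rejected by the assembler and the movw/movt pairs or literal loads remain necessary.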
