
AVR assembly - bit number to mask

In my ATtiny84a AVR assembly program I end up with a bit number between 0 and 7 in a register, let's say r16. Now I need to create a mask with that bit set. To make it more complicated, the timing of the operation must be the same regardless of which bit is set.

For example if r16 = 5 the resulting mask will be 0x20 (bit 5 set).

So far I have shifted a bit into position with LSL, using r16 (the bit number) as a loop counter. Then, to keep the timing exact regardless of the bit number, I run a dummy loop of NOPs 8-r16 times.

The assembly instruction SBR sets bit(s) in a register from a mask, so it can't be used. The assembly instruction SBI sets a bit in an I/O register from a bit number, but that bit number must be a constant, not a register (otherwise I could have used an I/O register as a temp register).

The mask is then used to clear a bit in a memory location, so if there is another solution to do that from a bit number in a register, then it's fine too.

I have another solution to try out (shift based, with carry), but I was hoping that someone has a more elegant solution than loops and shifts.

I think your hunch with shifts and carries is an elegant solution. You'd basically decrement the index register, set the carry when the index was zero before the decrement, and then shift the carry into the output register.

You can use subtract to do the decrement, which will automatically set the carry bit when the subtraction borrows, that is, when the index was 0 and steps below it.

You can use a rotate right instead of the shift, since this lets you move the bits in the right direction to match the decrement.

Then you can get really tricky and use a sentinel bit in the output as a pseudo loop counter to terminate after 8 loop iterations.

So something like...

; Assume r16 is the index 0-7 of the bit to set in the output byte
; Assume r17 is the output byte
; r17 output will be 0 if r16 input is out of bounds
; r16 is clobbered in the process (ends up as r16-8)

ldi r17, 0b10000000 ; Sort of a pseudo-counter. When we see this
                    ; marker bit fall off the right end
                    ; then we know we did 8 bits of rotations

loop:
subi r16,1  ; decrement index by 1, carry is set if r16 was 0
ror r17     ; rotate output right, carry into the high bit
brcc loop   ; continue until we see our marker bit come out

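I count 4 words (8 bytes) of storage and 32 cycles for this operation on all AVRs: 1 for the ldi, 4 for each of the seven iterations where brcc is taken, and 3 for the final iteration where it falls through. So I think this is the winner on size, surprisingly (even to me), beating out the strong field of lookup-table based entries.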

It also features sensible handling of out-of-bounds conditions, and no registers are changed besides the input and output. The repetitive rotates will also help prevent carbon deposit buildup in the ALU shifter gates.
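If you want to convince yourself of the loop without a board at hand, here is a minimal host-side C model of it. This is a sketch, not AVR code: the function name `mask_by_rotate` is made up for illustration, and the variables `r16`, `r17` and `carry` simply mirror the registers and the C flag. It only checks the values and the iteration count, not real cycle timing.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical host-side model of the ldi/subi/ror/brcc loop (not AVR code). */
uint8_t mask_by_rotate(uint8_t r16)
{
    uint8_t r17 = 0x80;              /* ldi r17, 0b10000000 (marker bit) */
    unsigned carry;
    unsigned iterations = 0;

    do {
        carry = (r16 == 0);          /* subi r16,1: C is set on borrow, i.e. r16 was 0 */
        r16 = (uint8_t)(r16 - 1);

        unsigned out = r17 & 1u;     /* ror r17: bit 0 leaves into C ...            */
        r17 = (uint8_t)((r17 >> 1) | (carry << 7)); /* ... and old C enters bit 7 */
        carry = out;

        iterations++;
    } while (!carry);                /* brcc loop: repeat until the marker falls out */

    assert(iterations == 8);         /* same number of passes for every input */
    return r17;
}
```

Calling `mask_by_rotate(5)` gives 0x20, and any index of 8 or more gives 0, which matches the out-of-bounds behaviour described above.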

Many thanks to @ReAl and @PeterCordes, whose guidance and inspiration made this code possible. :)

9 words, 9 cycles

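ldi     r17, 1  ; start with bit 0 set

; 4: bit 2 of n contributes a shift by 4
sbrc    r16, 2  ; skip the swap unless n >= 4
swap    r17     ; 00000001 -> 00010000, effectively shift left by 4

; 2: bit 1 of n contributes a shift by 2, done as two single shifts
sbrc    r16, 1
lsl     r17
sbrc    r16, 1
lsl     r17

; 1: bit 0 of n contributes a shift by 1
sbrc    r16, 0
lsl     r17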
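The idea is that n = 4*b2 + 2*b1 + b0, so the total shift can be built from one conditional step per bit of the index. Below is a small host-side C model of that sequence, as a sketch only: `mask_by_skips` is a made-up name, and the C `if` statements stand in for the sbrc skips, so this checks the resulting values and says nothing about timing.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical host-side model of the sbrc/swap/lsl sequence (not AVR code). */
uint8_t mask_by_skips(uint8_t r16)
{
    uint8_t r17 = 1;                               /* ldi r17, 1 */

    if (r16 & 0x04)                                /* sbrc r16,2 / swap r17 */
        r17 = (uint8_t)((r17 << 4) | (r17 >> 4));  /* nibble swap = shift by 4 here */
    if (r16 & 0x02)                                /* sbrc r16,1 / lsl r17 */
        r17 = (uint8_t)(r17 << 1);
    if (r16 & 0x02)                                /* sbrc r16,1 / lsl r17 (second shift) */
        r17 = (uint8_t)(r17 << 1);
    if (r16 & 0x01)                                /* sbrc r16,0 / lsl r17 */
        r17 = (uint8_t)(r17 << 1);

    /* Only bits 0..2 of the index are ever examined. */
    assert(r17 == (uint8_t)(1u << (r16 & 7)));
    return r17;
}
```

Note that, unlike the rotate loop, this version ignores the high bits of the index: in the model, `mask_by_skips(0x0D)` gives the same result as `mask_by_skips(5)`.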

Since your output has only 8 variants, you can use a lookup table. It performs exactly the same operations whatever the input is, and so has exactly the same execution time.

  ldi r30, low(shl_lookup_table * 2) // Load the table address into register Z
  ldi r31, high(shl_lookup_table * 2)

  clr r1 // Make zero

  add r30, r16 // Add our r16 to the address
  adc r31, r1  // Add zero with carry to the upper half of Z

  lpm r17, Z // Load a byte from program memory into r17

  ret // assuming we are in a routine, i.e. call/rcall was performed

...

shl_lookup_table:
  .db 0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80
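The add/adc pair is just a 16-bit addition done in two 8-bit halves; the adc with a zero register is what carries into the high byte when the table happens to straddle a 256-byte boundary. A minimal host-side C model of that address computation follows. It is a sketch under assumptions: `z_plus_index` is a made-up name and the table addresses used with it are hypothetical byte addresses, not anything from a real build.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical host-side model of "add r30,r16 / adc r31,r1" with r1 = 0. */
uint16_t z_plus_index(uint16_t table_byte_addr, uint8_t r16)
{
    uint8_t r30 = (uint8_t)(table_byte_addr & 0xFF);  /* ldi r30, low(...)  */
    uint8_t r31 = (uint8_t)(table_byte_addr >> 8);    /* ldi r31, high(...) */

    unsigned sum = (unsigned)r30 + r16;               /* add r30, r16 */
    unsigned carry = sum >> 8;                        /* C flag: 1 if the low byte wrapped */
    r30 = (uint8_t)sum;
    r31 = (uint8_t)(r31 + carry);                     /* adc r31, r1 (r1 holds zero) */

    uint16_t z = (uint16_t)((r31 << 8) | r30);
    assert(z == (uint16_t)(table_byte_addr + r16));   /* equals a plain 16-bit add */
    return z;
}
```

For example, with a hypothetical table at byte address 0x01FC and index 7, the low byte wraps and the model returns 0x0203 rather than 0x0103.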

An 8-byte-aligned lookup table simplifies indexing and should work well on AVR chips that support lpm (Load from Program Memory). (Optimized from @AterLux's answer.) Aligning the table by 8 means all 8 entries share the same high byte of their address, and adding the index never carries out of the low 3 bits, so we can use ori instead of having to negate the address for subi. (adiw only takes an immediate of 0..63, so it might not be able to represent an address.)

I'm showing the best-case scenario where you can conveniently generate the input in r30 (low half of Z) in the first place, otherwise you need a mov . Also, this becomes too short to be worth calling a function so I'm not showing a ret , just a code fragment.

This assumes the input is valid (in 0..7); consider @ReAl's answer if you need to ignore the high bits, or just add an andi r30, 0x7.

If you can easily reload Z after this, or didn't need it preserved anyway, this is great. If clobbering Z is a problem, or if your AVR doesn't support lpm, you could consider building the table in RAM during initial startup (with a loop) so you can use X or Y as the pointer with a data load instead of lpm.

## gas / clang syntax
### Input:    r30 = 0..7 bit position
### Clobbers: r31.  (addr of a 256-byte chunk of program memory where you might have other tables)
### Result:   r17 = 1 << r30

  ldi   r31, hi8(shl_lookup_table)    // Same high byte for all table elements.  Could be hoisted out of a loop
  ori   r30, lo8(shl_lookup_table)    // Z = table | bitpos  = &table[bitpos] because alignment

  lpm   r17, Z

.section .rodata
.p2align 3        // 8-byte alignment so low 3 bits of addresses match the input.
           // ideally place it where it will be aligned by 256, and drop the ORI
           // but .p2align 8 could waste up to 255 bytes of space!  Use carefully
shl_lookup_table:
  .byte 0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80

If you can locate the table at a 256-byte alignment boundary, then lo8(table) = 0, so you can drop the ori and just use r30 directly as the low byte of the address.
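The reason OR can replace ADD here is that an 8-byte-aligned address has its low 3 bits clear, so OR-ing in an index of 0..7 cannot collide with any set bit. A small host-side C model of the ldi/ori pair is sketched below; `z_by_or` is a made-up name and the addresses passed to it are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical host-side model of "ldi r31,hi8(table) / ori r30,lo8(table)".
   table_addr is an assumed byte address that must be a multiple of 8. */
uint16_t z_by_or(uint16_t table_addr, uint8_t r30)
{
    assert((table_addr & 7u) == 0);   /* what .p2align 3 is meant to guarantee */
    assert(r30 < 8);                  /* input assumed valid, as stated in the text */

    uint8_t r31 = (uint8_t)(table_addr >> 8);       /* ldi r31, hi8(table) */
    r30 = (uint8_t)(r30 | (table_addr & 0xFF));     /* ori r30, lo8(table) */
    return (uint16_t)((r31 << 8) | r30);            /* Z = &table[bitpos] */
}
```

With an aligned base such as 0x01F8, `z_by_or(0x01F8, 7)` is 0x01FF, the same as 0x01F8 + 7. With an unaligned base the shortcut breaks: 0x01FC | 7 is 0x01FF, while 0x01FC + 7 is 0x0203.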

Costs for the version with ori , not including reloading Z with something after, or worse saving/restoring Z . (If Z is precious at the point you need this, consider a different strategy).

  • size = 3 words code + 8 bytes (4 words) data = 7 words . (Plus up to 7 bytes of padding for alignment if you aren't careful about layout of program memory)
  • cycles = 1(ldi) + 1(ori) + 3(lpm) = 5 cycles

In a loop, or if you need other data in the same 256B chunk of program memory, the ldi r31, hi8 can be hoisted / done only once.

If you can align the table by 256, that saves a word of code and a cycle of time. If you also hoist the ldi out of the loop, that leaves just the 3-cycle lpm.

(Untested, I don't have an AVR toolchain other than clang -target avr. I think GAS / clang want just normal symbol references, and handle the symbol * 2 internally. This does assemble successfully with clang -c -target avr -mmcu=atmega128 shl.s, but disassembling the .o crashes llvm-objdump -d 10.0.0.)

Thank you all for your creative answers, but I went with the lookup table as a macro. I find this to be the most flexible solution because I can easily have different lookup tables for various purposes at a fixed 7 cycles.

; @0 mask table
; @1 bit register
; @2 result register
.MACRO GetMask
    ldi     ZL,low(@0*2)    ; labels are word addresses, lpm needs a byte address
    ldi     ZH,high(@0*2)
    add     ZL,@1
    adc     ZH,ZERO
    lpm     @2,Z
.ENDM

bitmask_lookup:
    .DB 0x01,0x02,0x04,0x08,0x10,0x20,0x40,0x80
inverse_lookup:
    .DB ~0x01,~0x02,~0x04,~0x08,~0x10,~0x20,~0x40,~0x80
lrl2_lookup:
    .DB 0x04,0x08,0x10,0x20,0x40,0x80,0x01,0x02

ldi r16,2
GetMask bitmask_lookup, r16, r1 ; gives r1 = 0b00000100
GetMask inverse_lookup, r16, r2 ; gives r2 = 0b11111011
GetMask lrl2_lookup,    r16, r3 ; gives r3 = 0b00010000 (left rotate by 2)
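As a sanity check on the three tables and the example results, here is a minimal host-side C model of the macro. It is only a sketch: `get_mask` and `clear_bit` are made-up names, the tables are plain C arrays rather than program memory, and `clear_bit` shows the use case from the question (clearing a bit in a memory location) by AND-ing with the inverse mask.

```c
#include <assert.h>
#include <stdint.h>

/* Host-side copies of the three tables (not AVR program memory). */
const uint8_t bitmask_lookup[8] = {0x01,0x02,0x04,0x08,0x10,0x20,0x40,0x80};
const uint8_t inverse_lookup[8] = {0xFE,0xFD,0xFB,0xF7,0xEF,0xDF,0xBF,0x7F};
const uint8_t lrl2_lookup[8]    = {0x04,0x08,0x10,0x20,0x40,0x80,0x01,0x02};

/* Hypothetical model of GetMask: one table read, same work for every index. */
uint8_t get_mask(const uint8_t table[8], uint8_t bit)
{
    assert(bit < 8);                 /* the macro itself does no range check */
    return table[bit];
}

/* Clear one bit in a memory location using the inverse mask. */
void clear_bit(uint8_t *location, uint8_t bit)
{
    *location = (uint8_t)(*location & get_mask(inverse_lookup, bit));
}
```

With index 2 the model returns 0x04, 0xFB and 0x10 for the three tables, matching the comments on the GetMask lines above.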

Space is not so much of an issue, but speed is. However, I think this is a good compromise and I'm not forced to align data on quadwords. 7 vs 5 cycles is the price to pay.

I already have one "ZERO" register reserved through the whole program so it costs me nothing extra to do the 16bit addition.


 