In my ATtiny84a AVR assembly program I end up with a bit number between 0 and 7 in a register, let's say r16. Now I need to create a mask with that bit number set. To make it more complicated, the timing of the operation must be the same regardless of which bit is set.
For example if r16 = 5 the resulting mask will be 0x20 (bit 5 set).
So far I have shifted a bit into position with LSL, using r16 (the bit number) as a loop counter. To keep the timing exact regardless of the bit number, I then run a dummy loop of NOPs 8-r16 times.
The assembly instruction SBR sets bit(s) in a register from a mask, so it can't be used. The assembly instruction SBI sets a bit in an I/O register from a bit number, but the bit number must be a constant, not a register (otherwise I could have used an I/O register as a temp register).
The mask is then used to clear a bit in a memory location, so if there is another solution to do that from a bit number in a register, then it's fine too.
I have another solution to try out (shift based, with carry), but I was hoping that someone has a more elegant solution than loops and shifts.
I think your hunch with shifts and carries is an elegant solution. You'd basically decrement the index register, set the carry when the index was zero before the decrement, and then shift the carry into the output register.
You can use subtract (subi) to do the decrement, which automatically sets the carry bit on the step where the index is 0, because the subtraction borrows.
You can use a rotate right instead of the shift, since this lets you move the bits in the right direction to match the decrement.
Then you can get really tricky and use a sentinel bit in the output as a pseudo loop counter to terminate after 8 loop iterations.
So something like...
; Assume r16 is the index 0-7 of the bit to set in the output byte
; Assume r17 is the output byte
; r17 output will be 0 if r16 input is out of bounds
; r16 is clobbered in the process (ends up as r16-8)
ldi r17, 0b10000000 ; Sort of a pseudo-counter. When we see this
; marker bit fall off the right end
; then we know we did 8 rotations
loop:
subi r16,1 ; decrement index by 1; carry is set if r16 was 0 (borrow)
ror r17 ; rotate output right, carry into the high bit
brcc loop ; continue until we see our marker bit come out
I count 4 words (8 bytes) of storage and 32 cycles for this operation on all AVRs (1 for the ldi, then 8 iterations of subi + ror + brcc at 4 cycles each, minus 1 because the final brcc is not taken), so I think this is the winner on size, surprisingly (even to me), beating out the strong field of lookup-table based entries.
It also features sensible handling of out-of-bounds conditions, and no registers are changed besides the input and output. The repetitive rotates will also help prevent carbon deposit buildup in the ALU shifter gates.
Many thanks to @ReAl and @PeterCordes, whose guidance and inspiration made this code possible. :)
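To check the reasoning behind the sentinel loop without hardware, here is a minimal host-side model in C. It is only a sketch of the logic, not AVR code: the function name `mask_by_sentinel` and the iteration counter are made up for illustration, and the carry flag is modeled by hand under the assumption that `subi` sets carry on borrow and `ror` rotates through carry.

```c
#include <stdint.h>

/* Host-side model of the subi / ror / brcc loop (hypothetical helper).
 * Returns the resulting mask; *iterations receives the loop count. */
uint8_t mask_by_sentinel(uint8_t index, unsigned *iterations)
{
    uint8_t out = 0x80;      /* ldi r17, 0b10000000 : sentinel bit */
    unsigned carry;
    unsigned count = 0;

    do {
        /* subi r16,1 : carry (borrow) is set only when r16 was 0 */
        unsigned borrow = (index == 0) ? 1u : 0u;
        index = (uint8_t)(index - 1);

        /* ror r17 : bit 0 leaves into carry, old carry enters bit 7 */
        carry = out & 1u;
        out = (uint8_t)((out >> 1) | (borrow << 7));

        count++;
    } while (!carry);        /* brcc loop : repeat until the sentinel drops out */

    if (iterations) {
        *iterations = count;
    }
    return out;
}
```

In this model the loop runs 8 times for every input, which is what makes the timing independent of the bit number, and any index above 7 yields 0.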
9 words, 9 cycles
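; each sbrc + one-word instruction pair takes 2 cycles, skipped or not
ldi r17, 1
; bit 2 of the index (weight 4)
sbrc r16, 2 ; skip the swap if n < 4
swap r17 ; 00000001 -> 00010000, effectively shift left by 4
; bit 1 of the index (weight 2): two single shifts
sbrc r16, 1
lsl r17
sbrc r16, 1
lsl r17
; bit 0 of the index (weight 1)
sbrc r16, 0
lsl r17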
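The decomposition above (shift by 4, then by 2, then by 1, each guarded by one bit of the index) can be checked with a small host-side C model. This is only a sketch of the logic: the function name `mask_by_bit_tests` is hypothetical, and the C `if` statements stand in for the `sbrc` skips, so it says nothing about timing on the host.

```c
#include <stdint.h>

/* Host-side model of the sbrc / swap / lsl sequence (hypothetical helper). */
uint8_t mask_by_bit_tests(uint8_t index)
{
    uint8_t out = 1;                                   /* ldi r17, 1 */

    if (index & 4) {                                   /* sbrc r16, 2 */
        out = (uint8_t)((out << 4) | (out >> 4));      /* swap r17 */
    }
    if (index & 2) {                                   /* sbrc r16, 1 */
        out = (uint8_t)(out << 1);                     /* lsl r17 */
    }
    if (index & 2) {                                   /* sbrc r16, 1 */
        out = (uint8_t)(out << 1);                     /* lsl r17 */
    }
    if (index & 1) {                                   /* sbrc r16, 0 */
        out = (uint8_t)(out << 1);                     /* lsl r17 */
    }
    return out;
}
```

Only bits 0 to 2 of the index are ever tested, so in this model higher bits are ignored rather than producing 0.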
Since your output has only 8 variants, you can use a lookup table. It performs exactly the same operations whatever the input is, and therefore has exactly the same execution time.
ldi r30, low(shl_lookup_table * 2) // Load the table address into register Z
ldi r31, high(shl_lookup_table * 2)
clr r1 // Make zero
add r30, r16 // Add our r16 to the address
adc r31, r1 // Add zero with carry to the upper half of Z
lpm r17, Z // Load a byte from program memory into r17
ret // assuming we are in a routine, i.e. call/rcall was performed
...
shl_lookup_table:
.db 0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80
An 8-byte aligned lookup table simplifies indexing and should be good for AVR chips that support lpm (Load from Program Memory). (Optimized from @AterLux's answer.) Aligning the table by 8 means all 8 entries have the same high byte of their address, and there is no wrapping of the low 3 bits, so we can use ori instead of having to negate the address for subi. (adiw only works for 0..63, so it might not be able to represent an address.)
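The claim that OR can replace ADD for an 8-byte aligned table can be sanity-checked with a few lines of host-side C. This is a sketch under stated assumptions: `or_equals_add` is a made-up helper, and the addresses passed to it are arbitrary example values, not real symbol addresses from any linker map.

```c
#include <stdint.h>

/* Returns 1 when OR-ing the index into the low address byte produces
 * the same 16-bit Z pointer as a full 16-bit addition would. */
int or_equals_add(uint16_t table_addr, uint8_t index)
{
    uint8_t zl = (uint8_t)(index | (table_addr & 0xFF));  /* ori r30, lo8(table) */
    uint8_t zh = (uint8_t)(table_addr >> 8);              /* ldi r31, hi8(table) */
    uint16_t z = (uint16_t)(((uint16_t)zh << 8) | zl);

    return z == (uint16_t)(table_addr + index);
}
```

For an aligned example address such as 0x01F8 this holds for every index in 0..7, while an unaligned one such as 0x01FC fails as soon as the index overlaps the set low bits.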
I'm showing the best-case scenario where you can conveniently generate the input in r30 (low half of Z) in the first place; otherwise you need a mov. Also, this becomes too short to be worth calling as a function, so I'm not showing a ret, just a code fragment.
Assumes the input is valid (in 0..7); consider @ReAl's answer if you need to ignore high bits, or just use andi r30, 0x7.
If you can easily reload Z after this, or didn't need it preserved anyway, this is great. If clobbering Z sucks, you could consider building the table in RAM during initial startup (with a loop) so you could use X or Y for the pointer with a data load instead of lpm. The same applies if your AVR doesn't support lpm.
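As a rough illustration of the "build the table in RAM at startup" idea, here is a minimal C sketch. The names `ram_mask_table` and `init_ram_mask_table` are hypothetical, and on a real AVR you would more likely write the equivalent loop in assembly with st X+ or st Y+.

```c
#include <stdint.h>

/* Hypothetical RAM copy of the mask table, filled once at startup. */
uint8_t ram_mask_table[8];

void init_ram_mask_table(void)
{
    uint8_t mask = 1;

    for (uint8_t i = 0; i < 8; i++) {
        ram_mask_table[i] = mask;      /* entry i holds 1 << i */
        mask = (uint8_t)(mask << 1);   /* next bit position */
    }
}
```

After initialization, a lookup is a plain data load indexed by the bit number, which leaves Z free at the cost of 8 bytes of RAM.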
## gas / clang syntax
### Input: r30 = 0..7 bit position
### Clobbers: r31. (addr of a 256-byte chunk of program memory where you might have other tables)
### Result: r17 = 1 << r30
ldi r31, hi8(shl_lookup_table) // Same high byte for all table elements. Could be hoisted out of a loop
ori r30, lo8(shl_lookup_table) // Z = table | bitpos = &table[bitpos] because alignment
lpm r17, Z
.section .rodata
.p2align 3 // 8-byte alignment so low 3 bits of addresses match the input.
// ideally place it where it will be aligned by 256, and drop the ORI
// but .p2align 8 could waste up to 255 bytes of space! Use carefully
shl_lookup_table:
.byte 0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80
If you can locate the table at a 256-byte alignment boundary, then lo8(table) = 0, so you can drop the ori and just use r30 directly as the low byte of the address.
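Costs for the version with ori: 3 words of code plus the 8-byte table, and 5 cycles (1 each for ldi and ori, 3 for lpm). That does not include reloading Z with something afterwards, or worse, saving/restoring Z. (If Z is precious at the point you need this, consider a different strategy.)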
In a loop, or if you need other data in the same 256B chunk of program memory, the ldi r31, hi8 can be hoisted / done only once.
If you can align the table by 256, that saves a word of code and a cycle of time. If you also hoist the ldi out of the loop, that leaves just the 3-cycle lpm.
(Untested, I don't have an AVR toolchain other than clang -target avr. I think GAS / clang want just normal symbol references, and handle the symbol * 2 internally. This does assemble successfully with clang -c -target avr -mmcu=atmega128 shl.s, but disassembling the .o crashes llvm-objdump -d 10.0.0.)
Thank you all for your creative answers, but I went with the lookup table as a macro. I find this to be the most flexible solution because I can easily have different lookup tables for various purposes at a fixed 7 cycles.
; @0 mask table
; @1 bit register
; @2 result register
; ZERO is a register that always holds 0
.MACRO GetMask
ldi ZL,low(@0*2) ; labels are word addresses, lpm needs a byte address
ldi ZH,high(@0*2)
add ZL,@1
adc ZH,ZERO
lpm @2,Z
.ENDM
bitmask_lookup:
.DB 0x01,0x02,0x04,0x08,0x10,0x20,0x40,0x80
inverse_lookup:
.DB ~0x01,~0x02,~0x04,~0x08,~0x10,~0x20,~0x40,~0x80
lrl2_lookup:
.DB 0x04,0x08,0x10,0x20,0x40,0x80,0x01,0x02
ldi r16,2
GetMask bitmask_lookup, r16, r1 ; gives r1 = 0b00000100
GetMask inverse_lookup, r16, r2 ; gives r2 = 0b11111011
GetMask lrl2_lookup, r16, r3 ; gives r3 = 0b00010000 (left rotate by 2)
Space is not so much of an issue, but speed is. However, I think this is a good compromise, and I'm not forced to align data on quadwords. 7 vs 5 cycles is the price to pay.
I already have a "ZERO" register reserved throughout the whole program, so the 16-bit addition costs me nothing extra.
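For checking the three table contents by hand, the entries can be expressed as small C functions. This is a host-side sketch with hypothetical names (`bitmask_entry`, `inverse_entry`, `lrl2_entry`, `clear_bit`); it only mirrors what the .DB lines above contain and how the inverse mask would clear a bit in a byte.

```c
#include <stdint.h>

/* Entry n of bitmask_lookup: bit n set. */
uint8_t bitmask_entry(uint8_t n)
{
    return (uint8_t)(1u << (n & 7));
}

/* Entry n of inverse_lookup: every bit set except bit n. */
uint8_t inverse_entry(uint8_t n)
{
    return (uint8_t)~bitmask_entry(n);
}

/* Entry n of lrl2_lookup: the bit mask rotated left by 2 within a byte. */
uint8_t lrl2_entry(uint8_t n)
{
    uint8_t m = bitmask_entry(n);
    return (uint8_t)((m << 2) | (m >> 6));
}

/* Clearing bit n of a byte with the inverse mask (an AND on the AVR). */
uint8_t clear_bit(uint8_t value, uint8_t n)
{
    return (uint8_t)(value & inverse_entry(n));
}
```

With n = 2 these give 0x04, 0xFB and 0x10, matching the comments on the three GetMask calls.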