Loading a 16-bit (or bigger) immediate with Arm inline GCC assembly

Question

Note: For brevity the examples are simplified, so they do not reflect what I'm actually trying to do. If I were just writing to a memory location exactly as in the example, then C would be the best approach. However, I'm doing things where I can't use C, so please do not downvote just because this specific example would be better kept in C.

I'm trying to load registers with values, but I'm stuck with 8-bit immediates.

My code:

https://godbolt.org/z/8EE45Gerd

#include <cstdint>

void a(uint32_t value) {
    *(volatile uint32_t *)(0x21014) = value;
}

void b(uint32_t value) {
    asm (
        "push ip                                \n\t"
        "mov ip,       %[gpio_out_addr_high]    \n\t"
        "lsl ip,       ip,                   #8 \n\t"
        "add ip,       %[gpio_out_addr_low]     \n\t"
        "lsl ip,       ip,                   #2 \n\t"
        "str %[value], [ip]                     \n\t"
        "pop ip                                 \n\t"
        : 
        : [gpio_out_addr_low]  "I"((0x21014 >> 2)     & 0xff),
          [gpio_out_addr_high] "I"((0x21014 >> (2+8)) & 0xff),
          [value] "r"(value)
    );
}

// adding -march=ARMv7E-M will not allow 16-bit immediate
// void c(uint32_t value) {
//     asm (
//         "mov ip,       %[gpio_out_addr]    \n\t"
//         "str %[value], [ip]                     \n\t"
//         : 
//         : [gpio_out_addr]  "I"(0x1014),
//           [value] "r"(value)
//     );
// } 


int main() {
    a(20);
    b(20);
    return 0;
}
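
The shift-and-add sequence in b() can be sanity-checked at compile time on any host. The sketch below only mirrors the arithmetic; the names kAddr, kLow, kHigh and rebuild are made up for this illustration and are not part of the original code. It relies on the address fitting in 18 bits with its two lowest bits clear.

```cpp
#include <cstdint>

// Address used in a() and b() above.
constexpr uint32_t kAddr = 0x21014;

// The two 8-bit pieces that b() passes through the "I" constraints.
constexpr uint32_t kLow  = (kAddr >> 2) & 0xff;        // bits 2..9 of the address
constexpr uint32_t kHigh = (kAddr >> (2 + 8)) & 0xff;  // bits 10..17 of the address

// Mirrors the mov / lsl #8 / add / lsl #2 sequence in b().
constexpr uint32_t rebuild(uint32_t high, uint32_t low) {
    return ((high << 8) + low) << 2;
}

static_assert(kLow == 0x05, "low piece");
static_assert(kHigh == 0x84, "high piece");
static_assert(rebuild(kHigh, kLow) == kAddr, "pieces must recombine to the address");
```

If the address had either of its two lowest bits set, or needed more than 18 bits, the last static_assert would fail and the two-piece trick would not be enough.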

When I write it in C (see a()), Godbolt shows it compiled to:

a(unsigned char):
        mov     r3, #135168
        str     r0, [r3, #20]
        bx      lr

I think it uses MOV as a pseudo-instruction. To do the same in assembly, I could put the value into some memory location and load it with LDR. I think that's how the C code gets assembled when I use -march=ARMv7E-M (the MOV gets replaced with LDR); however, in many cases this will not be practical for me, as I will be doing other things.

In the case of the 0x21014 address, the lowest 2 bits are zero, so I can treat this 18-bit number as a 16-bit one if I shift it correctly. That's what I'm doing in b(), but I still have to pass it as 8-bit immediates. However, in the Keil documentation I noticed a mention of 16-bit immediates:
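
As an aside, whether a MOV with a given constant needs to be a pseudo-instruction can be checked by hand. In classic ARM (A32) state a data-processing immediate is an 8-bit value rotated right by an even amount, and 135168 (0x21000) happens to fit that form. The sketch below is a host-side check under that assumption; the helper names are invented for illustration, and Thumb-2 uses a related but different immediate scheme that is not modelled here.

```cpp
#include <cstdint>

// Rotate left within 32 bits; n == 0 is handled separately to avoid shifting by 32.
constexpr uint32_t rotl32(uint32_t v, unsigned n) {
    return n == 0 ? v : (v << n) | (v >> (32 - n));
}

// True if v can be written as an 8-bit value rotated right by an even amount,
// i.e. the A32 "modified immediate" form (hypothetical helper name).
constexpr bool isA32ModifiedImmediate(uint32_t v) {
    for (unsigned rot = 0; rot < 32; rot += 2) {
        // Undo a rotate-right by 'rot' and see whether 8 bits are enough.
        if (rotl32(v, rot) <= 0xffu) {
            return true;
        }
    }
    return false;
}

static_assert(isA32ModifiedImmediate(0x21000), "0x21 rotated fits");
static_assert(!isA32ModifiedImmediate(0x21014), "full address does not fit");
```

This is consistent with the compiler output above, where the base 0x21000 goes into the MOV and the remaining 0x14 (20) is folded into the STR offset.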

https://www.keil.com/support/man/docs/armasm/armasm_dom1359731146992.htm

https://www.keil.com/support/man/docs/armasm/armasm_dom1361289878994.htm

In ARMv6T2 and later, both ARM and Thumb instruction sets include:
A MOV instruction that can load any value in the range 0x00000000 to 0x0000FFFF into a register.
A MOVT instruction that can load any value in the range 0x0000 to 0xFFFF into the most significant half of a register, without altering the contents of the least significant half.

I think my Cortex-M4 is ARMv7E-M, so it should meet this "ARMv6T2 and later" requirement and be able to use 16-bit immediates.

However, I do not see any such mention in the GCC inline assembly documentation:

https://gcc.gnu.org/onlinedocs/gcc/Machine-Constraints.html

And when I enable the ARMv7E-M arch and uncomment c(), where I use the regular "I" immediate, I get a compilation error:

<source>: In function 'void c(uint8_t)':
<source>:29:6: warning: asm operand 0 probably doesn't match constraints
   29 |     );
      |      ^
<source>:29:6: error: impossible constraint in 'asm'

So I wonder: is there a way to use 16-bit immediates with GCC inline assembly, or am I missing something (that would make my question irrelevant)?

Side question: is it possible to disable these pseudo-instructions in Godbolt? I have seen them used with RISC-V assembly as well, but I would prefer to see the real disassembled machine code, so I can tell exactly which instructions these pseudo/macro assembly instructions turned into.

Answer 1

@Jester recommended in the comments either using the "i" constraint to pass larger immediates, or using a real C variable initialized with the desired value and letting the inline assembly take it as an operand. This sounds like the best solution: the less time spent in inline assembly, the better. People chasing performance often underestimate how well the C/C++ toolchain can optimize when given correct code, and for many the answer is to rewrite the C/C++ code rather than redo everything in assembly. @Peter Cordes suggested not using inline assembly at all, and I concur. However, in this case the exact timing of some instructions was critical, and I couldn't risk the toolchain optimizing them slightly differently and changing that timing.
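
A minimal sketch of the "i" route, assuming a target that has MOVW/MOVT (ARMv6T2 and later according to the Keil text quoted in the question). The constant is split into two 16-bit halves on the C++ side and each half is passed as a plain "i" immediate. The names kGpioOutAddr, kAddrLow, kAddrHigh and storeViaMovwMovt are made up for this illustration, and the inline assembly part is only compiled when building for 32-bit Arm, so it is untested on a host build.

```cpp
#include <cstdint>

// ODR address taken from the real-world example further down.
constexpr uint32_t kGpioOutAddr = 0x40021014;

// Split the 32-bit constant into the halves MOVW and MOVT expect.
constexpr uint16_t kAddrLow  = kGpioOutAddr & 0xffffu;
constexpr uint16_t kAddrHigh = kGpioOutAddr >> 16;

static_assert(((uint32_t{kAddrHigh} << 16) | kAddrLow) == kGpioOutAddr,
              "halves must recombine to the full address");

// Only meaningful on 32-bit Arm with MOVW/MOVT; skipped everywhere else.
#if defined(__arm__) && defined(__ARM_ARCH) && (__ARM_ARCH >= 7)
inline void storeViaMovwMovt(uint32_t value) {
    uint32_t addr;  // scratch register chosen by the compiler
    asm volatile(
        "movw %[addr],  %[lo]     \n\t"  // addr = low 16 bits, upper half cleared
        "movt %[addr],  %[hi]     \n\t"  // addr[31:16] = high 16 bits
        "str  %[value], [%[addr]] \n\t"  // *addr = value
        : [addr] "=&r"(addr)             // early clobber: written before value is read
        : [lo] "i"(kAddrLow),
          [hi] "i"(kAddrHigh),
          [value] "r"(value)
        : "memory");
}
#endif
```

Letting the compiler pick the scratch register through an output operand avoids the push/pop of ip used in b(). The other suggestion, a C variable passed with an "r" constraint, is what the larger example below does with addressWrite and addressRead.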

Bit-banging protocols is not ideal, and in most cases the answer is to avoid it. In my case, however, it's not that simple, and the other approaches didn't work:

SPI couldn't be used to stream the data, as I needed to drive more signals and support arbitrary lengths, while my HW supported only 8-bit/16-bit transfers.
I tried DMA2GPIO and had issues with jitter.
I tried an IRQ handler, which has too much overhead, and my performance dropped (as you can see below, there are only 2 nops, so there is not much free time to do anything in).
I tried pre-baking a stream of bits (including the timing); however, for 1 byte of real data I ended up storing 64 bytes of stream data, and overall reading that much from memory was much slower.
Pre-baking functions for each write value (with a lookup table of functions, one per value) worked very well, actually too fast: the toolchain now had compile-time known values and was able to optimize very well, so my TCK was above 40MHz. The problem was that I had to add a lot of delays to slow it down to the desired speed (8MHz), and this had to be done for each input value. When the length was 8 bits or less it was fine, but for a 32-bit length it was not possible to fit it into the flash memory (2^32 => 4294967296 values), and splitting a single 32-bit access into four 8-bit accesses introduced a lot of jitter on the TCK signal.
Implementing this peripheral in FPGA fabric would let me be in control of everything, and typically this is the correct answer, but I wanted to try to implement this on a device that has no fabric.

Long story short: bit-banging is bad and there are usually better ways around it, and unnecessary use of inline assembly might actually produce worse results without you knowing. But in my case I needed it. In my previous code I was trying to focus on a simple question about the immediates and not go off on tangents or into an XY-problem discussion.

So, back to the topic of 'passing bigger immediates to the assembly': here is an implementation of a much more real-world example:

https://godbolt.org/z/5vbb7PPP5

#include <cstdint>

const uint8_t TCK = 2;
const uint8_t TMS = 3;
const uint8_t TDI = 4;
const uint8_t TDO = 5;

template<uint8_t number>
constexpr uint8_t powerOfTwo() {
    static_assert(number <8, "Output would overflow, the JTAG pins are close to base of the register and you shouldn't need PIN8 or above anyway");
    int ret = 1;
    for (int i=0; i<number; i++) {
        ret *= 2;
    }
    return ret;
}

template<uint8_t WHAT_SIGNAL>
__attribute__((optimize("-Ofast")))
uint32_t shiftAsm(const uint32_t length, uint32_t write_value) {
    uint32_t addressWrite = 0x40021014; // ODR register of GPIO port E (normally not hardcoded, but just for godbolt example it's like this)
    uint32_t addressRead  = 0x40021010; // IDR register of GPIO port E (normally not hardcoded, but just for godbolt example it's like this)

    uint32_t count     = 0;
    uint32_t shift_out = 0;
    uint32_t shift_in  = 0;
    uint32_t ret_value = 0;

    asm volatile (
    "cpsid if                                                  \n\t"  // Disable IRQ
    "repeatForEachBit%=:                                       \n\t"

    // Low part of the TCK
    "and.w %[shift_out],   %[write_value],    #1               \n\t"  // shift_out = write_value & 1
    "lsls  %[shift_out],   %[shift_out],      %[write_shift]   \n\t"  // shift_out = shift_out << pin_shift
    "str   %[shift_out],   [%[gpio_out_addr]]                  \n\t"  // GPIO = shift_out

    // On the first cycle this is redundant, as it processes the shift_in from the previous iteration.
    // The first iteration is safe to do extraneously as it's just shifting in zeros
    "lsr   %[shift_in],    %[shift_in],       %[read_shift]    \n\t"  // shift_in = shift_in >> TDO
    "and.w %[shift_in],    %[shift_in],       #1               \n\t"  // shift_in = shift_in & 1
    "lsl   %[ret_value],   #1                                  \n\t"  // ret = ret << 1
    "orr.w %[ret_value],   %[ret_value],      %[shift_in]      \n\t"  // ret = ret | shift_in

    // Prepare things that are needed toward the end of the loop, but can be done now
    "orr.w %[shift_out],   %[shift_out],      %[clock_mask]    \n\t"  // shift_out = shift_out | (1 << TCK)
    "lsr   %[write_value], %[write_value],    #1               \n\t"  // write_value = write_value >> 1
    "adds  %[count],       #1                                  \n\t"  // count++
    "cmp   %[count],       %[length]                           \n\t"  // if (count != length) then ....

    // High part of the TCK + sample
    "str   %[shift_out],   [%[gpio_out_addr]]                  \n\t"  // GPIO = shift_out
    "nop                                                       \n\t"
    "nop                                                       \n\t"
    "ldr   %[shift_in],    [%[gpio_in_addr]]                   \n\t"  // shift_in = GPIO
    "bne.n repeatForEachBit%=                                  \n\t"  // if (count != length) then  repeatForEachBit

    "cpsie if                                                  \n\t"  // Enable IRQ - the critical part finished

    // Process the last shift_in here, as normally that is done in the next iteration of the loop
    "lsr   %[shift_in],    %[shift_in],       %[read_shift]    \n\t"  // shift_in = shift_in >> TDO
    "and.w %[shift_in],    %[shift_in],       #1               \n\t"  // shift_in = shift_in & 1
    "lsl   %[ret_value],   #1                                  \n\t"  // ret = ret << 1
    "orr.w %[ret_value],   %[ret_value],      %[shift_in]      \n\t"  // ret = ret | shift_in

    // Outputs
    : [ret_value]       "+r"(ret_value),
      [count]           "+r"(count),
      [shift_out]       "+r"(shift_out),
      [shift_in]        "+r"(shift_in),
      [write_value]     "+r"(write_value)   // shifted right inside the loop, so it must be read-write

    // Inputs
    : [gpio_out_addr]   "r"(addressWrite),
      [gpio_in_addr]    "r"(addressRead),
      [length]          "r"(length),
      [write_shift]     "M"(WHAT_SIGNAL),
      [read_shift]      "M"(TDO),
      [clock_mask]      "I"(powerOfTwo<TCK>())

    // Clobbers
    : "memory", "cc"                        // lsls/adds/cmp change the condition flags
    );

    return ret_value;
}

int main() {
    shiftAsm<TMS>(7,  0xff);                  // reset the target TAP controller
    shiftAsm<TMS>(3,  0x12);                  // go to some arbitrary TAP state
    shiftAsm<TDI>(32, 0xdeadbeef);            // write to target

    auto ret = shiftAsm<TDI>(16, 0x0000);     // read from the target

    return 0;
}
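
To make the data flow of the assembly loop easier to follow, here is a plain C++ model of it that can be evaluated on a host. This is a sketch, not the timing-critical code: it has no real GPIO access, the loopbackRead helper is a hypothetical target that simply echoes the driven data pin back on TDO, and it uses a for loop where the assembly is effectively a do-while (so the assembly must not be called with a length of 0).

```cpp
#include <cstdint>

constexpr uint8_t TCK = 2;
constexpr uint8_t TDO = 5;

// Hypothetical loopback target: whatever is driven on dataPin appears on TDO.
constexpr uint32_t loopbackRead(uint32_t gpioOut, uint8_t dataPin) {
    return ((gpioOut >> dataPin) & 1u) << TDO;
}

// Same bit order as the assembly: write_value goes out LSB first,
// and each sampled bit is shifted into the result from the right.
constexpr uint32_t shiftModel(uint8_t dataPin, uint32_t length, uint32_t writeValue) {
    uint32_t ret = 0;
    for (uint32_t count = 0; count < length; ++count) {
        uint32_t out = (writeValue & 1u) << dataPin;  // TCK low, data bit on its pin
        out |= (1u << TCK);                           // TCK high, data unchanged
        uint32_t in = loopbackRead(out, dataPin);     // sample while TCK is high
        ret = (ret << 1) | ((in >> TDO) & 1u);        // ret = (ret << 1) | sampled bit
        writeValue >>= 1;                             // next bit
    }
    return ret;
}

// With the loopback target the result is the bit-reversed input.
static_assert(shiftModel(4, 4, 0b0011) == 0b1100, "LSB-first out, MSB-first in");
```

The first sampled bit ends up as the most significant bit of the returned value, which matches the shift-left-then-or sequence on ret_value in the assembly.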

@David Wohlferd's comment was about writing less assembly: it gives the toolchain more chances to optimize the 'load of addresses into the registers' further. With inlining it shouldn't load the addresses again (so they are loaded only once even though there are multiple invocations of reads/writes). Here it is with inlining enabled:

https://godbolt.org/z/K8GYYqrbq

And the question: was it worth it? I think yes. My TCK is spot on at 8MHz and my duty cycle is close to 50%, and I have more confidence that the duty cycle will stay as it is. The sampling happens exactly when I expect it to, and I don't have to worry about it getting optimized differently with different toolchain settings.

1 answers

solution1
3 ACCPTED 2021-05-30 05:48:26
