Load a 16-bit (or bigger) immediate with Arm inline GCC assembly

Question

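Note: The examples here are simplified for brevity, so they do not justify my intent. If I were only writing to a memory location as in the example, C would be the best approach. However, I am doing things that I cannot do in C, so please do not answer that this particular example is best kept in C.

I am trying to load a register with a value, but I am stuck with 8-bit immediates.

My code: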

https://godbolt.org/z/8EE45Gerd

#include <cstdint>

void a(uint32_t value) {
    *(volatile uint32_t *)(0x21014) = value;
}

void b(uint32_t value) {
    asm (
        "push ip                                \n\t"
        "mov ip,       %[gpio_out_addr_high]    \n\t"
        "lsl ip,       ip,                   #8 \n\t"
        "add ip,       %[gpio_out_addr_low]     \n\t"
        "lsl ip,       ip,                   #2 \n\t"
        "str %[value], [ip]                     \n\t"
        "pop ip                                 \n\t"
        : 
        : [gpio_out_addr_low]  "I"((0x21014 >> 2)     & 0xff),
          [gpio_out_addr_high] "I"((0x21014 >> (2+8)) & 0xff),
          [value] "r"(value)
    );
}

// adding -march=ARMv7E-M will not allow 16-bit immediate
// void c(uint32_t value) {
//     asm (
//         "mov ip,       %[gpio_out_addr]    \n\t"
//         "str %[value], [ip]                     \n\t"
//         : 
//         : [gpio_out_addr]  "I"(0x1014),
//           [value] "r"(value)
//     );
// } 


int main() {
    a(20);
    b(20);
    return 0;
}

When I write C code (see a()), Godbolt shows it assembled as:

a(unsigned char):
        mov     r3, #135168
        str     r0, [r3, #20]
        bx      lr

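I think it uses MOV as a pseudo-instruction. When I want to do the same in assembly, I could put the value into some memory location and load it with LDR. I think that is how the C code gets assembled when I use -march=ARMv7E-M (MOV is replaced with LDR), but in many cases this is not practical for me because I will be doing other things.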
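The constant in that listing splits as 135168 = 0x21000 for the MOV plus the offset 20 = 0x14 in the STR. As a rough illustration of why some constants can be encoded directly and others cannot, here is a minimal sketch that assumes the classic A32 rule (an 8-bit value rotated right by an even amount). The helper names rotl32 and fitsRotatedByte are made up for this example, and Thumb-2 uses a related but different rule that is not modeled here.

```cpp
#include <cstdint>

// Rotate a 32-bit value left by r bits (r is taken modulo 32).
constexpr uint32_t rotl32(uint32_t v, unsigned r) {
    return (r % 32u) == 0u ? v : (v << (r % 32u)) | (v >> (32u - (r % 32u)));
}

// Hypothetical helper: true when v is an 8-bit value rotated right by an
// even amount (assumed A32 data-processing immediate rule).
constexpr bool fitsRotatedByte(uint32_t v) {
    for (unsigned r = 0; r < 32; r += 2) {
        // Undo a rotate-right by r; if what remains fits in 8 bits, v is encodable.
        if (rotl32(v, r) <= 0xffu) {
            return true;
        }
    }
    return false;
}

static_assert(0x21000 + 0x14 == 0x21014, "base plus offset gives the full address");
static_assert(fitsRotatedByte(0x21000), "0x21000 is 0x84 shifted left by 10");
static_assert(!fitsRotatedByte(0x21014), "the full address spans too many bits");
static_assert(!fitsRotatedByte(0x1014), "so does the 0x1014 used in c()");
```

Under that assumption, the base 0x21000 fits in one MOV while neither 0x21014 nor 0x1014 does, which is consistent with the compiler splitting the address into a base and an offset.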

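In the case of the address 0x21014, the lowest 2 bits are zero, so I can treat this 18-bit number as a 16-bit one when I shift it correctly, which is what I do in b(), but I still have to pass it as 8-bit immediates. However, in the Keil documentation I noticed a mention of 16-bit immediates: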

https://www.keil.com/support/man/docs/armasm/armasm_dom1359731146992.htm

https://www.keil.com/support/man/docs/armasm/armasm_dom1361289878994.htm

In ARMv6T2 and later, the ARM and Thumb instruction sets include:

 A MOV instruction that can load any value in the range 0x00000000 to 0x0000FFFF into a register. A MOVT instruction that can load any value in the range 0x0000 to 0xFFFF into the most significant half of a register, without altering the contents of the least significant half.

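I think my Cortex-M4 should be ARMv7E-M, so it should meet this "ARMv6T2 and later" requirement and be able to use 16-bit immediates.

But in the GCC inline assembly documentation I do not see any mention of this: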
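To make the arithmetic concrete, here is a small compile-time sketch comparing the two 8-bit pieces that b() passes with the two 16-bit halves that a MOV/MOVT pair would take. The constant names are invented for this illustration; only the address 0x21014 and the shift amounts come from the code above.

```cpp
#include <cstdint>

constexpr uint32_t kAddress = 0x21014;

// The two 8-bit immediates that b() passes through the "I" constraint.
constexpr uint32_t kLow8  = (kAddress >> 2)       & 0xff;  // 0x05
constexpr uint32_t kHigh8 = (kAddress >> (2 + 8)) & 0xff;  // 0x84

// What the mov / lsl #8 / add / lsl #2 sequence in b() rebuilds from them.
constexpr uint32_t kRebuilt = ((kHigh8 << 8) + kLow8) << 2;
static_assert(kRebuilt == kAddress, "8-bit pieces must rebuild the address");

// The two 16-bit halves that a MOV (low) and MOVT (high) pair would load.
constexpr uint32_t kLow16  = kAddress & 0xffff;  // 0x1014
constexpr uint32_t kHigh16 = kAddress >> 16;     // 0x0002
static_assert(((kHigh16 << 16) | kLow16) == kAddress,
              "16-bit halves must rebuild the address");
```

The 8-bit route only works here because the address happens to have its two lowest bits clear and fits in 18 bits; the 16-bit halves would work for any 32-bit value.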

https://gcc.gnu.org/onlinedocs/gcc/Machine-Constraints.html

When I enable the ARMv7E-M architecture and uncomment c(), where I use the regular "I" immediate, I get a compilation error:

<source>: In function 'void c(uint8_t)':
<source>:29:6: warning: asm operand 0 probably doesn't match constraints
   29 |     );
      |      ^
<source>:29:6: error: impossible constraint in 'asm'

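So I wonder whether there is a way to use 16-bit immediates with GCC inline assembly, or whether I am missing something (which would make my question irrelevant)?

Side question: is it possible to disable these pseudo-instructions in Godbolt? I have seen them used with RISC-V assembly as well, but I would prefer to see the disassembled real bytecode, to know exactly which instructions these pseudo/macro assembly instructions produce.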

Answer 1

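@Jester recommended in the comments either using the i constraint to pass larger immediates, or using a real C variable, initializing it with the desired value and letting the inline assembly take it. This sounds like the best solution: the less time spent in inline assembly, the better. People who want better performance often underestimate how well the C/C++ toolchain can optimize when given the correct code, and for many, rewriting the C/C++ code is the answer rather than redoing everything in assembly. @Peter Cordes mentioned not using inline assembly at all, and I agree. However, in this case the exact timing of some instructions was critical, and I could not risk the toolchain optimizing the timing of some instructions slightly differently.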
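Both suggestions can be sketched roughly as follows. This is only an illustration under stated assumptions: loadConstant and storeWord are hypothetical helper names, the MOVW/MOVT path is compiled only when the compiler reports a Thumb-2 target through the __thumb2__ macro, and every other target falls back to plain C++ so the sketch still builds on a desktop machine.

```cpp
#include <cstdint>

// Suggestion 1 (sketch): pass each 16-bit half through the generic "i"
// (any immediate) constraint instead of the encoding-restricted "I".
template <uint32_t VALUE>
static inline uint32_t loadConstant() {
    uint32_t result;
#if defined(__thumb2__)
    asm("movw %[out], %[low]  \n\t"   // out = low 16 bits, upper half cleared
        "movt %[out], %[high] \n\t"   // upper half = high 16 bits, lower half kept
        : [out] "=r"(result)
        : [low] "i"(VALUE & 0xffffu), [high] "i"(VALUE >> 16));
#else
    result = VALUE;  // Fallback for targets without this asm path (assumption)
#endif
    return result;
}

// Suggestion 2 (sketch): keep the value in an ordinary C++ variable and let
// the compiler decide how to get it into a register ("r" constraint).
static inline void storeWord(volatile uint32_t* address, uint32_t value) {
#if defined(__arm__)
    asm volatile("str %[value], [%[address]] \n\t"
                 :
                 : [address] "r"(address), [value] "r"(value)
                 : "memory");
#else
    *address = value;  // Fallback so the sketch also runs on a host machine
#endif
}
```

With the second form the asm body never sees an immediate at all, so the encoding limits of "I" stop mattering, and the compiler is free to materialize the address however it finds cheapest.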

Bit-banging a protocol is not ideal, and in most cases the answer is to avoid bit-banging, but in my case it was not that simple and the other approaches did not work:

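SPI could not be used to stream the data, as I needed to push more signals, with arbitrary lengths, while my hardware only supports 8-bit/16-bit transfers.
Tried DMA2GPIO and had jitter issues.
Tried an IRQ handler, but it has too much overhead and my performance dropped (as shown below, there are only 2 nops, so there is not much room to do anything in the idle time).
Tried pre-baking the stream bits (including the timing), but for 1 byte of real data I ended up storing 64 bytes of stream data, and reading it all back was much slower overall.
Pre-baked functions for each written value (with a lookup table holding one function per written value) worked great, actually too well: because the toolchain now had compile-time known values and was able to optimize them well, my TCK was above 40MHz. The problem was that I had to add a lot of delays to slow it down to the desired speed (8MHz), and this had to be done for each input value. That is fine when the length is 8 bits or less, but for a 32-bit length it cannot fit into flash memory (2^32 => 4294967296), and splitting a single 32-bit access into four 8-bit accesses introduced a lot of jitter on the TCK signal.
Implementing this peripheral in FPGA fabric would give me control over everything, and usually that is the right answer, but I wanted to try implementing it on a device without fabric.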
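The pre-baked-function idea from the list above can be sketched like this. It is not the code that was actually used: fakeGpio stands in for the real output register, the delays are left out, and shiftOutConstant, makeShiftTable and kShiftTable are names invented for the illustration.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <utility>

// Stand-in for the real GPIO output register (assumption for this sketch).
static volatile uint32_t fakeGpio = 0;

// One function per value: VALUE is known at compile time, so the compiler
// can unroll the loop and fold every shift and mask into constants.
template <uint8_t VALUE>
void shiftOutConstant() {
    for (int bit = 0; bit < 8; ++bit) {
        fakeGpio = (VALUE >> bit) & 1u;  // LSB first, one write per bit
    }
}

using ShiftFn = void (*)();

// Build the lookup table: entry N points at shiftOutConstant<N>.
template <std::size_t... I>
constexpr std::array<ShiftFn, sizeof...(I)> makeShiftTable(std::index_sequence<I...>) {
    return {{ &shiftOutConstant<static_cast<uint8_t>(I)>... }};
}

// 256 entries cover every 8-bit value; a 32-bit value would need 2^32.
constexpr auto kShiftTable = makeShiftTable(std::make_index_sequence<256>{});
```

Calling kShiftTable[value]() then dispatches to the function specialized for that value, which is why the approach only scales to short lengths.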

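Long story short, bit-banging is bad and mostly there are better ways around it, and unnecessary use of inline assembly might actually produce worse results without you knowing it, but in my case I needed it. In my previous code I was trying to focus on a simple question about immediates rather than go off on a tangent or into an XY-problem discussion.

Now back to the topic of passing bigger immediates to the assembly. Here is an implementation of a more real-world example: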

https://godbolt.org/z/5vbb7PPP5

#include <cstdint>

const uint8_t TCK = 2;
const uint8_t TMS = 3;
const uint8_t TDI = 4;
const uint8_t TDO = 5;

template<uint8_t number>
constexpr uint8_t powerOfTwo() {
    static_assert(number <8, "Output would overflow, the JTAG pins are close to base of the register and you shouldn't need PIN8 or above anyway");
    int ret = 1;
    for (int i=0; i<number; i++) {
        ret *= 2;
    }
    return ret;
}

template<uint8_t WHAT_SIGNAL>
__attribute__((optimize("-Ofast")))
uint32_t shiftAsm(const uint32_t length, uint32_t write_value) {
    uint32_t addressWrite = 0x40021014; // ODR register of GPIO port E (normally not hardcoded, but just for godbolt example it's like this)
    uint32_t addressRead  = 0x40021010; // IDR register of GPIO port E (normally not hardcoded, but just for godbolt example it's like this)

    uint32_t count     = 0;
    uint32_t shift_out = 0;
    uint32_t shift_in  = 0;
    uint32_t ret_value = 0;

    asm volatile (
    "cpsid if                                                  \n\t"  // Disable IRQ
    "repeatForEachBit%=:                                       \n\t"

    // Low part of the TCK
    "and.w %[shift_out],   %[write_value],    #1               \n\t"  // shift_out = write_value & 1
    "lsls  %[shift_out],   %[shift_out],      %[write_shift]   \n\t"  // shift_out = shift_out << pin_shift
    "str   %[shift_out],   [%[gpio_out_addr]]                  \n\t"  // GPIO = shift_out

    // On the first iteration this is redundant, as it processes the shift_in sampled in the previous iteration.
    // Doing it needlessly on the first iteration is safe, as it only processes zeros
    "lsr   %[shift_in],    %[shift_in],       %[read_shift]    \n\t"  // shift_in = shift_in >> TDO
    "and.w %[shift_in],    %[shift_in],       #1               \n\t"  // shift_in = shift_in & 1
    "lsl   %[ret_value],   #1                                  \n\t"  // ret = ret << 1
    "orr.w %[ret_value],   %[ret_value],      %[shift_in]      \n\t"  // ret = ret | shift_in

    // Prepare things that are needed toward the end of the loop, but can be done now
    "orr.w %[shift_out],   %[shift_out],      %[clock_mask]    \n\t"  // shift_out = shift_out | (1 << TCK)
    "lsr   %[write_value], %[write_value],    #1               \n\t"  // write_value = write_value >> 1
    "adds  %[count],       #1                                  \n\t"  // count++
    "cmp   %[count],       %[length]                           \n\t"  // if (count != length) then ....

    // High part of the TCK + sample
    "str   %[shift_out],   [%[gpio_out_addr]]                  \n\t"  // GPIO = shift_out
    "nop                                                       \n\t"
    "nop                                                       \n\t"
    "ldr   %[shift_in],    [%[gpio_in_addr]]                   \n\t"  // shift_in = GPIO
    "bne.n repeatForEachBit%=                                  \n\t"  // if (count != length) then  repeatForEachBit

    "cpsie if                                                  \n\t"  // Enable IRQ - the critical part finished

    // Process the last shift_in, which is normally done in the next iteration of the loop
    "lsr   %[shift_in],    %[shift_in],       %[read_shift]    \n\t"  // shift_in = shift_in >> TDO
    "and.w %[shift_in],    %[shift_in],       #1               \n\t"  // shift_in = shift_in & 1
    "lsl   %[ret_value],   #1                                  \n\t"  // ret = ret << 1
    "orr.w %[ret_value],   %[ret_value],      %[shift_in]      \n\t"  // ret = ret | shift_in

    // Outputs
    : [ret_value]       "+r"(ret_value),
      [count]           "+r"(count),
      [shift_out]       "+r"(shift_out),
      [shift_in]        "+r"(shift_in),
      [write_value]     "+r"(write_value)  // read-write: the loop shifts it right on every iteration

    // Inputs
    : [gpio_out_addr]   "r"(addressWrite),
      [gpio_in_addr]    "r"(addressRead),
      [length]          "r"(length),
      [write_shift]     "M"(WHAT_SIGNAL),
      [read_shift]      "M"(TDO),
      [clock_mask]      "I"(powerOfTwo<TCK>())

    // Clobbers
    : "memory", "cc"  // "cc": lsls, adds and cmp change the condition flags
    );

    return ret_value;
}

int main() {
    shiftAsm<TMS>(7,  0xff);                  // reset the target TAP controller
    shiftAsm<TMS>(3,  0x12);                  // go to some arbitrary TAP state
    shiftAsm<TDI>(32, 0xdeadbeef);            // write to target

    auto ret = shiftAsm<TDI>(16, 0x0000);     // read from the target

    return 0;
}
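For readers who do not read Thumb assembly fluently, the loop above corresponds roughly to the following plain C++ model. It is a behavioral sketch only: it has none of the timing guarantees that motivated the assembly, LoopbackPort is a made-up stand-in that feeds the TDI pin straight back to TDO, and shiftModel assumes length is at least 1 (the assembly loop tests the counter only after incrementing it).

```cpp
#include <cstdint>

constexpr uint8_t kTck = 2;  // same pin numbers as TCK, TDI and TDO above
constexpr uint8_t kTdi = 4;
constexpr uint8_t kTdo = 5;

// Hypothetical fake port: whatever is driven on TDI is read back on TDO.
struct LoopbackPort {
    uint32_t odr = 0;
    void write(uint32_t value) { odr = value; }
    uint32_t read() const { return ((odr >> kTdi) & 1u) << kTdo; }
};

// Behavioral model of the asm loop: shift out LSB first, sample TDO per bit.
inline uint32_t shiftModel(LoopbackPort& port, uint8_t signalPin,
                           uint32_t length, uint32_t writeValue) {
    uint32_t result = 0;
    for (uint32_t count = 0; count < length; ++count) {
        const uint32_t out = (writeValue & 1u) << signalPin;
        port.write(out);                         // low part of TCK
        writeValue >>= 1;
        port.write(out | (1u << kTck));          // high part of TCK
        const uint32_t in = port.read();         // sample the input register
        result = (result << 1) | ((in >> kTdo) & 1u);
    }
    return result;
}
```

Because the value is written least significant bit first while each sample is shifted in from the right, the first bit sampled ends up as the most significant of the returned bits.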

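@David Wohlferd's comment about doing less in assembly gives the toolchain more opportunity to further optimize the loading of the addresses into registers: when the function is inlined, it should not load the addresses again (so they are loaded only once across multiple read/write calls). Here it is with inlining enabled: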

https://godbolt.org/z/K8GYYqrbq

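The question is, was it worth it? I think so: my TCK is dead-on 8MHz, my duty cycle is close to 50%, and I have more confidence that the duty cycle will stay as it is. The sampling is also done when I expect it to be done, without worrying that it will get optimized differently under different toolchain settings.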

Solution 1
3 Accepted 2021-05-30 05:48:26

使用 Arm 內聯 GCC 程序集立即加載 16 位（或更大）

問題描述

1 個解決方案

解決方案1 3 已采納 2021-05-30 05:48:26

解決方案1
3 已采納 2021-05-30 05:48:26