Something weird with arm cortex-m4 cycle count

I recently used a board (LPCXpresso 5411x) to do some computation, and we tried to shave off as many cycles as we could to meet the running-time requirement of our particular application, so I needed to do some research on how cortex-m4 instructions cost cycles. And I've found many things that seem weird (and that couldn't be explained by anything I found on the internet).

I used DWT->CYCCNT to count the cycles consumed by a function I want to test.

uint32_t start_cycle, end_cycle;

__asm volatile (
  "LDR %[s1], [%[a]], #0\n\t"
  :[s1] "=&r"(start_cycle): [a] "r"(&(DWT->CYCCNT)):);

AddrSumTest();
__asm volatile (
  "LDR %[s1], [%[a]], #0\n\t"
  :[s1] "=&r"(end_cycle): [a] "r"(&(DWT->CYCCNT)):);

printf("inside the func() cycles: %u\n", (unsigned)(end_cycle - start_cycle));

Here is how my function is defined:

__attribute__((always_inline)) static inline void AddrSumTest(){
    uint32_t x, y, i, q;

    __asm volatile (
        "nop\n\t"
        :[x] "=r" (x), [y] "=r" (y), [i] "=r" (i), [q] "=r" (q):);
}
  • According to the Arm Infocenter, the MOV instruction should cost one cycle, but I've found that

the following instructions cost 8 cycles (not 3, because extra cycles are needed to read from DWT->CYCCNT):

  "nop\n\t"
  "MOV %[x], #2\n\t"
  "nop\n\t"

after adding another MOV instruction, 10 cycles are needed for the following sequence (why not 9 cycles?):

  "nop\n\t"
  "MOV %[x], #2\n\t"
  "MOV %[y], #3\n\t"
  "nop\n\t"

and the assembly code for the latter case is:

4000578:    f853 4b00   ldr.w   r4, [r3], #0
400057c:    bf00        nop
400057e:    f04f 0502   mov.w   r5, #2
4000582:    f04f 0603   mov.w   r6, #3
4000586:    bf00        nop
4000588:    f853 1b00   ldr.w   r1, [r3], #0
400058c:    4805        ldr r0, [pc, #20]   ;(40005a4<test_AddrSum+0x30>)
400058e:    1b09        subs    r1, r1, r4
4000590:    f000 f80e   bl  40005b0 <__printf_veneer>

The two ldrs are the reads from DWT->CYCCNT. Besides, it's also strange why this would cost 10 cycles; what I estimate is 2 (from the ldrs) + 4 = 6.

By the way, the board doesn't have any cache; I store the code in SRAMX and the stack is in SRAM2.

Am I missing something, and is there any way I can figure out how every cycle is consumed? Besides, I'm also confused by the data dependencies of the cortex-m4.

Taking a variation of this: I don't have that chip but have others; in this case I'm using a TI cortex-m4. (The ST parts have a cache in front of the flash that I don't think you can turn off and that, as designed, affects performance.)

00000082 <test>:
  82:   f3bf 8f4f   dsb sy
  86:   f3bf 8f6f   isb sy
  8a:   6802        ldr r2, [r0, #0]
  8c:   46c0        nop         ; (mov r8, r8)
  8e:   46c0        nop         ; (mov r8, r8)
  90:   46c0        nop         ; (mov r8, r8)
  92:   46c0        nop         ; (mov r8, r8)
  94:   46c0        nop         ; (mov r8, r8)
  96:   46c0        nop         ; (mov r8, r8)
  98:   f240 0102   movw    r1, #2
  9c:   f240 0103   movw    r1, #3
  a0:   46c0        nop         ; (mov r8, r8)
  a2:   46c0        nop         ; (mov r8, r8)
  a4:   46c0        nop         ; (mov r8, r8)
  a6:   46c0        nop         ; (mov r8, r8)
  a8:   46c0        nop         ; (mov r8, r8)
  aa:   46c0        nop         ; (mov r8, r8)
  ac:   46c0        nop         ; (mov r8, r8)
  ae:   6803        ldr r3, [r0, #0]
  b0:   1ad0        subs    r0, r2, r3
  b2:   4770        bx  lr

So without the second movw it takes 0x11 clocks from flash, and between 0x10 and 0x11 from ram depending on alignment. When the thumb2 instruction is aligned on a word boundary, it takes a clock longer than when it is unaligned.

using the thumb instruction 0x2102

00000000 20001016 00000010 
00000002 20001018 00000010 
00000004 2000101A 00000010 
00000006 2000101C 00000010 

using the thumb2 extension 0xf240, 0x0102

00000000 20001016 00000010 
00000002 20001018 00000011 
00000004 2000101A 00000010 
00000006 2000101C 00000011 

using the thumb2 extensions 0xf240, 0x0102, 0xf240, 0x0103

00000000 20001016 00000012 
00000002 20001018 00000013 
00000004 2000101A 00000012 
00000006 2000101C 00000013 

And this is not really a surprise; it likely has to do with fetching. These microcontrollers are much simpler than the full-sized arms. A full-sized core will fetch, say, 8 instructions per fetch, and where things lie within the fetch line can affect performance, more so with loops and with where the branch lands in the fetch line (it doesn't matter whether the cache is on or off). Branches also have branch predictors you can turn on and off, and those vary in design.
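If you want to reproduce that alignment effect in C on your part, one rough sketch is below; it assumes the CMSIS device header is included and DWT->CYCCNT is already enabled, and the function name is just for illustration. Adding or removing the leading 16-bit nop shifts the two 32-bit movw encodings by a halfword relative to a word boundary.

/* force the test code to start at a known boundary, then toggle the nop */
__attribute__((noinline, aligned(8)))
uint32_t timed_movw_pair(void)
{
    uint32_t start, end, x;

    start = DWT->CYCCNT;
    __asm volatile (
        "nop\n\t"                 /* remove this to change the alignment */
        "movw %[x], #2\n\t"
        "movw %[x], #3\n\t"
        : [x] "=r" (x)
        :
        : "memory");              /* keep the counter reads on either side */
    end = DWT->CYCCNT;

    (void)x;
    return end - start;
}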

This particular chip says that above 40MHz it enables a prefetch that fetches one word, implying that below that it fetches one halfword (the bus is likely a word wide, so it reads the same address twice to get the two instructions there... why?)

With other chips (cortex-ms as well as others) you have to control the wait states on the flash; sometimes the flash is half the speed of the ram, and the same code, the same machine code, runs faster from ram even at low speeds, and it only gets worse as you increase the clock and increase the number of wait states on the flash to keep its speed in check.
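One way to see the flash-versus-ram difference without touching the linker script is to copy the machine code under test into a ram buffer and call it there; a rough sketch, assuming the ram region is executable and the core has no caches to maintain (the opcode array is only a placeholder):

#include <stdint.h>
#include <string.h>

/* thumb opcodes of the routine to time; the lone bx lr is a placeholder */
static const uint16_t code_under_test[] = { 0x4770 /* bx lr */ };

static uint32_t ram_copy[32];                 /* word-aligned ram buffer */

void call_copy_in_ram(void)
{
    memcpy(ram_copy, code_under_test, sizeof(code_under_test));
    /* bit 0 of the address must be set so the call stays in thumb state */
    void (*fn)(void) = (void (*)(void))((uintptr_t)ram_copy | 1u);
    fn();
}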

The ST family in particular has some marketing term for a prefetch-cache thing they put in that you can't disable. You can do a dsb/isb just before the code under test and, for example, see the effects of wait states for a single pass, but if you are running a test loop

test_loop: subs r3,#1
bne test_loop

and running it a lot of times, those few clocks at the beginning still show up but are small, just like using a cache; but you should still see fetch-line effects even against a cache, if the processor lets you see them.
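A rough C sketch of that kind of loop-based measurement, assuming the CMSIS device header is included (so DWT, __DSB() and __ISB() are available) and the cycle counter is enabled; the function name and iteration count are just examples:

#include <stdint.h>

#define ITERATIONS 1000u   /* arbitrary; large enough to amortize startup effects */

uint32_t average_cycles(void (*code_under_test)(void))
{
    uint32_t start, end, i;

    __DSB();                     /* drain outstanding writes */
    __ISB();                     /* flush the pipeline before timing */
    start = DWT->CYCCNT;
    for (i = 0; i < ITERATIONS; i++) {
        code_under_test();       /* call overhead is included in the count */
    }
    end = DWT->CYCCNT;

    return (end - start) / ITERATIONS;   /* unsigned math handles wraparound */
}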

Some chips have a flash prefetch you can enable or disable, which, particularly with loops, can hurt performance rather than help if things line up just right so that the prefetcher reads well past the end of the loop.

ARM IP stops at the arm busses on the edge of the core (AXI, AMBA, AHB, APB, whatever). In general you might have ARM IP for an L2 cache (not in one of these microcontrollers), and you may buy some arm IP to help you with their bus, but eventually the chip has chip-specific stuff in it, which arm has nothing to do with and which is not consistent from chip vendor to chip vendor, in particular the flash and sram interfaces.

First off, there is no reason to expect predictable results from a pipelined processor. As shown above, and as is really easy to show with a two-instruction loop, the same machine code can vary widely in performance due to alignment alone, but also due to factors you control directly or indirectly: flash wait states, and the relative speed of the clock vs the flash. If the boundary between N and N+1 wait states on our device is at 24MHz, then 24MHz at N wait states is much faster than 24MHz at N+1 wait states. 28MHz (N+1 wait states) is faster than 24MHz at N+1 wait states, and eventually the cpu clock may overcome the wait state: you can find a cpu speed that outperforms 24MHz at N wait states as far as overall wall-clock performance goes, though not in cpu clocks counted; the cpu clocks counted, if affected by the flash wait states, will always be affected by the flash wait states.
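As a purely illustrative example (the numbers are made up, not taken from any datasheet), suppose one wait state means 2 clocks per flash access and two wait states means 3:

24MHz, 1 wait state:   2 clocks per access = 2/24MHz ≈  83 ns
24MHz, 2 wait states:  3 clocks per access = 3/24MHz = 125 ns
48MHz, 2 wait states:  3 clocks per access = 3/48MHz ≈  62 ns

The counted cpu clocks per access only get worse when a wait state is added, whatever the cpu clock, but the wall-clock time at 48MHz with 2 wait states can still beat 24MHz with 1.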

The srams tend not to have wait states and run as fast as the cpu, but there are probably exceptions to that. No doubt the peripherals have limits; many vendors have rules about peripheral clocks (this one can't be above 32MHz even though the part goes to 48, that kind of thing), so a benchmark that accesses a peripheral will take a different number of cpu clocks at different cpu/system speed settings.

You also have configurable options in the processor, basically compile-time options. The cortex-m4 doesn't advertise this, but the cortex-m0+ can be configured for a 16- or 32-bit instruction-fetch width. I don't have visibility into that source code, so it may be something that has to be fixed at compile time, or something where, if you choose, you can set up a control register and make it runtime configurable, or perhaps there is logic that says if the pll settings are such-and-such then force one way, else the other, and so on. So even if you have two chips from different vendors with the same rev and model of cpu core, that doesn't mean they will behave the same. Not to mention the chip vendor has the source code and can make modifications.

So trying to predict cycle counts on a pipelined processor, in a system you don't have full visibility into, is not going to happen. There will be times when you add an extra nop and it gets faster, times when you add one and it gets slower as one would expect, and times when it doesn't change. And if a nop can do that, then any other instruction can as well.

Not to mention messing with the pipe itself: these cortex-ms have really short pipes, so we are told, so forcing a sequence of instructions with a lot of dependencies versus a similar sequence without them won't have as big an effect.

Take the same machine code under test and run it on several cortex-m4s from different vendors (or even cortex-m3s and cortex-m7s as well), from flash and from ram, with different settings, and there should be no surprise if the execution time in cpu ticks varies.
