Number of instructions used ARMv7

I am trying to figure out how many CPU cycles will be used to execute the delay function:

delay:
 subs r0, #1
 bmi end_delay
 b delay
 end_delay:
 bx lr

I feel intuitively that 1 CPU cycle should be used for each instruction, so if we began with r0 = 4 it would take 11 CPU cycles to complete the code above. Is that correct?

Given that most ARM CPUs have 3-8 pipeline stages, it is difficult to say that most instructions take 1 CPU cycle to complete. Ideally a pipelined CPU retires one instruction every clock cycle, but the code above contains branches, which makes it hard to judge when each instruction retires. We don't know how the branches are handled, since that depends on the branch predictor present in the processor design. If the prediction is correct, no bubbles are inserted in the pipeline; if it is incorrect, the number of bubbles depends on the internal pipeline structure. For an ideal 5-stage pipeline there would be 2 bubbles inserted for every misprediction, but again this depends on the internal micro-architecture implementation. As a result it is difficult to accurately predict how many cycles the above code would take.
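To put rough numbers on this, here is a small sketch in C that walks the loop from the question, counts executed instructions and taken branches, and then charges a fixed number of bubbles per taken branch. The `delay_trace` struct, the function names and the penalty parameter are invented for illustration, and charging every taken branch the same penalty is an assumption, not the behaviour of any particular ARM core.

```c
#include <stdint.h>

typedef struct {
    uint32_t instructions;    /* instructions executed, including bx lr */
    uint32_t taken_branches;  /* taken bmi, taken b, and the final bx lr */
} delay_trace;

/* Walk the delay loop from the question for a given starting r0. */
delay_trace trace_delay(uint32_t r0)
{
    delay_trace t = {0, 0};
    for (;;) {
        r0 -= 1;                    /* subs r0, #1 */
        t.instructions++;
        t.instructions++;           /* bmi end_delay */
        if (r0 & 0x80000000u) {     /* N flag set: branch taken */
            t.taken_branches++;
            break;
        }
        t.instructions++;           /* b delay (always taken) */
        t.taken_branches++;
    }
    t.instructions++;               /* bx lr */
    t.taken_branches++;
    return t;
}

/* Hypothetical model: one cycle per instruction plus a fixed
   bubble penalty for every taken branch. */
uint32_t model_cycles(delay_trace t, uint32_t bubbles_per_taken_branch)
{
    return t.instructions + t.taken_branches * bubbles_per_taken_branch;
}
```

With r0 = 4 the walk executes 15 instructions (five subs, five bmi, four b, one bx lr) and 6 taken branches, so this model gives 15 cycles with no penalty and 27 cycles with 2 bubbles per taken branch. Neither number is a prediction for real hardware; it only shows how quickly the estimate moves once branches cost anything.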

The cortex-m is not the same as a microchip pic chip (or z80 and some others); you cannot create a predictable delay this way with this instruction set. You can ensure the delay will be AT that amount of time (clocks) OR SLOWER, but not exactly at it.

0000009c <hello>:
  9c:   3801        subs    r0, #1
  9e:   d1fd        bne.n   9c <hello>

Your loop has a branch decision in there, more instructions and more paths basically, so the opportunity for execution time to vary gets worse.

00000090 <delay>:
  90:   3801        subs    r0, #1
  92:   d400        bmi.n   96 <end_delay>
  94:   e7fc        b.n 90 <delay>

00000096 <end_delay>:

So let's focus on these three instructions.

Some cortex-ms have a build-time option (of the logic) of fetching per instruction or per word. The cortex-m4 documentation says:

All fetches are word-wide.

So we hope that halfword alignment won't affect performance. With these instructions we don't necessarily expect to see a difference anyway. With a full sized arm the fetches are multiple words, so you will definitely see fetch line (size) effects.

The execution depends heavily on the implementation. The cortex-m is just the arm core; the rest of the chip is from the chip vendor, purchased IP or built in house or a combination (very likely the latter). ARM does not make chips (other than perhaps for validation), they make IP that they sell.

The chip vendor determines the flash (and ram) implementation. Often with these types of chips the flash speed is at or slower than the cpu speed, meaning it can take two clocks to fetch one instruction, which means you never feed the cpu as fast as it can go. Some vendors, like ST, put in a cache that you cannot (so far as I know) turn off, so it is hard to see this effect (but still possible). The documentation for the particular chip I am using for this says:

8.2.3.1 Prefetch Buffer
The Flash memory controller has a prefetch buffer that is automatically used when the CPU frequency is greater than 40 MHz. In this mode, the Flash memory operates at half of the system clock. The prefetch buffer fetches two 32-bit words per clock allowing instructions to be fetched with no wait states while code is executing linearly. The fetch buffer includes a branch speculation mechanism that recognizes a branch and avoids extra wait states by not reading the next word pair. Also, short loop branches often stay in the buffer. As a result, some branches can be executed with no wait states. Other branches incur a single wait state.

And of course, like ST, they don't really tell you the whole story. So we just go in and try this. You can use debug timers if you want, but the systick runs off the same clock and gives you the same result.

00000086 <test>:
  86:   f3bf 8f4f   dsb sy
  8a:   f3bf 8f6f   isb sy
  8e:   680a        ldr r2, [r1, #0]

00000090 <delay>:
  90:   3801        subs    r0, #1
  92:   d400        bmi.n   96 <end_delay>
  94:   e7fc        b.n 90 <delay>

00000096 <end_delay>:
  96:   680b        ldr r3, [r1, #0]
  98:   1ad0        subs    r0, r2, r3
  9a:   4770        bx  lr
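The two ldr instructions sample the timer before and after the loop, and the final subs computes start minus end, which is the right way round because systick counts down. A minimal C sketch of that arithmetic follows. The function and macro names are made up for illustration, the mask reflects the 24-bit width of the systick counter, and the result is only meaningful if the counter reloads from its maximum value and wraps at most once during the measurement.

```c
#include <stdint.h>

/* SysTick is a 24-bit down counter, so only the low 24 bits are valid. */
#define SYSTICK_MASK 0x00FFFFFFu

/* Elapsed ticks between two samples of a down counter.
   start is sampled first, end is sampled second. */
uint32_t systick_elapsed(uint32_t start, uint32_t end)
{
    return (start - end) & SYSTICK_MASK;
}
```

For example, `systick_elapsed(0x00FFFFFF, 0x00FFFFEA)` gives 0x15, and the mask keeps the answer correct when the counter rolls over between the two samples, as in `systick_elapsed(5, 0x00FFFFF0)`, which is also 0x15.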

So I read the CCR and CPUID

00000200 CCR
410FC241 CPUID

just because. Then I ran the code under test three times

00000015
00000015
00000015

These numbers are in hex, so that is 21 clocks. Same execution time each time, so no cache or branch prediction cache effects. I didn't see anything related to branch prediction for the cortex-m4; other cortex-ms do have branch prediction (maybe only the m7). I have the I and D cache off; they will of course, along with alignment, greatly affect the execution time (and that time can/will vary as your application runs).

I changed the alignment (add or remove nops in front of this code)

0000008a <delay>:
  8a:   3801        subs    r0, #1
  8c:   d400        bmi.n   90 <end_delay>
  8e:   e7fc        b.n 8a <delay>

and it didn't affect the execution time.
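The same three halfwords (3801, d400, e7fc) can be moved to a different address without reassembling because both branches are PC-relative. As a sketch, the two 16-bit Thumb branch encodings used here can be decoded in C like this; the function names are invented for illustration, and only these two encodings are handled.

```c
#include <stdint.h>

/* Target of a 16-bit Thumb conditional branch (1101 cond imm8).
   The offset is imm8 sign-extended, times 2, relative to addr + 4. */
uint32_t thumb_bcond_target(uint32_t addr, uint16_t insn)
{
    int32_t imm = insn & 0xFF;
    if (imm & 0x80)
        imm -= 0x100;               /* sign-extend 8 bits */
    return addr + 4 + (uint32_t)(imm * 2);
}

/* Target of a 16-bit Thumb unconditional branch (11100 imm11).
   The offset is imm11 sign-extended, times 2, relative to addr + 4. */
uint32_t thumb_b_target(uint32_t addr, uint16_t insn)
{
    int32_t imm = insn & 0x7FF;
    if (imm & 0x400)
        imm -= 0x800;               /* sign-extend 11 bits */
    return addr + 4 + (uint32_t)(imm * 2);
}
```

Decoding d400 at 0x92 gives 0x96 and e7fc at 0x94 gives 0x90, matching the first listing; at the shifted addresses 0x8c and 0x8e the same encodings give 0x90 and 0x8a, matching the second.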

AFAIK with this processor we cannot change the flash wait state settings directly; it is automatic based on clock settings. So running at a different clock speed, above the 40MHz mark, I get

0000001E                                                                                         
0000001E                                                                                         
0000001E 

For the same machine code and the same alignment, 30 clocks now instead of 21.

Normally the ram is faster and has no wait states (understand that these busses take several clocks per transaction, so it is not like the old days, but there is still a delay you can detect), so running these instructions in ram should tell us something

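for(rb=0;rb<0x20;rb+=2)             /* slide the test code through 16 halfword offsets */
{

    hexstrings(rb);                 /* first output column: offset */
    ra=0x20001000+rb;
    PUT16(ra,0x680a); ra+=2;        /* ldr r2,[r1]    sample timer (start) */
    hexstrings(ra);                 /* second output column: address of the loop */
    PUT16(ra,0x3801); ra+=2;        /* subs r0,#1 */
    PUT16(ra,0xd400); ra+=2;        /* bmi end_delay */
    PUT16(ra,0xe7fc); ra+=2;        /* b delay */
    PUT16(ra,0x680b); ra+=2;        /* ldr r3,[r1]    sample timer (end) */
    PUT16(ra,0x1ad0); ra+=2;        /* subs r0,r2,r3  elapsed = start - end */
    PUT16(ra,0x4770); ra+=2;        /* bx lr */

    PUT16(ra,0x46c0); ra+=2;        /* nop padding after the test code */
    PUT16(ra,0x46c0); ra+=2;
    PUT16(ra,0x46c0); ra+=2;
    PUT16(ra,0x46c0); ra+=2;
    PUT16(ra,0x46c0); ra+=2;
    PUT16(ra,0x46c0); ra+=2;
    /* run the copy with r0=4 and r1=STCURRENT; the low address bit selects thumb */
    hexstring(BRANCHTO(4,STCURRENT,0x20001001+rb)&STMASK);
}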

and that certainly gets interesting...

00000000 20001002 00000026                                                                       
00000002 20001004 00000020                                                                       
00000004 20001006 00000026                                                                       
00000006 20001008 00000020                                                                       
00000008 2000100A 00000026                                                                       
0000000A 2000100C 00000020                                                                       
0000000C 2000100E 00000026                                                                       
0000000E 20001010 00000020                                                                       
00000010 20001012 00000026                                                                       
00000012 20001014 00000020                                                                       
00000014 20001016 00000026                                                                       
00000016 20001018 00000020                                                                       
00000018 2000101A 00000026                                                                       
0000001A 2000101C 00000020                                                                       
0000001C 2000101E 00000026                                                                       
0000001E 20001020 00000020 

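First off, it is 32 or 38 clocks. Second, there is an alignment effect: in this run the loop takes 38 clocks when it starts at an address that is only halfword aligned (ending in 2, 6, A, E) and 32 clocks when it starts at a word aligned address.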

The armv7-m CCR shows a branch prediction bit, but the trm and the vendor documentation don't show it, so it could be a generic thing that not all cores support.

So for a specific cortex-m4 chip the time to execute your loop is between 21 and 38 clocks, and I could probably make it slower if I wanted to. I don't think I could get it down to 11 on this chip though.

If you are, for example, doing i2c bit banging, you can use something like this for a delay; it won't be optimal but it will work just fine. If you need something more precise, within a window of time (at least this but not greater than that), then use a timer (and understand that, polled or interrupt driven, your accuracy will have some error). If the timer peripheral or another peripheral can generate the signal you want, you can then get down to a clock accurate waveform (if that is what your delay is for).

Another cortex-m4 is expected to have different results. I would expect an stm32 to have the sram be the same as or faster than flash, not slower as in this case. And there are settings that your init code (or someone else's, if you are relying on someone else to set up your chip) can mess with, that can/will affect execution time.

EDIT

I don't know where I got the idea this was for a cortex-m4, which is an armv7-m. I didn't have a raspberry pi 2 handy, but had a pi3, so I ran this in aarch32 mode with 32 bit instructions. I had no idea how much work it would be to get the timers running and then the cache enabled. The pi runs out of dram, which is very inconsistent even with bare metal. So I figured I would enable the l1 cache, and after the first run it should all be in cache and consistent. Now that I think about it, there are four cores and each is running; I don't know how to disable them, so the other three are spinning in a loop waiting for a mailbox register to tell them what code to run. Perhaps I need to have them branch somewhere and run out of l1 cache as well...not sure if the l1 is per core or shared, I think I looked that up at one point.

Anyway, the code under test (the mrc instructions read the cycle counter)

000080c8 <COUNTER>:
    80c8:   ee192f1d    mrc 15, 0, r2, cr9, cr13, {0}

000080cc <delay>:
    80cc:   e2500001    subs    r0, r0, #1
    80d0:   4a000000    bmi 80d8 <end_delay>
    80d4:   eafffffc    b   80cc <delay>

000080d8 <end_delay>:
    80d8:   ee193f1d    mrc 15, 0, r3, cr9, cr13, {0}
    80dc:   e0430002    sub r0, r3, r2
    80e0:   e12fff1e    bx  lr

And the punch line for that alignment: the first column is the r0 passed in, the next three are three runs, and the last column, if present, is the delta from the prior row to the current one (the cost of one extra count in r0).

00000000 0000000A 0000000A 0000000A 
00000001 00000014 00000014 00000014 0000000A 
00000002 0000001E 0000001E 0000001E 0000000A 
00000003 00000028 00000028 00000028 0000000A 
00000004 00000032 00000032 00000032 0000000A 
00000005 0000003C 0000003C 0000003C 0000000A 
00000006 00000046 00000046 00000046 0000000A 
00000007 00000050 00000050 00000050 0000000A 
00000008 0000005A 0000005A 0000005A 0000000A 
00000009 00000064 00000064 00000064 0000000A 
0000000A 0000006E 0000006E 0000006E 0000000A 
0000000B 00000078 00000078 00000078 0000000A 
0000000C 00000082 00000082 00000082 0000000A 
0000000D 0000008C 0000008C 0000008C 0000000A 
0000000E 00000096 00000096 00000096 0000000A 
0000000F 000000A0 000000A0 000000A0 0000000A 
00000010 000000AA 000000AA 000000AA 0000000A 
00000011 000000B4 000000B4 000000B4 0000000A 
00000012 000000BE 000000BE 000000BE 0000000A 
00000013 000000C8 000000C8 000000C8 0000000A 

Then, to make alignment checking easier (which I didn't need to do in the end), I had it try different alignments for the above code (address in the first column) and show the result for an r0 of four.

00010000 00000032
00010004 0000002D
00010008 00000032
0001000C 0000002D
00010000 00000032
00010004 0000002D
00010008 00000032
0001000C 0000002D

This repeats up to address 0x101FC.

If I change the alignment in the compiled test

000080cc <COUNTER>:
    80cc:   ee192f1d    mrc 15, 0, r2, cr9, cr13, {0}

000080d0 <delay>:
    80d0:   e2500001    subs    r0, r0, #1
    80d4:   4a000000    bmi 80dc <end_delay>
    80d8:   eafffffc    b   80d0 <delay>

000080dc <end_delay>:
    80dc:   ee193f1d    mrc 15, 0, r3, cr9, cr13, {0}
    80e0:   e0430002    sub r0, r3, r2
    80e4:   e12fff1e    bx  lr

then it is a wee bit faster.

00000000 00000009 00000009 00000009 
00000001 00000012 00000012 00000012 00000009 
00000002 0000001B 0000001B 0000001B 00000009 
00000003 00000024 00000024 00000024 00000009 
00000004 0000002D 0000002D 0000002D 00000009 
00000005 00000036 00000036 00000036 00000009 
00000006 0000003F 0000003F 0000003F 00000009 
00000007 00000048 00000048 00000048 00000009 
00000008 00000051 00000051 00000051 00000009 
00000009 0000005A 0000005A 0000005A 00000009 
0000000A 00000063 00000063 00000063 00000009 
0000000B 0000006C 0000006C 0000006C 00000009 
0000000C 00000075 00000075 00000075 00000009 
0000000D 0000007E 0000007E 0000007E 00000009 
0000000E 00000087 00000087 00000087 00000009 
0000000F 00000090 00000090 00000090 00000009 
00000010 00000099 00000099 00000099 00000009 
00000011 000000A2 000000A2 000000A2 00000009 
00000012 000000AB 000000AB 000000AB 00000009 
00000013 000000B4 000000B4 000000B4 00000009 

If I change it to be a function call

000080cc <COUNTER>:
    80cc:   e92d4001    push    {r0, lr}
    80d0:   ee192f1d    mrc 15, 0, r2, cr9, cr13, {0}
    80d4:   eb000003    bl  80e8 <delay>
    80d8:   ee193f1d    mrc 15, 0, r3, cr9, cr13, {0}
    80dc:   e8bd4001    pop {r0, lr}
    80e0:   e0430002    sub r0, r3, r2
    80e4:   e12fff1e    bx  lr

000080e8 <delay>:
    80e8:   e2500001    subs    r0, r0, #1
    80ec:   4a000000    bmi 80f4 <end_delay>
    80f0:   eafffffc    b   80e8 <delay>

000080f4 <end_delay>:
    80f4:   e12fff1e    bx  lr

00000000 0000001A 0000001A 0000001A 
00000001 00000023 00000023 00000023 00000009 
00000002 0000002C 0000002C 0000002C 00000009 
00000003 00000035 00000035 00000035 00000009 
00000004 0000003E 0000003E 0000003E 00000009 
00000005 00000047 00000047 00000047 00000009 
00000006 00000050 00000050 00000050 00000009 
00000007 00000059 00000059 00000059 00000009 
00000008 00000062 00000062 00000062 00000009 
00000009 0000006B 0000006B 0000006B 00000009 
0000000A 00000074 00000074 00000074 00000009 
0000000B 0000007D 0000007D 0000007D 00000009 
0000000C 00000086 00000086 00000086 00000009 
0000000D 0000008F 0000008F 0000008F 00000009 
0000000E 00000098 00000098 00000098 00000009 
0000000F 000000A1 000000A1 000000A1 00000009 
00000010 000000AA 000000AA 000000AA 00000009 
00000011 000000B3 000000B3 000000B3 00000009 
00000012 000000BC 000000BC 000000BC 00000009 
00000013 000000C5 000000C5 000000C5 00000009 

The cost per count is the same, but the call overhead is more expensive.

This allows me to use thumb mode just for fun. To avoid the mode change code the linker added, I did the switch by hand, which made it a little faster (and consistent).

000080cc <COUNTER>:
    80cc:   e92d4001    push    {r0, lr}
    80d0:   e59f103c    ldr r1, [pc, #60]   ; 8114 <edel+0x2>
    80d4:   e59fe03c    ldr lr, [pc, #60]   ; 8118 <edel+0x6>
    80d8:   ee192f1d    mrc 15, 0, r2, cr9, cr13, {0}
    80dc:   e12fff11    bx  r1

000080e0 <here>:
    80e0:   ee193f1d    mrc 15, 0, r3, cr9, cr13, {0}
    80e4:   e8bd4001    pop {r0, lr}
    80e8:   e0430002    sub r0, r3, r2
    80ec:   e12fff1e    bx  lr

000080f0 <delay>:
    80f0:   e2500001    subs    r0, r0, #1
    80f4:   4a000000    bmi 80fc <end_delay>
    80f8:   eafffffc    b   80f0 <delay>

000080fc <end_delay>:
    80fc:   e12fff1e    bx  lr
    8100:   e1a00000    nop         ; (mov r0, r0)
    8104:   e1a00000    nop         ; (mov r0, r0)
    8108:   e1a00000    nop         ; (mov r0, r0)

0000810c <del>:
    810c:   3801        subs    r0, #1
    810e:   d400        bmi.n   8112 <edel>
    8110:   e7fc        b.n 810c <del>

00008112 <edel>:
    8112:   4770        bx  lr

00000000 000000F4 0000001B 0000001B 
00000001 00000024 00000024 00000024 00000009 
00000002 0000002D 0000002D 0000002D 00000009 
00000003 00000036 00000036 00000036 00000009 
00000004 0000003F 0000003F 0000003F 00000009 
00000005 00000048 00000048 00000048 00000009 
00000006 00000051 00000051 00000051 00000009 
00000007 0000005A 0000005A 0000005A 00000009 
00000008 00000063 00000063 00000063 00000009 
00000009 0000006C 0000006C 0000006C 00000009 
0000000A 00000075 00000075 00000075 00000009 
0000000B 0000007E 0000007E 0000007E 00000009 
0000000C 00000087 00000087 00000087 00000009 
0000000D 00000090 00000090 00000090 00000009 
0000000E 00000099 00000099 00000099 00000009 
0000000F 000000A2 000000A2 000000A2 00000009 
00000010 000000AB 000000AB 000000AB 00000009 
00000011 000000B4 000000B4 000000B4 00000009 
00000012 000000BD 000000BD 000000BD 00000009 
00000013 000000C6 000000C6 000000C6 00000009

With this alignment

0000810e <del>:
    810e:   3801        subs    r0, #1
    8110:   d400        bmi.n   8114 <edel>
    8112:   e7fc        b.n 810e <del>

00008114 <edel>:
    8114:   4770        bx  lr


00000000 0000007E 0000001C 0000001C 
00000001 00000026 00000026 00000026 0000000A 
00000002 00000030 00000030 00000030 0000000A 
00000003 0000003A 0000003A 0000003A 0000000A 
00000004 00000044 00000044 00000044 0000000A 
00000005 0000004E 0000004E 0000004E 0000000A 
00000006 00000058 00000058 00000058 0000000A 
00000007 00000062 00000062 00000062 0000000A 
00000008 0000006C 0000006C 0000006C 0000000A 
00000009 00000076 00000076 00000076 0000000A 
0000000A 00000080 00000080 00000080 0000000A 
0000000B 0000008A 0000008A 0000008A 0000000A 
0000000C 00000094 00000094 00000094 0000000A 
0000000D 0000009E 0000009E 0000009E 0000000A 
0000000E 000000A8 000000A8 000000A8 0000000A 
0000000F 000000B2 000000B2 000000B2 0000000A 
00000010 000000BC 000000BC 000000BC 0000000A 
00000011 000000C6 000000C6 000000C6 0000000A 
00000012 000000D0 000000D0 000000D0 0000000A 
00000013 000000DA 000000DA 000000DA 0000000A 

So in some ideal world on this processor, assuming a cache hit on the delay code

00000004 00000032 00000032 00000032 0000000A 
00000004 0000002D 0000002D 0000002D 00000009 
00000004 0000003E 0000003E 0000003E 00000009 
00000004 0000003F 0000003F 0000003F 00000009 
00000004 00000044 00000044 00000044 0000000A 

it takes between 0x2D and 0x44 clocks to run that loop with r0 = 4.

Realistically, this is what you might see on this platform without the cache enabled, and/or if you get a cache miss.
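The cached tables each fit a straight line: a fixed overhead (the r0 = 0 row) plus a per-count cost (the delta column). A minimal sketch in C follows. The `variant` struct, its field names and the labels are invented for illustration; the numbers are copied from the measurements above, using the steady-state second and third runs for the r0 = 0 row of the thumb variants.

```c
#include <stdint.h>

typedef struct {
    const char *name;
    uint32_t base;       /* count measured for r0 = 0 (steady state) */
    uint32_t per_count;  /* delta column: cost of one extra count in r0 */
} variant;

/* Values copied from the cached pi3 runs; they describe this one
   setup only and are not expected to carry over to other chips. */
static const variant variants[] = {
    {"arm, inline, delay at 0x80cc", 0x0A, 0x0A},
    {"arm, inline, delay at 0x80d0", 0x09, 0x09},
    {"arm, function call",           0x1A, 0x09},
    {"thumb, del at 0x810c",         0x1B, 0x09},
    {"thumb, del at 0x810e",         0x1C, 0x0A},
};

/* Straight-line fit: fixed overhead plus per-count cost. */
uint32_t fitted_count(const variant *v, uint32_t r0)
{
    return v->base + v->per_count * r0;
}
```

For r0 = 4 the five variants give 0x32, 0x2D, 0x3E, 0x3F and 0x44, which are the five rows listed above. The fit says nothing about the uncached case, where the counts are not even repeatable.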

00000000 0000030B 000002B7 000002ED 
00000001 0000035B 00000389 000003E9 
00000002 000003FB 00000439 0000041B 
00000003 0000058F 000004E7 0000055B 
00000004 000005FF 0000069D 000006D1 
00000005 00000745 00000733 000006F7 
00000006 00000883 00000817 00000801 
00000007 00000873 00000853 0000089B 
00000008 00000923 00000B05 0000092F 
00000009 00000A3F 000009A9 00000B4D 
0000000A 00000B79 00000BA9 00000C57 
0000000B 00000C21 00000D13 00000B51 
0000000C 00000C0B 00000E91 00000DE9 
0000000D 00000D97 00000E0D 00000E81 
0000000E 00000E5B 0000100B 00000F25 
0000000F 00001097 00001095 00000F37 
00000010 000010DB 000010FD 0000118B 
00000011 00001071 0000114D 0000123F 
00000012 000012CF 0000126D 000011DB 
00000013 0000140D 0000143D 0000141B 
000002B7 0000143D 

The r0 = 4 line

00000004 000005FF 0000069D 000006D1 

That's a lot of cpu counts...

Hopefully I have put this topic to bed. While it is interesting to try to guess how fast code runs or how many counts it takes, it is not that simple on these types of processors: pipelines, caches, branch prediction, complicated system busses, and a common-ish core used in various chip implementations where the chip vendor manages the memory/flash separately from the processor IP vendor's code.

I didn't mess with branch prediction on this second experiment. Had I done that, alignment would not be so consistent. Depending on how branch prediction is implemented, its usefulness can vary based on where the branch is relative to the fetch line: whether the next fetch has started or not, or is a certain way through, when the branch predictor determines it doesn't need to do that fetch and/or starts the branched fetch. In this case the branch destination is only two instructions ahead, so you might not see it with this code; you would want some nops sprinkled in between so that the bmi destination is in a separate fetch line (in order to see the difference).

And this is the easy stuff to manipulate: using the same machine code sequences and seeing them vary in execution time by, what did we see, between 0x3F and 0x6D1. That is over a 27x difference between fastest and slowest...for the same machine code. Changing the alignment of the code by one instruction (somewhere else in unrelated code there is one more or one fewer instruction than in a prior build) made a 5 count difference.

To be fair, the mrc at the end of the test was probably part of the time

000080c8 <COUNTER>:
    80c8:   ee192f1d    mrc 15, 0, r2, cr9, cr13, {0}
    80cc:   ee193f1d    mrc 15, 0, r3, cr9, cr13, {0}
    80d0:   e0430002    sub r0, r3, r2
    80d4:   e12fff1e    bx  lr

resulted in a count of 1 with either alignment. That doesn't guarantee that there was only one count of error in the measurement, but it likely wasn't a dozen.

Anyway, I hope this helps your understanding.
