
Profiling memcpy performance on Cortex-M7 (stm32f7)

SHORT VERSION: The performance of the memcpy that gets pulled in from the GNU ARM toolchain seems to vary wildly on ARM Cortex-M7 for different copy sizes, even though the code that copies the data is always the same. What could be the cause of this?

LONG VERSION:

I am part of a team developing for the stm32f765 microcontroller with the GNU Arm toolchain 11.2, linking the newlib-nano implementation of the standard library into our code.

Recently, memcpy performance became a bottleneck in our project, and we discovered that the memcpy implementation that gets pulled into our code from newlib-nano is a simple byte-wise copy, which in hindsight should not have been surprising given that the newlib-nano library is code-size optimized (compiled with -Os).

Looking at the source code of cygwin-newlib, I managed to track down the exact memcpy implementation that gets compiled and packaged with the nano library for ARMv7-M:

void *
__inhibit_loop_to_libcall
memcpy (void *__restrict dst0,
	const void *__restrict src0,
	size_t len0)
{
#if defined(PREFER_SIZE_OVER_SPEED) || defined(__OPTIMIZE_SIZE__)
  char *dst = (char *) dst0;
  char *src = (char *) src0;

  void *save = dst0;

  while (len0--)
    {
      *dst++ = *src++;
    }

  return save;
#else
  (...)
#endif
}

We decided to replace the newlib-nano memcpy in our code with our own memcpy implementation, while sticking with newlib-nano for other reasons. In the process, we decided to gather some performance metrics to compare the new implementation with the old one.

However, making sense of the obtained metrics proved to be a challenge for me.

Measurement results: [table: cycle counts obtained by profiling the different memcpy implementations on ARM Cortex-M7]

All the results in the table are cycle counts, obtained by reading DWT->CYCCNT values (more info on the actual measurement setup is given below).

In the table, 3 different memcpy implementations are compared. The first is the default one linked in from the newlib-nano library, as the label memcpy_nano suggests. The second and third are the most naive, dumbest data-copy implementations in C: one copies data byte by byte, the other word by word:

void *
memcpy_naive_bytewise(void *restrict dest, void *restrict src, size_t size)
{
    uint8_t *restrict u8_src = src,
            *restrict u8_dest = dest;

    for (size_t idx = 0; idx < size; idx++) {
        *u8_dest++ = *u8_src++;
    }

    return dest;
}
void *
memcpy_naive_wordwise(void *restrict dest, void *restrict src, size_t size)
{
    uintptr_t upt_dest = (uintptr_t)dest;

    uint8_t *restrict u8_dest = dest,
            *restrict u8_src  = src;

    while (upt_dest++ & ALIGN_MASK) { /* copy bytes until dest is word-aligned */
        *u8_dest++ = *u8_src++;
        size--;
    }

    word *restrict word_dest = (void *)u8_dest,
         *restrict word_src  = (void *)u8_src;

    while (size >= sizeof *word_dest) {
        *word_dest++ = *word_src++;
        size -= sizeof *word_dest;
    }

    u8_dest = (void *)word_dest;
    u8_src  = (void *)word_src;

    while (size--) {
        *u8_dest++ = *u8_src++;
    }

    return dest;
}
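Before timing the word-wise routine, it is worth sanity-checking it on a host machine, since the alignment prologue is easy to get wrong. Below is a self-contained version of the same idea, assuming `word` is `uint32_t` and `ALIGN_MASK` is `sizeof(word) - 1` (my guesses at the project's definitions, not confirmed by the post):

```c
#include <stdint.h>
#include <stddef.h>

typedef uint32_t word;                     /* assumed machine word type */
#define ALIGN_MASK (sizeof(word) - 1)      /* assumed alignment mask    */

void *memcpy_naive_wordwise(void *restrict dest, const void *restrict src,
                            size_t size)
{
    uintptr_t upt_dest = (uintptr_t)dest;
    uint8_t *restrict u8_dest = dest;
    const uint8_t *restrict u8_src = src;

    /* Copy bytes until dest is word-aligned (stop early if size runs out). */
    while (size && (upt_dest++ & ALIGN_MASK)) {
        *u8_dest++ = *u8_src++;
        size--;
    }

    word *restrict word_dest = (void *)u8_dest;
    const word *restrict word_src = (const void *)u8_src;

    /* Bulk of the copy, one word at a time. */
    while (size >= sizeof *word_dest) {
        *word_dest++ = *word_src++;
        size -= sizeof *word_dest;
    }

    u8_dest = (void *)word_dest;
    u8_src  = (const void *)word_src;

    /* Trailing bytes. */
    while (size--)
        *u8_dest++ = *u8_src++;

    return dest;
}
```

A quick loop over every destination misalignment and a range of sizes, compared against memcmp, catches prologue off-by-ones before any cycle counting happens.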

For the life of me, I cannot figure out why the performance of memcpy_nano resembles that of the naive word-by-word copy implementation at first (up to 256-byte copies), only to start resembling the performance of the naive byte-by-byte copy implementation from 256-byte copies upwards.

I have triple-checked that the expected memcpy implementation is indeed linked into my code for every copy size measured. For example, this is the memcpy disassembly obtained for the code measuring the performance of the 16-byte copy vs. the 256-byte copy (where the discrepancy first arises):

  • memcpy definition linked for the 16 byte-sized copy (newlib-nano memcpy):
08007a74 <memcpy>:
 8007a74:   440a        add r2, r1
 8007a76:   4291        cmp r1, r2
 8007a78:   f100 33ff   add.w   r3, r0, #4294967295
 8007a7c:   d100        bne.n   8007a80 <memcpy+0xc>
 8007a7e:   4770        bx  lr
 8007a80:   b510        push    {r4, lr}
 8007a82:   f811 4b01   ldrb.w  r4, [r1], #1
 8007a86:   f803 4f01   strb.w  r4, [r3, #1]!
 8007a8a:   4291        cmp r1, r2
 8007a8c:   d1f9        bne.n   8007a82 <memcpy+0xe>
 8007a8e:   bd10        pop {r4, pc}
  • memcpy definition linked for the 256 byte-sized copy (newlib-nano memcpy):
08007a88 <memcpy>:
 8007a88:   440a        add r2, r1
 8007a8a:   4291        cmp r1, r2
 8007a8c:   f100 33ff   add.w   r3, r0, #4294967295
 8007a90:   d100        bne.n   8007a94 <memcpy+0xc>
 8007a92:   4770        bx  lr
 8007a94:   b510        push    {r4, lr}
 8007a96:   f811 4b01   ldrb.w  r4, [r1], #1
 8007a9a:   f803 4f01   strb.w  r4, [r3, #1]!
 8007a9e:   4291        cmp r1, r2
 8007aa0:   d1f9        bne.n   8007a96 <memcpy+0xe>
 8007aa2:   bd10        pop {r4, pc}

As you can see, apart from the difference in the function's address, there is no change in the actual copy logic.

Measurement setup:

  • Ensure data and instruction caches are disabled, IRQs disabled, and the DWT enabled:
SCB->CSSELR = (0UL << 1) | 0UL;         // Level 1 data cache
    __DSB();

    SCB->CCR &= ~(uint32_t)SCB_CCR_DC_Msk;  // disable D-Cache
    __DSB();
    __ISB();

    SCB_DisableICache();

    if(DWT->CTRL & DWT_CTRL_NOCYCCNT_Msk)
    {
        //panic
        while(1);
    }

    /* Enable DWT unit */
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;
    __DSB();

    /* Unlock DWT registers */
    DWT->LAR = 0xC5ACCE55;
    __DSB();

    /* Reset CYCCNT */
    DWT->CYCCNT = 0;

    /* Enable CYCCNT */
    DWT->CTRL |= DWT_CTRL_CYCCNTENA_Msk;

    __disable_irq();

    __DSB();
    __ISB();
  • Link a single memcpy version under test into the code, together with a single byte-size step. Compile the code with -O0. Then measure the execution time as follows (note: the addresses of au8_dst and au8_src are always aligned):
uint8_t volatile au8_dst[MAX_BYTE_SIZE];
uint8_t volatile au8_src[MAX_BYTE_SIZE];

    __DSB();
    __ISB();

    u32_cyccntStart = DWT->CYCCNT;

    __DSB();
    __ISB();

    memcpy(au8_dst, au8_src, u32_size);

    __DSB();
    __ISB();

    u32_cyccntEnd = DWT->CYCCNT;

    __DSB();
    __ISB();

    *u32_cyccnt = u32_cyccntEnd - u32_cyccntStart;
  • Repeat this procedure for every combination of byte size and memcpy version

Main question: How is it possible for the execution time of the newlib-nano memcpy to follow that of a naive word-wise copy implementation up to a size of 256 bytes, after which it performs like a naive byte-wise copy? Keep in mind that the definition of the newlib-nano memcpy that gets pulled into the code is the same for every byte-size measurement, as demonstrated by the disassembly above. Is my measurement setup flawed in some obvious way that I have failed to recognize?

Any thoughts on this would be highly, highly appreciated!

As mentioned in the comments, it may be alignment that you need to take into account in your performance tests. It can be the case that one memcpy solution versus another is hitting these fetch lines, as I call them.

An stm32 cortex-m7 part.一个 stm32 cortex-m7 部分。

Code under test:

/* r0 count */
/* r1 timer address */
.thumb_func
.globl TEST
TEST:
    push {r4,r5}
    ldr r4,[r1]

loop:
    sub r0,#1
    bne loop

    ldr r5,[r1]
    sub r0,r4,r5
    pop {r4,r5}
    bx lr

Original alignment

08000100 <TEST>:
 8000100:   b430        push    {r4, r5}
 8000102:   680c        ldr r4, [r1, #0]

08000104 <loop>:
 8000104:   3801        subs    r0, #1
 8000106:   d1fd        bne.n   8000104 <loop>
 8000108:   680d        ldr r5, [r1, #0]
 800010a:   1b60        subs    r0, r4, r5
 800010c:   bc30        pop {r4, r5}
 800010e:   4770        bx  lr

The systick timer is used; there is no reason to use the debug timer, it adds no value.

ra=TEST(0x1000,STK_CVR);  hexstring(ra%0x00FFFFFF);
ra=TEST(0x1000,STK_CVR);  hexstring(ra%0x00FFFFFF);
ra=TEST(0x1000,STK_CVR);  hexstring(ra%0x00FFFFFF);
ra=TEST(0x1000,STK_CVR);  hexstring(ra%0x00FFFFFF);

First run

00001029 
00001006 
00001006 
00001006 

This is an stm32, so there is a flash cache that you cannot disable; you can see its effect above in the first run.

The loop is aligned as such

 8000104:   3801        subs    r0, #1
 8000106:   d1fd        bne.n   8000104 <loop>

Add a nop to move the loop by a half word

08000100 <TEST>:
 8000100:   46c0        nop         ; (mov r8, r8)
 8000102:   b430        push    {r4, r5}
 8000104:   680c        ldr r4, [r1, #0]

08000106 <loop>:
 8000106:   3801        subs    r0, #1
 8000108:   d1fd        bne.n   8000106 <loop>
 800010a:   680d        ldr r5, [r1, #0]
 800010c:   1b60        subs    r0, r4, r5
 800010e:   bc30        pop {r4, r5}
 8000110:   4770        bx  lr

The whole test is the same machine code from the first timer read to the second.

But the performance is dramatically different

00002013 
00002003 
00002003 
00002003 

Taking twice as long to execute.

If, as documented, the fetch is 64 bits, that is 4 instructions per fetch.
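That fetch-line arithmetic can be sketched directly. Here is a small helper (my own toy model of the documented 64-bit fetch, not code from the chip docs) that counts how many 8-byte fetch lines a span of code touches:

```c
#include <stdint.h>

/* Toy model: number of 64-bit (8-byte) fetch lines that the byte range
   [addr, addr + bytes) touches. The timed loop above is two Thumb
   halfwords, i.e. 4 bytes. */
static unsigned fetch_lines(uint32_t addr, uint32_t bytes)
{
    uint32_t first = addr / 8u;
    uint32_t last  = (addr + bytes - 1u) / 8u;
    return (unsigned)(last - first + 1u);
}
```

With the loop at 0x8000104, the 4-byte body sits in one fetch line (matching the ~0x1006 runs); moved to 0x8000106 it straddles two lines (matching the ~0x2003 runs).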

If I add one nop per test

00001028 
00001006 
00001006 
00001006 

00001027 
00001006 
00001006 
00001006 

00001026 
00001006 
00001006 
00001006 

I get three more that return 0x1000 and then...

08000100 <TEST>:
 8000100:   46c0        nop         ; (mov r8, r8)
 8000102:   46c0        nop         ; (mov r8, r8)
 8000104:   46c0        nop         ; (mov r8, r8)
 8000106:   46c0        nop         ; (mov r8, r8)
 8000108:   46c0        nop         ; (mov r8, r8)
 800010a:   b430        push    {r4, r5}
 800010c:   680c        ldr r4, [r1, #0]

0800010e <loop>:
 800010e:   3801        subs    r0, #1
 8000110:   d1fd        bne.n   800010e <loop>
 8000112:   680d        ldr r5, [r1, #0]
 8000114:   1b60        subs    r0, r4, r5
 8000116:   bc30        pop {r4, r5}
 8000118:   4770        bx  lr
 
00002010 
00002001 
00002001 
00002001 

You can run this in sram to avoid the cache, and do other things, but I expect you will see the same effect as you hit boundaries that add an extra fetch to the loop. Clearly this is the best case, with one fetch for the whole loop and then sometimes two. Make the loop longer and it becomes N and then N+1 fetches, with a less severe ratio.

I also assume the systick here is the arm clock divided by two, which is perfectly fine for this kind of performance testing.

So it is quite possible that, due to the alignment of the two different functions, one may be taking a performance hit from extra fetches and the other not.

What I tend to do, as I did here, is turn the code under test into asm and put it in the bootstrap near the front of the binary, so that any other code I add or remove does not affect its alignment. I can also wrap the timer around it and loop in a very controlled manner, adding nops outside the timed area to move the alignment of the loops. If you have more than one loop in the code under test, you can add nops in the middle of it to control the alignment of each loop.

You will also want to play with the alignment of the data. I do not remember offhand how the cortex-ms handle unaligned accesses; if they support them, I assume it is with a performance penalty.

I demonstrated something similar against other MCUs, and it affects you here as well. Since srams (normal sram or cache memory, for that matter) are not organized as bytes but are at least 32 bits wide (wider with ecc/parity), a single byte write requires a read-modify-write, and likewise a halfword, but an aligned word write does not require the read. Often this is buried in the noise because you are not doing enough writes back to back to get back pressure from the sram control logic. But at least one MCU vendor actually documented that you could see this effect, and I posted that at some point here on SO. You should also see it with unaligned word writes, which need two read-modify-writes.
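Here is a toy cost model of that, purely illustrative (my own simplification; real sram controllers and bus fabrics vary): a full aligned word store is one write cycle, while anything that touches only part of a 32-bit sram word costs a read-modify-write per word touched.

```c
#include <stdint.h>

/* Toy model of store cost on a 32-bit-wide SRAM:
   - an aligned 32-bit word store is a single write cycle;
   - any partial-word store costs a read + write (2 cycles) for each
     SRAM word it touches. */
static unsigned sram_store_cycles(uint32_t addr, uint32_t size)
{
    uint32_t first = addr / 4u;
    uint32_t last  = (addr + size - 1u) / 4u;
    unsigned words = (unsigned)(last - first + 1u);

    if (size == 4 && (addr & 3u) == 0)
        return 1;            /* aligned word: plain write */
    return 2 * words;        /* read-modify-write per word touched */
}
```

Under this model a byte store costs 2 cycles, an aligned word store 1, and an unaligned word store straddling two sram words 4, which is the shape of the measured aligned-vs-unaligned gap below.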

Obviously four byte store instructions take more time than one word store instruction.

I will just do it, why not:

/* r0 address */
/* r1 count */
/* r2 timer address */
.thumb_func
.globl swtest
swtest:
    push {r4,r5}
    ldr r4,[r2]
    
swloop:
    str r3,[r0]
    str r3,[r0]
    str r3,[r0]
    str r3,[r0]

    str r3,[r0]
    str r3,[r0]
    str r3,[r0]
    str r3,[r0]

    str r3,[r0]
    str r3,[r0]
    str r3,[r0]
    str r3,[r0]

    str r3,[r0]
    str r3,[r0]
    str r3,[r0]
    str r3,[r0]
    sub r1,#1
    bne swloop
    
    ldr r5,[r2]
    sub r0,r4,r5
    pop {r4,r5}
    bx lr


ra=swtest(0x20002000,0x1000,STK_CVR);  hexstring(ra%0x00FFFFFF);
ra=swtest(0x20002000,0x1000,STK_CVR);  hexstring(ra%0x00FFFFFF);
ra=swtest(0x20002000,0x1000,STK_CVR);  hexstring(ra%0x00FFFFFF);
ra=swtest(0x20002000,0x1000,STK_CVR);  hexstring(ra%0x00FFFFFF);

ra=swtest(0x20002002,0x1000,STK_CVR);  hexstring(ra%0x00FFFFFF);
ra=swtest(0x20002002,0x1000,STK_CVR);  hexstring(ra%0x00FFFFFF);
ra=swtest(0x20002002,0x1000,STK_CVR);  hexstring(ra%0x00FFFFFF);
ra=swtest(0x20002002,0x1000,STK_CVR);  hexstring(ra%0x00FFFFFF);

00012012 
0001200A 
0001200A 
0001200A 
0002FFFD 
0002FFFD 
0002FFFD 
0002FFFD 

Unaligned takes more than twice as long to execute.

Unfortunately you cannot control the addresses passed to a generic memcpy, so they could be 0x1000 and 0x2001 and it is simply going to be slow. But if the point of this exercise is that you have data you need to copy often (and there is no DMA mechanism in the chip that makes it faster; remember DMA is not free, sometimes it is just a lazy approach that uses less code but runs slower — understand the architecture), and if you can guarantee word-aligned addresses and a whole number of words to copy, then write your own copy routine, do not call it memcpy, and hand-tune it.


Edit, running from SRAM

for(rd=0;rd<8;rd++)
{
    rb=0x20002000;
    for(rc=0;rc<rd;rc++)
    {
        PUT32(rb,0x46c0); rb+=2; // 46c0        nop         ; (mov r8, r8)
    }

    PUT32(rb,0xb430); rb+=2; // 800010a:    b430        push    {r4, r5}
    PUT32(rb,0x680c); rb+=2; // 800010c:    680c        ldr r4, [r1, #0]
                             //0800010e <loop>:
    PUT32(rb,0x3801); rb+=2; // 800010e:    3801        subs    r0, #1
    PUT32(rb,0xd1fd); rb+=2; // 8000110:    d1fd        bne.n   800010e <loop>
    PUT32(rb,0x680d); rb+=2; // 8000112:    680d        ldr r5, [r1, #0]
    PUT32(rb,0x1b60); rb+=2; // 8000114:    1b60        subs    r0, r4, r5
    PUT32(rb,0xbc30); rb+=2; // 8000116:    bc30        pop {r4, r5}
    PUT32(rb,0x4770); rb+=2; // 8000118:    4770        bx  lr
    PUT32(rb,0x46c0); rb+=2;
    PUT32(rb,0x46c0); rb+=2;
    PUT32(rb,0x46c0); rb+=2;
    PUT32(rb,0x46c0); rb+=2;
    PUT32(rb,0x46c0); rb+=2;
    PUT32(rb,0x46c0); rb+=2;

    ra=HOP(0x1000,STK_CVR,0x20002001);  hexstrings(rd); hexstring(ra%0x00FFFFFF);
    ra=HOP(0x1000,STK_CVR,0x20002001);  hexstrings(rd); hexstring(ra%0x00FFFFFF);
    ra=HOP(0x1000,STK_CVR,0x20002001);  hexstrings(rd); hexstring(ra%0x00FFFFFF);
    ra=HOP(0x1000,STK_CVR,0x20002001);  hexstrings(rd); hexstring(ra%0x00FFFFFF);

}


00000000 00001011 
00000000 00001006 
00000000 00001006 
00000000 00001006 
00000001 00002010 
00000001 00002003 
00000001 00002003 
00000001 00002003 
00000002 00001014 
00000002 00001006 
00000002 00001006 
00000002 00001006 
00000003 00001014 
00000003 00001006 
00000003 00001006 
00000003 00001006 
00000004 00001014 
00000004 00001006 
00000004 00001006 
00000004 00001006 
00000005 00002010 
00000005 00002001 
00000005 00002002 
00000005 00002002 
00000006 00001012 
00000006 00001006 
00000006 00001006 
00000006 00001006 
00000007 00001014 
00000007 00001006 
00000007 00001006 
00000007 00001006 

Now we are still seeing that cache-like effect. I do see that my CCR is 0x00040200 and I cannot disable it; I believe the m7 documentation says that you cannot.

Okay, the BTAC was being used, but setting bit 13 in the ACTLR changes it to static branch prediction. Now the times actually make more sense, from sram:

00000000 00004003 
00000000 00004003 
00000000 00004003 
00000000 00004003 
00000001 00005002 
00000001 00005002 
00000001 00005002 
00000001 00005002 
00000002 00004003 
00000002 00004003 
00000002 00004003 
00000002 00004003 
00000003 00004003 
00000003 00004003 
00000003 00004003 
00000003 00004003 
00000004 00004003 
00000004 00004003 
00000004 00004003 
00000004 00004003 
00000005 00005002 
00000005 00005002 
00000005 00005002 
00000005 00005002 
00000006 00004003 
00000006 00004003 
00000006 00004003 
00000006 00004003 
00000007 00004003 
00000007 00004003 
00000007 00004003 
00000007 00004003 

We do see the extra fetch line, but each run is consistent from sram.
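For reference, the ACTLR change described above can be written like this in CMSIS style. Treat it as a sketch: the bit position (13, DISBTACREAD per the Cortex-M7 TRM) and the `SCnSCB` register name should be double-checked against your device headers and core revision.

```c
/* Cortex-M7: disable BTAC reads so branches fall back to static
   prediction. ACTLR is exposed as SCnSCB->ACTLR in CMSIS; bit 13 is
   DISBTACREAD per the Cortex-M7 TRM. Assumes a CMSIS device header
   (e.g. stm32f7xx.h) is included. */
SCnSCB->ACTLR |= (1UL << 13);   /* DISBTACREAD */
__DSB();
__ISB();
```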

Flash also showed no variation from one test to another, even though I know that st has a cache in front of it.

00010FFC 
00010FFC 
00010FFC 
00010FFC 

This flash performance also feels right relative to running from sram; flash is slow and there is not much you can do about it, so the numbers above did seem strange. And this demonstrates how many traps you can fall into in performance testing, and why all benchmarks are b......t.

And since I am having so much fun with this answer, also note that unaligned reads should take a performance hit as well: assuming the sram is 32 bits wide, it takes two sram bus cycles to read unaligned versus one cycle aligned, and that should create back pressure if you hit it hard enough.

With BTAC disabled

ra=swtest(0x20002000,0x1000,STK_CVR);  hexstring(ra%0x00FFFFFF);
ra=swtest(0x20002000,0x1000,STK_CVR);  hexstring(ra%0x00FFFFFF);
ra=swtest(0x20002002,0x1000,STK_CVR);  hexstring(ra%0x00FFFFFF);
ra=swtest(0x20002002,0x1000,STK_CVR);  hexstring(ra%0x00FFFFFF);

ra=lwtest(0x20002000,0x1000,STK_CVR);  hexstring(ra%0x00FFFFFF);
ra=lwtest(0x20002000,0x1000,STK_CVR);  hexstring(ra%0x00FFFFFF);
ra=lwtest(0x20002002,0x1000,STK_CVR);  hexstring(ra%0x00FFFFFF);
ra=lwtest(0x20002002,0x1000,STK_CVR);  hexstring(ra%0x00FFFFFF);

store word aligned
00019FFE 
00019FFE 
store word unaligned
00030007 
00030007 
load word aligned
00020001 
00020001 
load word unaligned
0002A00C 
0002A00C 

So if your memcpy is from 0x1000 to 0x2002, or from 0x1001 to 0x2002, then even if you align one pointer up front and then do word-based copies, you still get a performance hit. Which is why I mention that you need to try different alignments.

On one of your questions too: I remember the full-sized arm memcpy from years ago; I think in newlib they had a few performance tiers. For example, if the amount to copy was less than some threshold they would just do a byte loop, done. Otherwise they would at least try to align one pointer: if it started at 0x1001 they would do one byte, one halfword, then a bunch of words (or multiple words at a time), then, based on length, an extra halfword or byte at the end to finish. But that only works if both pointers are aligned, or misaligned the same way.
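That "aligned or misaligned the same way" condition is a one-liner worth stating explicitly (hypothetical helper name, not from newlib): the align-then-copy-words trick only pays off when the low address bits of source and destination agree.

```c
#include <stdint.h>

/* True if src and dst can be brought to word alignment together,
   i.e. their addresses are congruent modulo the word size (4 here). */
static int co_aligned(const void *dst, const void *src)
{
    return (((uintptr_t)dst ^ (uintptr_t)src) & 3u) == 0;
}
```

So 0x1000 → 0x2000 and 0x1001 → 0x2001 can both reach a pure word loop after at most a short byte prologue, but 0x1000 → 0x2002 never can: one side stays misaligned no matter how you lead in.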

From your table it did not seem to me that you were taking all of these factors into account. You fell into the benchmarks-are-b......t trap, with one benchmark representing one source code, even though that core/chip/system can run that code in a different number of clocks, sometimes strictly as a result of the C compiler and linker and no other factors.

And again

beg=get_timer();
for(i = 0;i<1000;i++)
{
  memcpy(a,b);
}
end=get_timer();

amplifies your measurement error. The for loop calling memcpy is itself subject to fetching and branch prediction. I hope you are not testing like this.
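A safer pattern is to time each call on its own and keep the minimum, so the cold first run (caches, BTAC) cannot skew the result. A sketch of the bookkeeping; the canned counter values here mirror the 0x1029/0x1006 runs above so it runs anywhere, and on target they would come from DWT->CYCCNT or SysTick deltas around single calls:

```c
#include <stdint.h>

/* Canned per-call cycle counts: one cold run, then steady state,
   mirroring the 0x1029 / 0x1006 measurements shown earlier. */
static const uint32_t sample[] = { 0x1029, 0x1006, 0x1006, 0x1006 };

/* Return the minimum of n single-call measurements; the min discards
   one-off effects like a cold flash cache or untrained predictor. */
static uint32_t min_cycles(const uint32_t *runs, unsigned n)
{
    uint32_t best = UINT32_MAX;
    for (unsigned i = 0; i < n; i++)
        if (runs[i] < best)
            best = runs[i];
    return best;
}
```

This reports the steady-state cost of one copy instead of an average that folds in loop overhead, fetch effects, and warm-up.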
