Optimizing ARM Cortex M3 code

Question

I have a C function which copies a framebuffer to FSMC RAM.

The function drags the frame rate of the game loop down to 10 FPS. I would like to know how to analyze the disassembled function: should I count the cycles of each instruction? I want to know where the CPU spends its time, and in which part. I'm sure that the algorithm is also a problem, because it's O(N^2).

The C function is:

void LCD_Flip()
{

    u8  i,j;


    LCD_SetCursor(0x00, 0x0000);
    LCD_WriteRegister(0x0050,0x00);//GRAM horizontal start position
    LCD_WriteRegister(0x0051,239);//GRAM horizontal end position
    LCD_WriteRegister(0x0052,0);//Vertical GRAM Start position
    LCD_WriteRegister(0x0053,319);//Vertical GRAM end position
    LCD_WriteIndex(0x0022);

    for(j=0;j<fbHeight;j++)
    {
        for(i=0;i<240;i++)
        {
            u16 color = frameBuffer[i+j*fbWidth];
            LCD_WriteData(color);

        }
    }

}

Disassembled function:

08000fd0 <LCD_Flip>:
 8000fd0:   b580        push    {r7, lr}
 8000fd2:   b082        sub sp, #8
 8000fd4:   af00        add r7, sp, #0
 8000fd6:   2000        movs    r0, #0
 8000fd8:   2100        movs    r1, #0
 8000fda:   f7ff fde9   bl  8000bb0 <LCD_SetCursor>
 8000fde:   2050        movs    r0, #80 ; 0x50
 8000fe0:   2100        movs    r1, #0
 8000fe2:   f7ff feb5   bl  8000d50 <LCD_WriteRegister>
 8000fe6:   2051        movs    r0, #81 ; 0x51
 8000fe8:   21ef        movs    r1, #239    ; 0xef
 8000fea:   f7ff feb1   bl  8000d50 <LCD_WriteRegister>
 8000fee:   2052        movs    r0, #82 ; 0x52
 8000ff0:   2100        movs    r1, #0
 8000ff2:   f7ff fead   bl  8000d50 <LCD_WriteRegister>
 8000ff6:   2053        movs    r0, #83 ; 0x53
 8000ff8:   f240 113f   movw    r1, #319    ; 0x13f
 8000ffc:   f7ff fea8   bl  8000d50 <LCD_WriteRegister>
 8001000:   2022        movs    r0, #34 ; 0x22
 8001002:   f7ff fe87   bl  8000d14 <LCD_WriteIndex>
 8001006:   2300        movs    r3, #0
 8001008:   71bb        strb    r3, [r7, #6]
 800100a:   e01b        b.n 8001044 <LCD_Flip+0x74>
 800100c:   2300        movs    r3, #0
 800100e:   71fb        strb    r3, [r7, #7]
 8001010:   e012        b.n 8001038 <LCD_Flip+0x68>
 8001012:   79f9        ldrb    r1, [r7, #7]
 8001014:   79ba        ldrb    r2, [r7, #6]
 8001016:   4613        mov r3, r2
 8001018:   011b        lsls    r3, r3, #4
 800101a:   1a9b        subs    r3, r3, r2
 800101c:   011b        lsls    r3, r3, #4
 800101e:   1a9b        subs    r3, r3, r2
 8001020:   18ca        adds    r2, r1, r3
 8001022:   4b0b        ldr r3, [pc, #44]   ; (8001050 <LCD_Flip+0x80>)
 8001024:   f833 3012   ldrh.w  r3, [r3, r2, lsl #1]
 8001028:   80bb        strh    r3, [r7, #4]
 800102a:   88bb        ldrh    r3, [r7, #4]
 800102c:   4618        mov r0, r3
 800102e:   f7ff fe7f   bl  8000d30 <LCD_WriteData>
 8001032:   79fb        ldrb    r3, [r7, #7]
 8001034:   3301        adds    r3, #1
 8001036:   71fb        strb    r3, [r7, #7]
 8001038:   79fb        ldrb    r3, [r7, #7]
 800103a:   2bef        cmp r3, #239    ; 0xef
 800103c:   d9e9        bls.n   8001012 <LCD_Flip+0x42>
 800103e:   79bb        ldrb    r3, [r7, #6]
 8001040:   3301        adds    r3, #1
 8001042:   71bb        strb    r3, [r7, #6]
 8001044:   79bb        ldrb    r3, [r7, #6]
 8001046:   2b63        cmp r3, #99 ; 0x63
 8001048:   d9e0        bls.n   800100c <LCD_Flip+0x3c>
 800104a:   3708        adds    r7, #8
 800104c:   46bd        mov sp, r7
 800104e:   bd80        pop {r7, pc}

Answer 1

This does not exactly answer your question, but I see you are after fast execution of the loops.

Here are some tips from the book 'ARM System Developer's Guide: Designing and Optimizing System Software' (The Morgan Kaufmann Series in Computer Architecture and Design): http://www.amazon.com/ARM-System-Developers-Guide-Architecture/dp/1558608745

Chapter 5 contains a section named 'C looping structures'. Here is the summary of the section:

Writing Loops Efficiently

Use loops that count down to zero. Then the compiler does not need to allocate a register to hold the termination value, and the comparison with zero is free.
Use unsigned loop counters by default and the continuation condition i!=0 rather than i>0. This will ensure that the loop overhead is only two instructions.
Use do-while loops rather than for loops when you know the loop will iterate at least once. This saves the compiler checking to see if the loop count is zero.
Unroll important loops to reduce the loop overhead. Do not overunroll. If the loop overhead is small as a proportion of the total, then unrolling will increase code size and hurt the performance of the cache.
Try to arrange that the number of elements in arrays are multiples of four or eight. You can then unroll loops easily by two, four, or eight times without worrying about the leftover array elements.

Based on the summary, your inner loop might look as below.

unsigned int i = 240;  // Use unsigned loop counters by default
                       // and the continuation condition i!=0

do
{
    // Unroll important loops to reduce the loop overhead.
    // Note: this walks the row backwards, from pixel 239 down to 0.
    LCD_WriteData( (u16)frameBuffer[ (--i) + (j*fbWidth) ] );
    LCD_WriteData( (u16)frameBuffer[ (--i) + (j*fbWidth) ] );
    LCD_WriteData( (u16)frameBuffer[ (--i) + (j*fbWidth) ] );
    LCD_WriteData( (u16)frameBuffer[ (--i) + (j*fbWidth) ] );
}
while ( i != 0 );  // Use do-while loops rather than for
                   // loops when you know the loop will
                   // iterate at least once

You might also want to experiment with pragmas, e.g.:

#pragma Otime

http://www.keil.com/support/man/docs/armcc/armcc_chr1359124989673.htm

#pragma unroll(n)

http://www.keil.com/support/man/docs/armcc/armcc_chr1359124992247.htm

And since it is a Cortex-M3, try to find out whether the MCU hardware lets you arrange the code and data to take advantage of its Harvard architecture (I experienced a 30% speed increase).

See my other answer.

Not everything may be applicable in your application (for example, filling a buffer in reverse order). I just wanted to draw your attention to the book and to possible points for optimization.

Answer 2

You should start by compiling the C code with speed optimizations enabled. The disassembled code you provide appears to be storing the i and j counters on the stack, which adds several load/store operations to every iteration of the inner loop. You might also want to inline LCD_WriteData in the inner loop.

On the other hand, if you are really writing to the LCD in the inner loop, then the performance may be limited by that interface.
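The inlining suggestion above could be sketched roughly as follows. This is only an illustration under assumptions: the question does not show the body of LCD_WriteData, so the sketch assumes it boils down to a single 16-bit store to an FSMC-mapped data register. The names lcd_write_data_inline and lcd_copy_pixels are hypothetical, and the register address is passed in as a parameter because the real address is board-specific and not given in the question.

```c
#include <stdint.h>

/* Hypothetical inline replacement for LCD_WriteData: assumes a pixel
   write is one 16-bit store to the FSMC-mapped LCD data register. */
static inline void lcd_write_data_inline(volatile uint16_t *lcd_data,
                                         uint16_t value)
{
    *lcd_data = value;
}

/* Copies `count` pixels from `src` to the LCD data register.
   With optimization enabled, `src` and `count` can stay in registers
   instead of being reloaded from the stack on every iteration.
   `count` must be nonzero because the loop is a do-while. */
static void lcd_copy_pixels(volatile uint16_t *lcd_data,
                            const uint16_t *src, uint32_t count)
{
    do {
        lcd_write_data_inline(lcd_data, *src++);
    } while (--count != 0);
}
```

On the target, lcd_data would point at whatever address the LCD data register is mapped to; because the pointer is volatile, the compiler keeps every store even after inlining.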

Answer 3

Purely to reduce the number of looped operations, you could do something like the following. I did make some assumptions which may not be accurate: you had a loop that went from i=0:239, and I am assuming that fbWidth is the same as 240. If this isn't true then the loop would have to be more complicated.

void LCD_Flip()
{
    u32 i,limit = (u32)fbHeight*fbWidth;
    // We will use a precalculated limit (total pixel count) and one
    // single loop. u32 so that a full 240x320 frame does not overflow.

    LCD_SetCursor(0x00, 0x0000);
    LCD_WriteRegister(0x0050,0x00);//GRAM horizontal start position
    LCD_WriteRegister(0x0051,239);//GRAM horizontal end position
    LCD_WriteRegister(0x0052,0);//Vertical GRAM Start position
    LCD_WriteRegister(0x0053,319);//Vertical GRAM end position
    LCD_WriteIndex(0x0022);

    // Single loop from 0:limit-1 takes care of having to do an
    // x,y conversion each iteration.
    for(i=0;i<limit;i++)
    {
        u16 color = frameBuffer[i];
        LCD_WriteData(color);
    }
}

This strips out the two loops in favor of a single for loop with only one conditional test per iteration. On top of that, the indexing into frameBuffer is now linear, so we don't need to multiply by the width to go from x,y to linear storage. The number of loop iterations is not reduced (i.e. it is still O(N) with N = height*width), but the number of instructions per iteration should be.
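If fbWidth is not equal to the 240 pixels sent per row, the single linear loop no longer applies, but the per-pixel multiply can still be avoided by advancing a row pointer. The sketch below is one possible shape for that case, not the original poster's code: lcd_copy_rows and its parameters are hypothetical names, and LCD_WriteData is replaced by a host-side stand-in (its u16 parameter is an assumption) that records pixels into a small array so the example can run on a PC.

```c
#include <stdint.h>

/* Host-side stand-in for the real LCD_WriteData (assumption): it
   records pixels instead of driving the LCD. The 64-entry buffer is
   only large enough for small demonstrations. */
static uint16_t written[64];
static uint32_t written_count = 0;

static void LCD_WriteData(uint16_t color)
{
    written[written_count++] = color;
}

/* Sends the first `visible` pixels of each of `rows` rows.
   Rows are `stride` pixels apart in memory.
   Both `visible` and `rows` must be nonzero (do-while loops). */
static void lcd_copy_rows(const uint16_t *fb, uint32_t stride,
                          uint32_t visible, uint32_t rows)
{
    do {
        const uint16_t *p = fb;   /* start of the current row */
        uint32_t n = visible;
        do {
            LCD_WriteData(*p++);
        } while (--n != 0);
        fb += stride;             /* next row by addition, no multiply */
    } while (--rows != 0);
}
```

For the code in the question, the call would be something like lcd_copy_rows(frameBuffer, fbWidth, 240, fbHeight), assuming frameBuffer holds 16-bit pixels.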

As @Joe Hass noted in his answer, this may not actually help at all if you are really limited by the LCD interface. Depending on which STM32 you're using, the FSMC may not be particularly fast, and I can't imagine the LCD controller would be very fast either.
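One rough way to judge whether loop overhead or the interface dominates is to work out the cycle budget per pixel and compare it with the instruction count of the inner loop in the disassembly. The helper below is a back-of-the-envelope sketch; cycles_per_pixel is a hypothetical name, and the 72 MHz core clock used in the usage note is an assumption, since the question does not state the clock speed. It also assumes, pessimistically, that the whole frame time is spent copying pixels.

```c
#include <stdint.h>

/* Core clock cycles available per pixel if the entire frame time
   were spent copying pixels. All arguments must be nonzero. */
static uint32_t cycles_per_pixel(uint32_t core_hz, uint32_t fps,
                                 uint32_t width, uint32_t height)
{
    uint32_t cycles_per_frame = core_hz / fps;
    return cycles_per_frame / (width * height);
}
```

For example, cycles_per_pixel(72000000u, 10u, 240u, 100u) evaluates to 300, using the 240 pixels per row from the C code and the 100 rows implied by the `cmp r3, #99` in the disassembly. If the instruction count of the inner loop is far below that figure, the remaining time is probably going into LCD_WriteData, the FSMC bus, or the rest of the game loop rather than into loop overhead.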

3 answers

Answer 1: score 6, accepted, 2014-05-11 23:03:00

Answer 2: score 3

Answer 3: score 1, 2014-05-02 18:57:23