Optimizing nested loops that fill an array, to help the compiler generate efficient ARM assembly?

Question

I have just been given an assignment to rewrite the following C function so that the ARM compiler produces more efficient assembly code. Does anyone know how to do this?

void some_function(int *data)
{
    int  i, j;

    for (i = 0; i < 64; i++)
    {
        for (j = 0; j < 64; j++)
            data[j + 64*i] = (i + j)/2;
    }
}

Answer 1

First (as Jonathan Leffler mentioned), the compiler is likely to do such a good job already that trying to optimise by writing specific C code is usually commercially questionable, i.e. you lose more money in development time than you gain from slightly faster code.
But sometimes it is worth it; let's assume that is the case here.

If you do optimise, measure while you do so. It is quite possible to write code that ends up slower, because it subtly defeats optimisations the compiler could otherwise have applied. Also, whether and how much an optimisation helps depends on the environment, i.e. you need to measure in every environment you care about.
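To make "measure" a bit more concrete, here is a minimal sketch of a timing helper. It assumes a hosted environment where the standard clock() is available; on a bare-metal ARM target you would more likely read a hardware cycle counter instead. The names time_fill and fill_reference are made up for this illustration and do not appear in the question.

```c
#include <assert.h>
#include <time.h>

#define N 64  /* matrix edge length, taken from the question */

/* Straightforward version from the question, used as the baseline. */
static void fill_reference(int *data)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            data[j + N * i] = (i + j) / 2;
}

/* Hypothetical helper: call fn `reps` times and return the elapsed
   CPU time in seconds as reported by clock(). */
static double time_fill(void (*fn)(int *), int reps)
{
    static int buffer[N * N];
    volatile int sink = 0;  /* read a result each round so the work is not discarded */

    assert(reps > 0);

    clock_t start = clock();
    for (int r = 0; r < reps; r++) {
        fn(buffer);
        sink += buffer[r % (N * N)];
    }
    clock_t stop = clock();

    (void)sink;
    return (double)(stop - start) / CLOCKS_PER_SEC;
}
```

Calling time_fill once per candidate with the same repetition count, on each compiler and optimisation level you care about, gives numbers you can compare. A single run is noisy, so repeat the measurement a few times before drawing conclusions.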

OK, after that wisecracking, here is code demonstrating the optimisations proposed in the comments, one of them by Jonathan Leffler:

/* Jonathan Leffler */
void some_function(int *data)
{
    int  i, j;
    int  k = 0;

    for (i = 0; i < 64; i++)
    {
        for (j = 0; j < 64; j++)
        {
            data[k++] = (i + j)/2;
        }
    }
}

/* Yunnosch 1, loop unrolling by 2 */
void some_function(int *data)
{
    int  i, j;

    for (i = 0; i < 64; i++)
    {
        for (j = 0; j < 64; j+=2)
        {
            data[j +     64*i] = (i + j  )/2;
            data[j + 1 + 64*i] = (i + j+1)/2;
        }
    }
}

/* Yunnosch 1 and Jonathan Leffler */
void some_function(int *data)
{
    int  i, j;
    int k=0; /* Jonathan Leffler */

    for (i = 0; i < 64; i++)
    {
        for (j = 0; j < 64; j+=2) /* Yunnosch */
        {
            data[k++] = (i + j  )/2;
            data[k++] = (i + j+1)/2; /* Yunnosch */
        }
    }
}

/* Yunnosch 2, avoiding the /2, including Jonathan Leffler */
/* Well, duh. This is harder than I thought... 
   I admit that this is NOT tested, I want to demonstrate the idea.
   Everybody feel free to help the very grateful me with fixing errors. */
void some_function(int *data)
{
    int  i, j;
    int  k=0;

    for (i = 0; i < 32; i++) /* magic numbers I normally avoid, 32 is 64/2 */
    {
        for (j = 0; j < 32; j++)
        {
            data[k     ] = (i + j);
            data[k+1   ] = (i + j);
            data[k  +64] = (i + j);
            data[k+1+64] = (i + j +1);
            k+=2;
        }
        k+=64;
    }
}

The last version is based on the following 2x2 group pattern, which you can observe in the desired result when it is viewed as a 2D array:

00 11 ...
01 12 ...

11 22 ...
12 23 ...
.. ..
.. ..
.. ..
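Since the last version is marked as untested, a small check against the original loop is useful. The following is only a sketch: fill_reference, fill_blocks, count_mismatches and check_fill_blocks are names invented here, and fill_blocks is the 2x2 version from above with 64 and 32 replaced by N and N/2.

```c
#include <assert.h>

#define N 64  /* matrix edge length, taken from the question */

/* Straightforward version from the question, used as the reference. */
static void fill_reference(int *data)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            data[j + N * i] = (i + j) / 2;
}

/* The 2x2-group version: each step writes two cells in an even row
   and the two cells directly below them in the following odd row. */
static void fill_blocks(int *data)
{
    int k = 0;

    for (int i = 0; i < N / 2; i++) {
        for (int j = 0; j < N / 2; j++) {
            data[k]         = i + j;
            data[k + 1]     = i + j;
            data[k + N]     = i + j;
            data[k + 1 + N] = i + j + 1;
            k += 2;
        }
        k += N;  /* skip the odd row that was already filled */
    }
}

/* Hypothetical helper: number of cells where candidate and reference differ. */
static int count_mismatches(void (*candidate)(int *))
{
    static int expected[N * N];
    static int actual[N * N];
    int mismatches = 0;

    fill_reference(expected);
    for (int n = 0; n < N * N; n++)
        actual[n] = -1;  /* poison value, so cells never written are detected */
    candidate(actual);

    for (int n = 0; n < N * N; n++)
        if (actual[n] != expected[n])
            mismatches++;
    return mismatches;
}

/* Convenience wrapper: abort if the 2x2 version disagrees with the reference. */
static void check_fill_blocks(void)
{
    assert(count_mismatches(fill_blocks) == 0);
}
```

The same count_mismatches call works for any other variant with the signature void f(int *), so each rewrite can be checked before it is timed.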

Answer 2

Optimizing C code to generate "more efficient assembly code" for a specific compiler/processor is something you normally shouldn't do. Write clear, easy-to-understand C code and let the compiler do the optimization.

Even if you play all kinds of tricks with the C code and end up with "more efficient assembly code" for your specific compiler/processor, a simple compiler upgrade may ruin the whole thing, and you'll have to change the C code again.

For something as simple as your code, you could instead write it in assembler from the start. But be aware that you'll have to be a real expert in that processor and its assembly language to beat a decent compiler.

Anyway... If we want to guess, this is my guess:

void some_function(int *data)
{
    int  i, j, x;

    for (i = 0; i < 64; i++)
    {
        // Handle even i-values
        x = i/2;
        for (j = 0; j < 64; j += 2)
        {
            *data = x;
            ++data;
            *data = x;
            ++data;
            ++x;        // Increment after writing to data twice
        }

        ++i;

        // Handle odd i-values
        x = i/2;
        for (j = 0; j < 64; j += 2)
        {
            *data = x;
            ++data;
            ++x;        // Increment after writing to data once
            *data = x;
            ++data;
        }
    }
}

The idea is 1) to replace the array indexing with pointer increments and 2) to replace the (i+j)/2 calculation with integer increments.
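Point 2) works because, for non-negative integers, (i + j)/2 goes up by exactly one when i + j moves from an odd to an even value, and stays the same otherwise. So in an even row the value steps up after every second write, and in an odd row it steps up after the first write and then after every second one. A minimal sketch that checks this over an n x n range (increments_match is a name made up for this illustration):

```c
#include <assert.h>

/* Hypothetical check: walk each row with a running value x instead of
   computing (i + j)/2, and report whether the two always agree.
   Returns 1 if they agree everywhere, 0 otherwise. */
static int increments_match(int n)
{
    assert(n > 0);

    for (int i = 0; i < n; i++) {
        int x = i / 2;  /* first value of row i */

        for (int j = 0; j < n; j++) {
            if (x != (i + j) / 2)
                return 0;

            /* The quotient steps up when i + j goes from odd to even. */
            if ((i + j) % 2 == 1)
                x++;
        }
    }
    return 1;
}
```

For the 64 x 64 case in the question, increments_match(64) returns 1, which is what allows the division to be dropped from the inner loop.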

I have not done any measurements, so I can't say for sure that this will be a good solution. I'll leave that to the OP.

Same idea as above, but with a few more tweaks (proposed by @user3386109):

void some_function(int *data)
{
    for (int i = 0; i < 32; i++)
    {
        // even row (2*i): the output is in matched pairs
        int value = i;
        for (int j = 0; j < 32; j++)
        {
            *data++ = value;
            *data++ = value++;
        }

        // odd row (2*i + 1): the output starts with a singleton,
        // followed by matched pairs, and ends with a singleton
        value = i;
        *data++ = value++;
        for (int j = 0; j < 31; j++)
        {
            *data++ = value;
            *data++ = value++;
        }
        *data++ = value;
    }
}

Answer 1
4 2020-04-29 06:09:24

Answer 2
4 2020-04-29 06:59:37
