Optimize and rewrite the following C code

Question

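This is a textbook problem that involves rewriting some C code so that it performs best on a given processor architecture.

Given: target a single superscalar processor with 4 adder units and 2 multiplier units.

Input structure (initialized elsewhere):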

struct s {
    short a;
    unsigned v;
    short b;
} input[100];

Here is the routine that operates on this data. Obviously correctness must be preserved, but the goal is to optimize the crap out of it.

int compute(int x, int *r, int *q, int *p) {

    int i;
    for(i = 0; i < 100; i++) {

        *r *= input[i].v + x;
        *p = input[i].v;
        *q += input[i].a + input[i].v + input[i].b;
    }

    return i;
}

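So this routine has 3 arithmetic statements that update the integers r, q and p.

Here is my attempt, with comments explaining what I was thinking: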

//Use temp variables so we don't keep using loads and stores for mem accesses; 
//hopefully the temps will just be kept in the register file
int r_temp = *r;
int q_temp = *q;

for (i=0;i<99;i = i+2) {
    struct s data1 = input[i];
    struct s data2 = input[i+1]; //going to try partially unrolling loop
    int a1 = data1.a;
    int a2 = data2.a;
    int b1 = data1.b;
    int b2 = data2.b;
    int v1 = data1.v;
    int v2 = data2.v;

    //I will use brackets to make clear the order of operations I was planning
    //with respect to the functional (adder, mul) units available

    //This is calculating the next iteration's new q value 
    //from q += v1 + a1 + b1, or q(new)=q(old)+v1+a1+b1

    q_temp = ((v1+q_temp)+(a1+b1)) + ((a2+b2)+v2);
    //For the first step I am trying to use a max of 3 adders in parallel, 
    //saving one to start the next computation

    //This is calculating next iter's new r value 
    //from r *= v1 + x, or r(new) = r(old)*(v1+x)

    r_temp = ((r_temp*v1) + (r_temp*x)) * (v2+x);
}
//Because i will end on i=98 and I only unrolled by 2, I don't need to 
//worry about final few values because there will be none

*p = input[99].v; //Why it's in the loop I don't understand, this should be correct
*r = r_temp;
*q = q_temp;
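
For comparison, here is a compilable sketch of the attempt above next to the routine from the question. It is only a sketch under stated assumptions: the names compute_ref, compute_unrolled, r_acc, q_acc and ux are made up for illustration, and r, q and p are assumed not to alias each other or input[]. Because v is unsigned, the original expressions are evaluated in unsigned arithmetic, so the sketch keeps unsigned accumulators to preserve the same wraparound behaviour.

```c
#include <assert.h>

#define N 100  /* element count from the question */

struct s {
    short a;
    unsigned v;
    short b;
} input[N];

/* Reference: the routine as posted in the question. */
int compute_ref(int x, int *r, int *q, int *p) {
    int i;
    for (i = 0; i < N; i++) {
        *r *= input[i].v + x;
        *p = input[i].v;
        *q += input[i].a + input[i].v + input[i].b;
    }
    return i;
}

/* Unrolled-by-2 variant (hypothetical name). Assumes r, q and p do not
 * alias each other or input[]; with aliasing the results could differ. */
int compute_unrolled(int x, int *r, int *q, int *p) {
    assert(N % 2 == 0);            /* no remainder loop is provided */
    unsigned r_acc = (unsigned)*r; /* accumulators instead of *r and *q */
    unsigned q_acc = (unsigned)*q;
    unsigned ux = (unsigned)x;

    for (int i = 0; i < N; i += 2) {
        struct s d1 = input[i];
        struct s d2 = input[i + 1];

        /* Parenthesised to mirror the intended adder schedule; the
         * compiler remains free to reorder these unsigned adds. */
        q_acc = ((d1.v + q_acc) + (unsigned)(d1.a + d1.b))
              + ((unsigned)(d2.a + d2.b) + d2.v);

        /* r*(v1+x)*(v2+x), with the first factor distributed so that
         * r*v1 and r*x do not depend on each other. */
        r_acc = ((r_acc * d1.v) + (r_acc * ux)) * (d2.v + ux);
    }

    *p = (int)input[N - 1].v;      /* only the last store to *p survives */
    *r = (int)r_acc;
    *q = (int)q_acc;
    return N;
}
```

Calling both functions with the same x and the same starting values for r, q and p, then comparing the outputs, is a cheap way to check that the rewrite preserves behaviour before measuring anything.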

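OK, so what does my solution give me? Looking at the old code, each loop iteration of i takes a minimum latency of max((1A + 1M), (3A)), where the former is for computing the new r and the latency of 3 adds is for q.

In my solution, I am unrolling by 2 and trying to compute the second new value of r and q. Assuming the adder/multiplier latencies are related by M = c*A (c is some integer > 1), the multiply operations for r are definitely the rate-limiting step, so I focus on those. I tried to use the multipliers in parallel as much as I could.

In my code, 2 multipliers are first used in parallel to compute the intermediate products, then an add must combine these, then a final multiply produces the last result. So for 2 new values of r (even though I only keep/care about the last one), it takes me (1M // 1M // 1A) + 1A + 1M, for a total latency of 2M + 1A. Dividing by 2, my latency per loop value is 1M + 0.5A. I calculate the original latency per value (for r) to be 1A + 1M. So if my code is correct (I did this all by hand and haven't tested it yet!) then I get a small performance boost.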
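
The bookkeeping in the preceding paragraphs can be written out step by step. This uses the question's own symbols A and M (one add latency, one multiply latency); the labels T_unrolled and T_orig are introduced here only for illustration, and the figures are the asker's estimates rather than measured results.

```latex
\begin{align*}
T_{\text{unrolled}}
  &= \underbrace{M}_{r v_1 \,\parallel\, r x \,\parallel\, (v_2 + x)}
   + \underbrace{A}_{r v_1 + r x}
   + \underbrace{M}_{\times\,(v_2 + x)}
   = 2M + A
   && \text{(two values of } r\text{)} \\
\frac{T_{\text{unrolled}}}{2} &= M + \tfrac{1}{2}A
   && \text{(per value)} \\
T_{\text{orig}} &= A + M
   && \text{(per value, as estimated above)} \\
T_{\text{orig}} - \frac{T_{\text{unrolled}}}{2} &= \tfrac{1}{2}A
\end{align*}
```

So under these assumptions the estimated saving is half an add latency per value of r, which matches the "small performance boost" described above.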

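Also, by not accessing and updating the pointers directly inside the loop (mainly thanks to the temp variables r_temp and q_temp), I hopefully save some load/store latency.

That was my stab at it. Definitely interested to see what others come up with that's better!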

Answer 1

Yes, it is possible to take advantage of the two shorts. Rearrange your struct like so

struct s {
    unsigned v;
    short a;
    short b;
} input[100];

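and you might get better alignment of the fields in memory on your architecture, which might allow more of these structs to fit in the same memory page, which in turn might mean fewer page faults.

This is all speculative, which is why profiling is so important.

If you have the right architecture, the rearrangement gives you better data structure alignment, resulting in higher data density in memory, because fewer bytes are lost to the padding required to keep each type aligned to the data boundaries imposed by common memory architectures.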
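
One way to see whether the reordering helps on a given compiler and target is to print the size of both layouts. The sketch below is an illustration only: the names s_original, s_reordered, padding_bytes and print_layouts are hypothetical, and struct layout is implementation-defined. On a typical ABI with a 2-byte short and a 4-byte, 4-byte-aligned unsigned, one would expect 12 bytes for the original layout and 8 for the reordered one, but that should be checked rather than assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Layout from the question: v sits between the two shorts. */
struct s_original {
    short a;
    unsigned v;
    short b;
};

/* Layout suggested in the answer: the two shorts are adjacent. */
struct s_reordered {
    unsigned v;
    short a;
    short b;
};

/* Padding = total struct size minus the sum of the member sizes. */
size_t padding_bytes(size_t struct_size) {
    size_t members = sizeof(unsigned) + 2 * sizeof(short);
    assert(struct_size >= members); /* a struct holds all its members */
    return struct_size - members;
}

/* Print size and padding for both layouts on the current target. */
void print_layouts(void) {
    printf("original:  size %zu, padding %zu\n",
           sizeof(struct s_original),
           padding_bytes(sizeof(struct s_original)));
    printf("reordered: size %zu, padding %zu\n",
           sizeof(struct s_reordered),
           padding_bytes(sizeof(struct s_reordered)));
}
```

Calling print_layouts() from main shows the per-element sizes; multiplying by 100 gives the footprint of the whole input array for each layout.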

Solution 1
3 2012-09-11 18:42:57
