C function optimization

Question

I have a function that looks roughly like the one below. The constants a1 through e8 (double-precision floats) are in the code, either hardcoded or #defined. The function accepts doubles in the range -1.0 to 1.0, and that range needs to be split into quarter-width intervals as shown.

Are there any other code optimizations I can make to improve runtime performance before resorting to assembly language? I tried introducing an x2 to hold x*x and multiplying the e constants by x2*x2, but that actually slowed things down. I also tried casting a copy of x to an integer and using a switch statement, but that was slower too.

double operation(double x) {
    if (x <= -0.75 && x >= -1.0) {
        return a1 + b1*x + c1*x*x + d1*x*x*x + e1*x*x*x*x;
    }
    else if (x <= -0.5) {
        return a2 + b2*x - c2*x*x - d2*x*x*x - e2*x*x*x*x;
    }
    else if (x <= -0.25) {
        return a3 - b3*x - c3*x*x - d3*x*x*x - e3*x*x*x*x;
    }
    else if (x <= 0.0) {
        return a4 - b4*x - c4*x*x - d4*x*x*x + e4*x*x*x*x;
    }
    else if (x <= 0.25) {
        return a5 + b5*x - c5*x*x + d5*x*x*x + e5*x*x*x*x;
    }
    else if (x <= 0.5) {
        return a6 + b6*x - c6*x*x + d6*x*x*x - e6*x*x*x*x;
    }
    else if (x <= 0.75) {
        return a7 - b7*x - c7*x*x + d7*x*x*x - e7*x*x*x*x;
    }
    else if (x <= 1.0) {
        return a8 - b8*x + c8*x*x - d8*x*x*x + e8*x*x*x*x;
    }
    return 0.0;
}

Answer 1

Other than using a compiler flag (-Ofast on Linux/Mint19), which speeds things up by a factor of about 2.5 (100,000,000 calls, mostly within the range), there are a few minor adjustments that can help:

Replace the if chain with lookups (see below). This reduces the time spent finding the right case. The coefficient signs have been adjusted to match the additions and subtractions in each formula.

This gives roughly a 25% speedup.

Original code, un-optimized: 2.154
Original code, optimized with -Ofast: 0.678
Modified code, -Ofast: 0.581

double operation(double x) {
    // Per-segment coefficients, with the sign of each term folded in.
    static const double aa[] = { a1, a2, a3, a4, a5, a6, a7, a8 };
    static const double bb[] = { b1, b2, -b3, -b4, b5, b6, -b7, -b8 };
    static const double cc[] = { c1, -c2, -c3, -c4, -c5, -c6, -c7, c8 };
    static const double dd[] = { d1, -d2, -d3, -d4, d5, d6, d7, -d8 };
    static const double ee[] = { e1, -e2, -e3, e4, e5, -e6, -e7, e8 };

    if (x < -1.0 || x > 1.0) {
        return 0;
    }
    int p = x*4 + 4;        // segment index 0..7; exactly 8 when x == 1.0
    if (p > 7) p = 7;       // keep x == 1.0 inside the tables
    return aa[p] + bb[p]*x + cc[p]*x*x + dd[p]*x*x*x + ee[p]*x*x*x*x;
}

Note: I believe the original code has a minor bug. For any value below -1 it uses the coefficients of the (x <= -0.5) branch. I believe the intention was that anything outside -1..+1 should return 0.
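The index computation in the lookup version can be isolated and checked on its own. The following is a minimal sketch, not part of the original answer; the name segment_index and the -1 out-of-range return value are assumptions made for illustration. Note that with this mapping a boundary value such as -0.75 or 0.0 lands in the upper of its two neighbouring segments, whereas the original if chain (which tests x <= bound) puts it in the lower one.

```c
#include <assert.h>

#define NSEG 8  /* eight quarter-width segments covering [-1.0, 1.0] */

/* Hypothetical helper: returns the segment index 0..7 for x in
 * [-1.0, 1.0], or -1 if x is outside that range (including NaN). */
int segment_index(double x) {
    if (!(x >= -1.0 && x <= 1.0)) {
        return -1;                  /* out of range, or NaN */
    }
    int p = (int)(x * 4.0 + 4.0);   /* 0..8; exactly 8 only when x == 1.0 */
    if (p >= NSEG) {
        p = NSEG - 1;               /* fold x == 1.0 into the last segment */
    }
    assert(p >= 0 && p < NSEG);     /* safe to use as a table index */
    return p;
}
```

A caller would return 0.0 when the result is -1 and otherwise use it to index the aa..ee tables.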

Answer 2

Are there any other code optimizations I can make to increase runtime performance before assembly language optimization?

Rearranging the comparisons so that you're basically doing a binary search for the right case, rather than a linear one, speeds things up quite a bit:

double op2(double x) {
    if (x <= 0) {
        if (x <= -0.5) {
            if (x <= -0.75 && x >= -1.0) {
                return a1 + b1*x + c1*x*x + d1*x*x*x + e1*x*x*x*x;
            }
            return a2 + b2*x - c2*x*x - d2*x*x*x - e2*x*x*x*x;
        }
        else {
            if (x <= -0.25) {
                return a3 - b3*x - c3*x*x - d3*x*x*x - e3*x*x*x*x;
            }
            return a4 - b4*x - c4*x*x - d4*x*x*x + e4*x*x*x*x;
        }
    }
    else {
        if (x <= 0.5) {
            if (x <= 0.25) {
                return a5 + b5*x - c5*x*x + d5*x*x*x + e5*x*x*x*x;
            }
            return a6 + b6*x - c6*x*x + d6*x*x*x - e6*x*x*x*x;
        }
        else {
            if (x <= 0.75) {
                return a7 - b7*x - c7*x*x + d7*x*x*x - e7*x*x*x*x;
            }
            else if (x <= 1.0) {
                return a8 - b8*x + c8*x*x - d8*x*x*x + e8*x*x*x*x;
            }
        }
    }
    return 0.0;
}

I tested this by calling the original version (op1) and my version (op2) inside the same loop with the same random input in the range [-1.0, 1.0]. Both functions return the same value. Profiling the code over a hundred million iterations of the loop, I got the following results:

[profiler output not preserved]

So, the op2 version is a little less than twice as fast as the original.
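The statement that both functions return the same value can be checked mechanically before timing anything. The sketch below is an illustration rather than the answer's actual test harness: max_abs_diff, quartic_naive and quartic_x2 are hypothetical names, the coefficients are made up, and it samples evenly spaced points instead of random ones. A small tolerance is used because reordering floating-point arithmetic can change the last few bits of a result.

```c
#include <assert.h>

typedef double (*unary_fn)(double);

/* Largest absolute difference between f and g over `samples` evenly
 * spaced points in [lo, hi], both endpoints included. */
double max_abs_diff(unary_fn f, unary_fn g, double lo, double hi, int samples) {
    assert(samples >= 2 && lo < hi);
    double worst = 0.0;
    for (int i = 0; i < samples; ++i) {
        double x = lo + (hi - lo) * i / (samples - 1);
        double diff = f(x) - g(x);
        if (diff < 0.0) diff = -diff;
        if (diff > worst) worst = diff;
    }
    return worst;
}

/* Two stand-in implementations of the same quartic (made-up coefficients). */
double quartic_naive(double x) {
    return 1.0 + 2.0*x + 3.0*x*x + 4.0*x*x*x + 5.0*x*x*x*x;
}

double quartic_x2(double x) {
    double x2 = x * x;
    return 1.0 + 2.0*x + 3.0*x2 + 4.0*x*x2 + 5.0*x2*x2;
}
```

In practice op1 and op2 would be passed in place of the two stand-ins, with lo = -1.0 and hi = 1.0.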

Update:

I also tested a version that maps the input to an integer and then switches on that. This only works because the intervals are all the same size, so whereas the approach in op2 could work for arbitrary intervals, this one won't. To do the mapping I add 1 to the input, which shifts the input range to [0, 2.0], and then multiply by 4, which expands the range to [0, 8.0]. Then I convert it to int so that we can switch on it. The nice thing about a switch statement with a number of consecutive values is that the compiler can implement it as a jump table, which makes it very fast. The cost is that extra floating-point multiplication. Here's the function:

double op3(double x) {
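    if (!(x >= -1.0 && x <= 1.0)) {
        return 0.0;                // out of range (or NaN)
    }
    int c = (int)((x + 1) * 4);    // mapping from double to int: 0..8
    if (c == 8) {
        c = 7;                     // x == 1.0 belongs to the last interval
    }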
    switch (c) {
        case 0: {
            return a1 + b1*x + c1*x*x + d1*x*x*x + e1*x*x*x*x;
        }
        case 1: {
            return a2 + b2*x - c2*x*x - d2*x*x*x - e2*x*x*x*x;
        }
        case 2: {
            return a3 - b3*x - c3*x*x - d3*x*x*x - e3*x*x*x*x;
        }
        case 3: {
            return a4 - b4*x - c4*x*x - d4*x*x*x + e4*x*x*x*x;
        }
        case 4: {
            return a5 + b5*x - c5*x*x + d5*x*x*x + e5*x*x*x*x;
        }
        case 5: {
            return a6 + b6*x - c6*x*x + d6*x*x*x - e6*x*x*x*x;
        }
        case 6: {
            return a7 - b7*x - c7*x*x + d7*x*x*x - e7*x*x*x*x;
        }
        case 7: {
            return a8 - b8*x + c8*x*x - d8*x*x*x + e8*x*x*x*x;
        }
        default: {
            return 0.0;
        }
    }
}

And the results:

[profiler output not preserved]

So, op3 is a lot faster than the original op1, but op2 is still the winner in this case. If you had more cases, though, I think you'd eventually reach a point where the cost of mapping the input to an integer is less than the cost of the comparisons in op2.

Looking at the three functions, you can see that the complexity of the op1 approach is O(n), where n is the number of intervals. The op2 approach is O(log n), since n intervals need log n levels of comparison. And the op3 approach is O(1): once you map the input to an interval, the switch statement can use a jump table to find the right case in constant time.
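For intervals that are not all the same width, the O(log n) search that op2 hard-codes as nested ifs can also be written as a loop over a sorted table of upper bounds. This is a minimal sketch under that assumption; upper_bounds and find_interval are hypothetical names that do not appear in the answer, and the caller still has to reject inputs below the lowest interval.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical table: upper bound of each interval, sorted ascending. */
const double upper_bounds[8] = {
    -0.75, -0.5, -0.25, 0.0, 0.25, 0.5, 0.75, 1.0
};

/* Returns the index of the first interval whose upper bound is >= x,
 * or n if x is above every bound (also the result for NaN).
 * Uses O(log n) comparisons. */
size_t find_interval(const double *bounds, size_t n, double x) {
    assert(bounds != NULL);
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (x <= bounds[mid]) {
            hi = mid;        /* answer is mid or to its left */
        } else {
            lo = mid + 1;    /* answer is strictly to the right */
        }
    }
    return lo;
}
```

Because it tests x <= bound, this places boundary values in the same interval as op1 and op2 do. Whether the loop beats the hand-unrolled ifs for only eight intervals would need to be measured.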

Answer 3

Using clang on a vanilla Mac:

double dcos(double a, double b, double c, double d, double e, double x) {
        return a + b * x + c * x * x + d * x * x * x + e * x * x * x * x;
}

generated 10 mulsd and 4 addsd, whereas:

double dcos(double a, double b, double c, double d, double e, double x) {
        double x2 = x * x;
        return a + b * x + c * x2 + d * x * x2 + e * x2 * x2;
}

generated 7 mulsd and 3 addsd. It may be a little less numerically stable, but it does make a difference. In a quick and dirty test, it shaved about 16% off the run time.

bfm:tmp steve$ cc -O3 m.c m2.c -o m2
bfm:tmp steve$ cc -O3 m.c m1.c -o m1
bfm:tmp steve$ time ./m1
inf

real    0m4.136s
user    0m4.100s
sys     0m0.026s
bfm:tmp steve$ time ./m2
inf

real    0m3.501s
user    0m3.475s
sys     0m0.023s
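Going one step further than factoring out x*x, the polynomial can be written in Horner form, which needs only 4 multiplications and 4 additions for a quartic. This is a sketch of that standard technique, not something the answer measured; poly_horner and demo_coef are hypothetical names. Because each step depends on the previous one, it is not guaranteed to be faster on a given CPU, so it would need the same kind of timing test.

```c
#include <assert.h>
#include <stddef.h>

/* Made-up coefficients for 1 + 2x + 3x^2 + 4x^3 + 5x^4, lowest power first. */
const double demo_coef[5] = { 1.0, 2.0, 3.0, 4.0, 5.0 };

/* Evaluates coef[0] + coef[1]*x + ... + coef[n-1]*x^(n-1) with Horner's
 * rule: one multiplication and one addition per coefficient below the
 * highest one. */
double poly_horner(const double *coef, size_t n, double x) {
    assert(coef != NULL && n > 0);
    double acc = coef[n - 1];
    for (size_t i = n - 1; i-- > 0; ) {
        acc = acc * x + coef[i];
    }
    return acc;
}
```

For the function in the question, each interval's a, b, c, d, e values (with their signs folded in) would form one such coefficient row.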

3 answers

Answer 1
3 2019-11-23 06:45:42

Answer 2
3 (accepted) 2019-11-23 08:29:42

Answer 3
1 2019-11-22 21:06:13
