
Why would a recursive version of a function be faster than an iterative one in C?

I am checking the difference between two implementations of gradient descent; my guess was that after compiler optimization both versions of the algorithm would be equivalent.

To my surprise, the recursive version was significantly faster. I haven't ruled out an actual defect in either version, or even in the way I am measuring the time. Can you give me some insights, please?

This is my code:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>
#include <stdint.h>

double f(double x)
{
        return 2*x;
}

double descgrad(double xo, double xnew, double eps, double precision)
{
//      printf("step ... x:%f Xp:%f, delta:%f\n",xo,xnew,fabs(xnew - xo));

        if (fabs(xnew - xo) < precision)
        {
                return xnew;
        }
        else
        {
                descgrad(xnew, xnew - eps*f(xnew), eps, precision);
        }
}

double descgraditer(double xo, double xnew, double eps, double precision)
{
        double Xo = xo;
        double Xn = xnew;

        while(fabs(Xn-Xo) > precision)
        {
                //printf("step ... x:%f Xp:%f, delta:%f\n",Xo,Xn,fabs(Xn - Xo));
                Xo = Xn;
                Xn = Xo - eps * f(Xo);
        }

        return Xn;
}

int64_t timespecDiff(struct timespec *timeA_p, struct timespec *timeB_p)
{
  return ((timeA_p->tv_sec * 1000000000) + timeA_p->tv_nsec) -
           ((timeB_p->tv_sec * 1000000000) + timeB_p->tv_nsec);
}

int main()
{
        struct timespec s1, e1, s2, e2;

        clock_gettime(CLOCK_MONOTONIC, &s1);
        printf("Minimum : %f\n",descgraditer(100,99,0.01,0.00001));
        clock_gettime(CLOCK_MONOTONIC, &e1);

        clock_gettime(CLOCK_MONOTONIC, &s2);
        printf("Minimum : %f\n",descgrad(100,99,0.01,0.00001));
        clock_gettime(CLOCK_MONOTONIC, &e2);

        uint64_t dif1 = timespecDiff(&e1,&s1) / 1000;
        uint64_t dif2 = timespecDiff(&e2,&s2) / 1000;

        printf("time_iter:%llu ms, time_rec:%llu ms, ratio (dif1/dif2) :%g\n", dif1,dif2, ((double) ((double)dif1/(double)dif2)));

        printf("End. \n");
}

I am compiling with gcc 4.5.2 on Ubuntu 11.04 with the following options: gcc grad.c -O3 -lrt -o dg

The output of my code is:

Minimum : 0.000487
Minimum : 0.000487
time_iter:127 ms, time_rec:19 ms, ratio (dif1/dif2) :6.68421
End.

I read a thread which also asks about a recursive version of an algorithm being faster than the iterative one. The explanation there was that the recursive version used the stack while the other version used some vectors, so access to the heap was slowing down the iterative version. But in this case (to the best of my understanding) I am just using the stack in both cases.

Am I missing something? Anything obvious that I am not seeing? Is my way of measuring time wrong? Any insights?

EDIT: Mystery solved in a comment. As @TonyK said, the initialization of printf was slowing down the first execution. Sorry that I missed such an obvious thing.

BTW, the code compiles just fine without warnings. I don't think the "return descgrad(.." is necessary, since the stop condition is reached beforehand.
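
(For what it's worth, the "no warnings" part likely reflects gcc's default warning level: -Wall enables -Wreturn-type, which should flag that descgrad can fall off its end without returning a value.)

$ gcc -Wall grad.c -O3 -lrt -o dg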

I've compiled and run your code locally. Moving the printf outside of the timed block makes both versions execute in ~5ms every time.

So the central mistake in your timing is that you measure the complex beast printf, whose runtime dwarfs the code you are actually trying to measure.

My main() function now looks like this:

int main() {
    struct timespec s1, e1, s2, e2;

    double d = 0.0;

    clock_gettime(CLOCK_MONOTONIC, &s1);
    d = descgraditer(100,99,0.01,0.00001);
    clock_gettime(CLOCK_MONOTONIC, &e1);
    printf("Minimum : %f\n", d);

    clock_gettime(CLOCK_MONOTONIC, &s2);
    d = descgrad(100,99,0.01,0.00001);
    clock_gettime(CLOCK_MONOTONIC, &e2);
    printf("Minimum : %f\n",d);

    uint64_t dif1 = timespecDiff(&e1,&s1) / 1000;
    uint64_t dif2 = timespecDiff(&e2,&s2) / 1000;

    printf("time_iter:%llu ms, time_rec:%llu ms, ratio (dif1/dif2) :%g\n", dif1,dif2, ((double) ((double)dif1/(double)dif2)));

    printf("End. \n");
}

Is my way of measuring time wrong?

Yes. In the short timespans you are measuring, the scheduler can have a massive impact on your program. You need to either make your test much longer to average such differences out, or use CLOCK_PROCESS_CPUTIME_ID instead to measure the CPU time used by your process.
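
A minimal sketch of that change, reusing the question's timespecDiff helper (variable names here are illustrative):

struct timespec s1, e1;
double d;

clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &s1); /* CPU time of this process only */
d = descgraditer(100, 99, 0.01, 0.00001);
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &e1);
/* time spent preempted or sleeping is not counted, so the scheduler
   no longer inflates the measurement */
printf("Minimum : %f (%ld ns)\n", d, (long)timespecDiff(&e1, &s1));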

For one thing, your recursive step misses a return:

double descgrad(double xo, double xnew, double eps, double precision)
{
    if (fabs(xnew - xo) < precision)
        return xnew;
    else
        descgrad(xnew, xnew - eps*f(xnew), eps, precision);
}

Should be:

double descgrad(double xo, double xnew, double eps, double precision)
{
    if (fabs(xnew - xo) < precision)
        return xnew;
    else
        return descgrad(xnew, xnew - eps*f(xnew), eps, precision);
}

This oversight causes the return value of descgrad to be undefined, so the compiler barely has to generate code for it at all ;)
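
To spell that out (a sketch of the reasoning, not gcc's literal output): in the buggy version the recursive call has no side effects and its result is never returned, so the optimizer is allowed to drop it entirely:

double descgrad(double xo, double xnew, double eps, double precision)
{
    if (fabs(xnew - xo) < precision)
        return xnew;
    /* else: the call below computes a value that is neither returned nor
       otherwise used, so -O3 may delete it, leaving a function that does
       almost no work -- hence the suspiciously fast "recursive" timing */
    descgrad(xnew, xnew - eps*f(xnew), eps, precision);
}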

For starters, you were including a printf in the time you were trying to measure. That's always a giant no-no, because it can, and most likely will, suspend your process while doing the console output. Actually, doing ANY system call can completely throw off time measurements like these.

And secondly, as someone else mentioned, on this short of a sampling period, scheduler interrupts can have a huge impact.

This is not perfect, but try this for your main and you'll see there's actually very little difference. As you increase the loop count, the ratio approaches 1.0.

#define LOOPCOUNT 100000
int main() 
{
    struct timespec s1, e1, s2, e2;
    int i;
    clock_gettime(CLOCK_MONOTONIC, &s1);
    for(i=0; i<LOOPCOUNT; i++)
    {
      descgraditer(100,99,0.01,0.00001);
    }
    clock_gettime(CLOCK_MONOTONIC, &e1);

    clock_gettime(CLOCK_MONOTONIC, &s2);
    for(i=0; i<LOOPCOUNT; i++)
    {
      descgrad(100,99,0.01,0.00001);
    }
    clock_gettime(CLOCK_MONOTONIC, &e2);

    uint64_t dif1 = timespecDiff(&e1,&s1) / 1000;
    uint64_t dif2 = timespecDiff(&e2,&s2) / 1000;

    printf("time_iter:%llu ms, time_rec:%llu ms, ratio (dif1/dif2) :%g\n", dif1,dif2, ((double) ((double)dif1/(double)dif2)));

    printf("End. \n");

}

EDIT: After looking at the disassembled output using objdump -dS, I noticed a few things:
With -O3 optimization, the above code optimizes the function call away completely. However, it still produces code for both functions, and neither is actually recursive.

Secondly, with -O0, so that the resulting code is actually recursive, the recursive version is literally a trillion times slower. My guess is that the call stack forces variables to end up in memory, whereas the iterative version works entirely out of registers and/or cache.
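
For intuition, -foptimize-sibling-calls turns the tail call into a backward jump, so the compiled recursive function has roughly the shape of this hand-written loop (a sketch of the transformation, not gcc's actual output):

double descgrad_as_loop(double xo, double xnew, double eps, double precision)
{
    double next;
    while (fabs(xnew - xo) >= precision) {
        /* the recursive tail call becomes: rebind the parameters and
           loop again; no new stack frame is pushed */
        next = xnew - eps * f(xnew);
        xo = xnew;
        xnew = next;
    }
    return xnew;
}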

The accepted answer is incorrect.

There IS a difference in the runtimes of the iterative function and the recursive function, and the reason is the compiler optimization -foptimize-sibling-calls added by -O3.

First, the code:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>
#include <stdint.h>

double descgrad(double xo, double xnew, double eps, double precision){
        if (fabs(xnew - xo) <= precision) {
                return xnew;
        } else {
                return descgrad(xnew, xnew - eps*2*xnew, eps, precision);
        }
}

double descgraditer(double xo, double xnew, double eps, double precision){
        double Xo = xo;
        double Xn = xnew;

        while(fabs(Xn-Xo) > precision){
                Xo = Xn;
                Xn = Xo - eps * 2*Xo;
        }
        return Xn;
}

int main() {
        time_t s1, e1, d1, s2, e2, d2;
        int i, iter = 10000000;
        double a1, a2;

        s1 = time(NULL);
        for( i = 0; i < iter; i++ ){
            a1 = descgraditer(100,99,0.01,0.00001);
        }
        e1 = time(NULL);
        d1 = difftime( e1, s1 );

        s2 = time(NULL);
        for( i = 0; i < iter; i++ ){
            a2 = descgrad(100,99,0.01,0.00001);
        }
        e2 = time(NULL);
        d2 = difftime( e2, s2 );

    printf( "time_iter: %d s, time_rec: %d s, ratio (iter/rec): %f\n", d1, d2, (double)d1 / d2 ) ;
    printf( "return values: %f, %f\n", a1, a2 );
}

Previous posts were correct in pointing out that you need to iterate many times in order to average away environment interference. Given that, I discarded your differencing function in favor of time.h's difftime function on time_t data, since over many iterations anything finer than a second is meaningless. In addition, I removed the printfs from the benchmark.

I also fixed a bug in the recursive implementation. Your original code's if-statement checked fabs(xnew-xo) < precision, which is incorrect (or at least different from the iterative implementation). The iterative version loops while fabs() > precision, therefore the recursive function should not recurse when fabs <= precision. Adding 'iteration' counters to both functions confirms that this fix makes the two functions logically equivalent.

Compiling and running with -O3:

$ gcc test.c -O3 -lrt -o dg
$ ./dg
time_iter: 34 s, time_rec: 0 s, ratio (iter/rec): inf
return values: 0.000487, 0.000487

Compiling and running without -O3:

$ gcc test.c -lrt -o dg
$ ./dg
time_iter: 54 s, time_rec: 90 s, ratio (iter/rec): 0.600000
return values: 0.000487, 0.000487

Under no optimization, iteration performs BETTER than recursion.

Under -O3 optimization, however, recursion runs ten million iterations in under a second. The reason is that -O3 adds -foptimize-sibling-calls, which optimizes sibling and tail recursive calls, and that is exactly what your recursive function takes advantage of.

To be sure, I ran it with all the -O3 optimizations EXCEPT -foptimize-sibling-calls:

$ gcc test.c -lrt -o dg  -fcprop-registers  -fdefer-pop -fdelayed-branch  -fguess-branch-probability -fif-conversion2 -fif-conversion -fipa-pure-const -fipa-reference -fmerge-constants   -ftree-ccp -ftree-ch -ftree-copyrename -ftree-dce -ftree-dominator-opts -ftree-dse -ftree-fre -ftree-sra -ftree-ter -funit-at-a-time -fthread-jumps -falign-functions  -falign-jumps -falign-loops  -falign-labels -fcaller-saves -fcrossjumping -fcse-follow-jumps  -fcse-skip-blocks -fdelete-null-pointer-checks -fexpensive-optimizations -fgcse  -fgcse-lm  -fpeephole2 -fregmove -freorder-blocks  -freorder-functions -frerun-cse-after-loop  -fsched-interblock  -fsched-spec -fschedule-insns  -fschedule-insns2 -fstrict-aliasing  -ftree-pre -ftree-vrp -finline-functions -funswitch-loops  -fgcse-after-reload -ftree-vectorize
$ ./dg
time_iter: 55 s, time_rec: 89 s, ratio (iter/rec): 0.617978
return values: 0.000487, 0.000487

Recursion, without the tail-call optimization, performs worse than iteration, just as when compiled with no optimization. Read about compiler optimizations here.
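
Incidentally, the same experiment can be run without spelling out every other pass by hand, since gcc accepts a negated form of the flag:

$ gcc test.c -O3 -fno-optimize-sibling-calls -lrt -o dg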

EDIT:

As a verification of correctness, I updated my code to include the return values. Also, I set two static variables to 0 and incremented each on recursion and iteration respectively, to verify the output is correct:

int a = 0;
int b = 0;

double descgrad(double xo, double xnew, double eps, double precision){
        if (fabs(xnew - xo) <= precision) {
                return xnew;
        } else {
                a++;
                return descgrad(xnew, xnew - eps*2*xnew, eps, precision);
        }
}

double descgraditer(double xo, double xnew, double eps, double precision){
        double Xo = xo;
        double Xn = xnew;

        while(fabs(Xn-Xo) > precision){
                b++;
                Xo = Xn;
                Xn = Xo - eps * 2*Xo;
        }
        return Xn;
}

int main() {
    time_t s1, e1, d1, s2, e2, d2;
    int i, iter = 10000000;
    double a1, a2;

    s1 = time(NULL);
    for( i = 0; i < iter; i++ ){
        a1 = descgraditer(100,99,0.01,0.00001);
    }
    e1 = time(NULL);
    d1 = difftime( e1, s1 );

    s2 = time(NULL);
    for( i = 0; i < iter; i++ ){
        a2 = descgrad(100,99,0.01,0.00001);
    }
    e2 = time(NULL);
    d2 = difftime( e2, s2 );

    printf( "time_iter: %d s, time_rec: %d s, ratio (iter/rec): %f\n", d1, d2, (double)d1 / d2 ) ;
    printf( "return values: %f, %f\n", a1, a2 );
    printf( "number of recurs/iters: %d, %d\n", a, b );
}

The output:

$ gcc optimization.c -O3 -lrt -o dg
$ ./dg
time_iter: 41 s, time_rec: 24 s, ratio (iter/rec): 1.708333
return values: 0.000487, 0.000487
number of recurs/iters: 1755032704, 1755032704

The answers are the same, and the repetition counts are the same.

Also interesting to note: the static-variable fetching/incrementing has a considerable impact on the tail-call optimization; however, recursion still outperforms iteration.

First, clock_gettime seems to be measuring wall clock time, not execution time. Second, the actual time you're measuring is the execution time of printf, not the execution time of your function. And third, the first time you call printf, it isn't in memory, so it has to be paged in, involving significant disk IO. Invert the order in which you run the tests, and the results will invert as well.

If you want to get meaningful measurements, you have to make sure that (a minimal harness following these rules is sketched after the list):

  1. only the code you want to measure is inside the measurement loop, or at least the additional code is minimal compared to what you're measuring,
  2. you do something with the results, so that the compiler can't optimize all of the code away (not a problem in your tests),
  3. you execute the code to be measured a large number of times and take the average,
  4. you measure CPU time, not wall clock time, and
  5. you make sure everything is paged in before starting the measurements.
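
A minimal harness following these rules might look like this (LOOPS, the volatile sink and the warm-up call are illustrative choices, not from the original code):

#include <stdio.h>
#include <time.h>

#define LOOPS 1000000

double descgraditer(double xo, double xnew, double eps, double precision);

int main(void)
{
    struct timespec s, e;
    volatile double sink = 0.0; /* rule 2: consume every result */
    double ns;
    int i;

    sink += descgraditer(100, 99, 0.01, 0.00001); /* rule 5: warm up, page everything in */

    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &s); /* rule 4: CPU time, not wall clock */
    for (i = 0; i < LOOPS; i++) /* rule 3: many runs, take the average */
        sink += descgraditer(100, 99, 0.01, 0.00001); /* rule 1: nothing else in the loop */
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &e);

    ns = (e.tv_sec - s.tv_sec) * 1e9 + (e.tv_nsec - s.tv_nsec);
    printf("avg: %.1f ns/call\n", ns / LOOPS);
    return 0;
}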

In many cases on modern hardware, cache misses are the limiting factor of performance for small loop constructs. A recursive implementation is less likely to create cache misses on the instruction path.
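
One way to test that hypothesis on Linux would be perf (exact event names vary by CPU and kernel version):

$ perf stat -e instructions,cycles,L1-icache-load-misses ./dg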
