Why is my program so slow?

Someone decided to do a quick test to see how Native Client compares to JavaScript in terms of speed. They did that by running 10,000,000 sqrt calculations and measuring the time it took. The result with JavaScript: 0.096 seconds, and with NaCl: 4.241 seconds... How can that be? Isn't speed one of the reasons to use NaCl in the first place? Or am I missing some compiler flags or something?

Here's the code that was run:

clock_t t = clock();
float result = 0;
for(int i = 0; i < 10000000; ++i) {
    result += sqrt(i);
}
t = clock() - t;      
float tt = ((float)t)/CLOCKS_PER_SEC;
pp::Var var_reply = pp::Var(tt);
PostMessage(var_reply);

PS: This question is an edited version of something that appeared in the Native Client mailing list.

NOTE: This answer is an edited version of something that appeared in the Native Client mailing list.

Microbenchmarks are tricky: unless you understand VERY well what you are doing, it's easy to produce apples-to-oranges comparisons that have nothing to do with the behavior you want to observe or measure.

I'll elaborate a bit using your own example (I'll leave NaCl out and stick to existing, "tried and true" technologies).

Here is your test as a native C program:

$ cat test1.c
#include <math.h>
#include <time.h>
#include <stdio.h>

int main() {
  clock_t t = clock();
  float result = 0;
  for(int i = 0; i < 1000000000; ++i) {
      result += sqrt(i);
  }
  t = clock() - t;
  float tt = ((float)t)/CLOCKS_PER_SEC;
  printf("%g %g\n", result, tt);

}
$ gcc -std=c99 -O2 test1.c -lm -o test1
$ ./test1
5.49756e+11 25.43

OK. We can do a billion iterations in 25.43 seconds. But let's see what takes the time: let's replace "result += sqrt(i);" with "result += i;"

$ cat test2.c
#include <math.h>
#include <time.h>
#include <stdio.h>

int main() {
  clock_t t = clock();
  float result = 0;
  for(int i = 0; i < 1000000000; ++i) {
      result += i;
  }
  t = clock() - t;
  float tt = ((float)t)/CLOCKS_PER_SEC;
  printf("%g %g\n", result, tt);
}
$ gcc -std=c99 -O2 test2.c -lm -o test2
$ ./test2
1.80144e+16 1.21

Wow! About 95% of the time was actually spent in the CPU-provided sqrt function; everything else took less than 5%. But what if we change the code just a bit and replace "printf("%g %g\n", result, tt);" with "printf("%g\n", tt);"?

$ cat test3.c
#include <math.h>
#include <time.h>
#include <stdio.h>

int main() {
  clock_t t = clock();
  float result = 0;
  for(int i = 0; i < 1000000000; ++i) {
      result += sqrt(i);
  }
  t = clock() - t;
  float tt = ((float)t)/CLOCKS_PER_SEC;
  printf("%g\n", tt);
}
$ gcc -std=c99 -O2 test3.c -lm -o test3
$ ./test3
1.44

Hmm... It looks like "sqrt" is now almost as fast as "+". How can this be? How can printf affect the preceding loop AT ALL?

Let's see:

$ gcc -std=c99 -O2 test1.c -S -o -
...
.L3:
        cvtsi2sd        %ebp, %xmm1
        sqrtsd  %xmm1, %xmm0
        ucomisd %xmm0, %xmm0
        jp      .L7
        je      .L2
.L7:
        movapd  %xmm1, %xmm0
        movss   %xmm2, (%rsp)
        call    sqrt
        movss   (%rsp), %xmm2
.L2:
        unpcklps        %xmm2, %xmm2
        addl    $1, %ebp
        cmpl    $1000000000, %ebp
        cvtps2pd        %xmm2, %xmm2
        addsd   %xmm0, %xmm2
        unpcklpd        %xmm2, %xmm2
        cvtpd2ps        %xmm2, %xmm2
        jne     .L3
 ...
$ gcc -std=c99 -O2 test3.c -S -o -
...
        xorpd   %xmm1, %xmm1
...
.L5:
        cvtsi2sd        %ebp, %xmm0
        ucomisd %xmm0, %xmm1
        ja      .L14
.L10:
        addl    $1, %ebp
        cmpl    $1000000000, %ebp
        jne     .L5
...
.L14:
        sqrtsd  %xmm0, %xmm2
        ucomisd %xmm2, %xmm2
        jp      .L12
        .p2align 4,,2
        je      .L10
.L12:
        movsd   %xmm1, (%rsp)
        .p2align 4,,5
        call    sqrt
        movsd   (%rsp), %xmm1
        .p2align 4,,4
        jmp     .L10
...

The first version actually computes sqrt a billion times, but the second one does not do that at all! Instead, it checks whether the number is negative and computes sqrt only in that case! Why? What is the compiler (or, rather, what are the compiler authors) trying to do here?

Well, it's simple: since "result" is not used in this particular version, the compiler can safely omit the "sqrt" call... as long as the value is not negative, that is! If it is negative then (depending on FPU flags) sqrt can do different things (return a nonsensical result, crash the program, etc.), so that case has to be preserved. That's why this version is over a dozen times faster, but it does not calculate square roots at all!

Here is a final example that shows how wrong microbenchmarks can go:

$ cat test4.c
#include <math.h>
#include <time.h>
#include <stdio.h>

int main() {
  clock_t t = clock();
  int result = 0;
  for(int i = 0; i < 1000000000; ++i) {
      result += 2;
  }
  t = clock() - t;
  float tt = ((float)t)/CLOCKS_PER_SEC;
  printf("%d %g\n", result, tt);
}
$ gcc -std=c99 -O2 test4.c -lm -o test4
$ ./test4
2000000000 0

Execution time is... ZERO? How can that be? A billion calculations in less than the blink of an eye? Let's see:

$ gcc -std=c99 -O2 test4.c -S -o -
...
        call    clock
        movq    %rax, %rbx
        call    clock
        subq    %rbx, %rax
        movl    $2000000000, %edx
        movl    $.LC1, %esi
        cvtsi2ssq       %rax, %xmm0
        movl    $1, %edi
        movl    $1, %eax
        divss   .LC0(%rip), %xmm0
        unpcklps        %xmm0, %xmm0
        cvtps2pd        %xmm0, %xmm0
...

Uh-oh, the loop is completely eliminated! All the calculations happened at compile time and, to add insult to injury, both "clock" calls were executed before the body of the loop to boot!

What if we put it in a separate function?

$ cat test5.c
#include <math.h>
#include <time.h>
#include <stdio.h>

int testfunc(int num, int max) {
  int result = 0;
  for(int i = 0; i < max; ++i) {
      result += num;
  }
  return result;
}

int main() {
  clock_t t = clock();
  int result = testfunc(2, 1000000000);
  t = clock() - t;
  float tt = ((float)t)/CLOCKS_PER_SEC;
  printf("%d %g\n", result, tt);
}
$ gcc -std=c99 -O2 test5.c -lm -o test5
$ ./test5
2000000000 0

Still the same??? How can this be?

$ gcc -std=c99 -O2 test5.c -S -o -
...
.globl testfunc
        .type   testfunc, @function
testfunc:
.LFB16:
        .cfi_startproc
        xorl    %eax, %eax
        testl   %esi, %esi
        jle     .L3
        movl    %esi, %eax
        imull   %edi, %eax
.L3:
        rep
        ret
        .cfi_endproc
...

Uh-oh: the compiler is clever enough to replace the loop with a multiplication!
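If the goal really is to time the loop itself, one option is to declare the accumulator `volatile`, which forces every read and write to happen. This is only a sketch under that assumption, and the name `add_loop_volatile` is hypothetical:

```c
/* Hypothetical variant of testfunc: each update of a volatile object is an
   observable access, so the loop cannot be folded into num * max. */
static int add_loop_volatile(int num, int max) {
    volatile int result = 0;
    for (int i = 0; i < max; ++i) {
        result += num;
    }
    return result;
}
```

Calling `add_loop_volatile(2, 1000000000)` between the two clock() calls in place of `testfunc` should give a non-zero time. Note that this changes what is being measured: the accumulator can no longer be kept purely in a register, so the figure says little about how fast plain integer addition is.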

Now if you add NaCl on one side and JavaScript on the other, you get such a complex system that the results are literally unpredictable.

The problem here is that in a microbenchmark you try to isolate a piece of code and then evaluate its properties, but the compiler (JIT or AOT, it doesn't matter) will thwart your efforts because it tries to remove all the useless calculations from your program!

Microbenchmarks are useful, sure, but they are a FORENSIC ANALYSIS tool, not something you want to use to compare the speed of two different systems! For that you need some "real" workload (in some sense of the word: something that cannot be optimized to pieces by an over-eager compiler); sorting algorithms are a popular choice.
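As a rough illustration of such a workload, here is a sketch that times a sort over deterministic pseudo-random data and then checks the output. All names (`cmp_int`, `next_rand`, `time_sort`) are hypothetical, and the standard library's qsort merely stands in for the algorithm; for a cross-language comparison you would implement the same algorithm by hand on each side.

```c
#include <stdlib.h>
#include <time.h>

/* Comparison callback for qsort: ascending order of ints. */
static int cmp_int(const void *a, const void *b) {
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Small deterministic generator (a linear congruential generator), so every
   implementation under test can be fed exactly the same input. */
static unsigned next_rand(unsigned *state) {
    *state = *state * 1664525u + 1013904223u;
    return *state;
}

/* Fills and sorts n ints. Returns the CPU seconds spent in qsort, or -1.0
   if allocation fails or the output turns out not to be sorted. */
static double time_sort(size_t n, unsigned seed) {
    int *data = malloc(n * sizeof *data);
    if (data == NULL) return -1.0;
    for (size_t i = 0; i < n; ++i)
        data[i] = (int)(next_rand(&seed) >> 1);  /* keep values non-negative */

    clock_t start = clock();
    qsort(data, n, sizeof *data, cmp_int);
    clock_t stop = clock();

    /* Checking the output validates the run and keeps the result in use. */
    double elapsed = (double)(stop - start) / CLOCKS_PER_SEC;
    for (size_t i = 1; i < n; ++i)
        if (data[i - 1] > data[i]) elapsed = -1.0;
    free(data);
    return elapsed;
}
```

Because the sorted array is inspected afterwards, the compiler cannot treat the sort as a useless calculation, which is exactly the property the sqrt loop lacked.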

Benchmarks that use sqrt are especially nasty because, as we've seen, they usually spend over 90% of their time executing one single CPU instruction: sqrtsd (fsqrt in the 32-bit version), which is, of course, identical for JavaScript and NaCl. These benchmarks (if properly implemented) may serve as a litmus test (if the speed of some implementation differs too much from what a simple native version exhibits, then you are doing something wrong), but they are useless as a comparison of the speeds of NaCl, JavaScript, C# or Visual Basic.
