Haskell performance: Struggling with utilizing profiling results and basic tuning techniques (eliminating explicit recursion, etc.)

Question

I took a bit of a long break from playing with Haskell, and I'm starting to get back into it. I'm definitely still learning my way around the language. I've realized that one of the things that has always made me nervous/uncomfortable when writing Haskell is that I don't have a strong grasp on how to craft algorithms that are both idiomatic and performant. I realize that "premature optimization is the root of all evil", but similarly, slow code will have to be dealt with eventually, and I just can't get rid of my preconceived notion that languages this high-level are super slow.

So, in that vein, I started playing with test cases. One of them that I was working on was a naïve, straightforward implementation of the classical 4th-order Runge-Kutta method, applied to the fairly trivial IVP dy/dt = -y; y(0) = 1, which gives y = e^-t. I wrote a completely straightforward implementation in both Haskell and C (which I'll post in a moment). The Haskell version was incredibly succinct and gave me warm fuzzies on the inside when I looked at it, but the C version (which actually wasn't horrible to parse at all) was over twice as fast.

I realize that it isn't 100% fair to compare the performance of 2 different languages, and that until the day we all die C will most likely always hold the crown as the king of performance, especially hand-optimized C code. I'm not trying to get my Haskell implementation to run just as fast as my C implementation. But I'm pretty certain that if I were more cognizant of what I was doing, then I could eke a bit more speed out of this particular Haskell implementation.

The Haskell version was compiled with -O2 under GHC 7.6.3 on OS X 10.8.4; the C version was compiled with Clang and I gave it no flags. The Haskell version averaged around 0.016 seconds when tracked with time, and the C version around 0.006 seconds.

These timings take into account the entire running time of the binary, including output to stdout, which obviously accounts for some of the overhead. I did do some profiling on the GHC binary by recompiling with -prof -auto-all and running with +RTS -p, and also by looking at the GC stats with +RTS -s. I didn't really understand all that much of what I saw, but it seemed that my GC wasn't out of control, though it could probably get reined in a little bit (5%, Productivity at ~93% User, ~85% total elapsed), and that most of the productive time was spent in the function iterateRK, which I knew would be slow when I wrote it, but it wasn't immediately obvious to me how to go about cleaning it up. I realize that I'm probably incurring a penalty in my usage of a List, both in the constant consing and in the laziness when dumping the results to stdout.

What exactly am I doing wrong? What library functions or Monadic wizardry am I tragically unaware of that I could be using to clean up iterateRK? What are some good resources for learning how to be a GHC profiling rockstar?

RK.hs

rk4 :: (Double -> Double -> Double) -> Double -> Double -> Double -> Double
rk4 y' h t y = y + (h/6) * (k1 + 2*k2 + 2*k3 + k4)
  where k1 = y' t y
        k2 = y' (t + h/2) (y + ((h/2) * k1))
        k3 = y' (t + h/2) (y + ((h/2) * k2))
        k4 = y' (t + h) (y + (h * k3))

iterateRK y' h t0 y0 = y0:(iterateRK y' h t1 y1)
  where t1 = t0 + h
        y1 = rk4 y' h t0 y0

main = do
  let y' t y = -y
  let h = 1e-3
  let y0 = 1.0
  let t0 = 0
  let results = iterateRK y' h t0 y0
  (putStrLn . show) (take 1000 results)

RK.c

#include<stdio.h>

#define ITERATIONS 1000

double rk4(double f(double t, double x), double h, double tn, double yn)
{
  double k1, k2, k3, k4;

  k1 = f(tn, yn);
  k2 = f((tn + h/2), yn + (h/2 * k1));
  k3 = f((tn + h/2), yn + (h/2 * k2));
  k4 = f(tn + h, yn + h * k3);

  return yn + (h/6) * (k1 + 2*k2 + 2*k3 + k4);
}

double expDot(double t, double x)
{
  return -x;
}

int main()
{
  double t0, y0, tn, yn, h, results[ITERATIONS];
  int i;

  h = 1e-3;
  y0 = 1.0;
  t0 = 0.0;
  yn = y0;
  tn = t0;  /* start time; tn was otherwise read uninitialized */

  for(i = 0; i < ITERATIONS; i++)
  {
    results[i] = yn;

    yn = rk4(expDot, h, tn, yn);
    tn += h;
  }

  for(i = 0; i < ITERATIONS; i++)
  {
    printf("%.10lf", results[i]);
    if(i != ITERATIONS - 1)
      printf(", ");
    else
      printf("\n");
  }

  return 0;
}

Answer 1

Using your program with increased size gives a stack overflow:

Stack space overflow: current size 8388608 bytes.
Use `+RTS -Ksize -RTS' to increase it.

This is probably caused by too much laziness. Looking at the heap profile broken down by type, you get the following:

[Image: heap profile broken down by type]

(Note: I modified your program as leftaroundabout pointed out)

This doesn't look good. You shouldn't require linear space for your algorithm. You seem to be holding your Double values longer than required. Making them strict solves the issue:

{-# LANGUAGE BangPatterns #-}

iterateRK :: (Double -> Double -> Double) -> Double -> Double -> Double -> [Double]
iterateRK y' !h !t0 !y0 = y0:(iterateRK y' h t1 y1)
  where t1 = t0 + h
        y1 = rk4 y' h t0 y0

With this modification, the new heap profile looks like this:

[Image: new heap profile]

This looks much better; the memory usage is much lower. Running with +RTS -sstderr also confirms that we only spend 2.5% of the total time in the garbage collector after the modification:

%GC     time       2.5%  (2.9% elapsed)

Now, the Haskell version is still about 40% slower than the C one (using user time):

$ time ./tesths; time ./testc     
2.47e-321
./tesths  0,73s user 0,01s system 86% cpu 0,853 total
2.470328e-321
./testc  0,51s user 0,01s system 95% cpu 0,549 total

Increasing the number of iterations and using a heap-allocated array for the result storage in C lowers the difference once more:

$ time ./tesths; time ./testc
2.47e-321
./tesths  18,25s user 0,04s system 96% cpu 19,025 total
2.470328e-321
./testc  16,98s user 0,14s system 98% cpu 17,458 total

This is only a difference of about 9%.

But we can still do better. Using the stream-fusion package, we can eliminate the list completely while still keeping the decoupling. Here is the full code with that optimization included:

{-# LANGUAGE BangPatterns #-}
import qualified Data.List.Stream as S

rk4 :: (Double -> Double -> Double) -> Double -> Double -> Double -> Double
rk4 y' !h !t !y = y + (h/6) * (k1 + 2*k2 + 2*k3 + k4)
  where k1 = y' t y
        k2 = y' (t + h/2) (y + ((h/2) * k1))
        k3 = y' (t + h/2) (y + ((h/2) * k2))
        k4 = y' (t + h) (y + (h * k3))

iterateRK :: (Double -> Double -> Double) -> Double -> Double -> Double -> [Double]
iterateRK y' h = curry $ S.unfoldr $ \(!t0, !y0) -> Just (y0, (t0 + h, rk4 y' h t0 y0))

main :: IO ()
main = do
  let y' t y = -y
  let h = 1e-3
  let y0 = 1.0
  let t0 = 0
  let results = iterateRK y' h t0 y0
  print $ S.head $ (S.drop (pred 10000000) results)

I compiled with:

$ ghc -O2 ./test.hs -o tesths -fllvm

Here are the timings:

$ time ./tesths; time ./testc                    
2.47e-321
./tesths  15,85s user 0,02s system 97% cpu 16,200 total
2.470328e-321
./testc  16,97s user 0,18s system 97% cpu 17,538 total

Now we're even a bit faster than C, because we do no allocations. To do a similar transformation to the C program, we have to merge the two loops into one and lose the nice abstraction. Even then, it's only as fast as Haskell:

$ time ./tesths; time ./testc
2.47e-321
./tesths  15,86s user 0,01s system 98% cpu 16,141 total
2.470328e-321
./testc  15,88s user 0,02s system 98% cpu 16,175 total

Answer 2

I think that in order to make a fair comparison, you should exclude program initialization as well as printing the output (or measure it separately). By default, Haskell uses Strings, which are lists of Chars, and this makes output quite slow. Also, Haskell has a complex runtime whose initialization can bias the results a lot for such a short task. You can use the criterion library for that:

import Criterion.Main

-- ...

benchmarkIRK n =
    let y' t y = -y
        h      = 1e-3
        y0     = 1.0
        t0     = 0
    in take n (iterateRK y' h t0 y0)

benchmarkIRKPrint = writeFile "/dev/null" . show . benchmarkIRK

main = defaultMain
        [ bench "rk"      $ nf benchmarkIRK 1000
        , bench "rkPrint" $ nfIO (benchmarkIRKPrint 1000)
        ]

My measurements show that the actual computation takes around 27 us, computing and printing takes around 350 us, and running the whole program (without criterion) takes around 30 ms. So the actual computation accounts for just about 1/1000 of the whole running time, and computing plus printing for about 1/100.

You should also measure your C program similarly, excluding any startup time and distinguishing what portion of time is consumed by computing and printing.

Answer 3

The timings of your programs have very little to do with the languages' performance, and everything to do with terminal IO. Remove the printing of each step (BTW, putStrLn . show ≡ print) from your Haskell program, and you'll get

$ time RK-hs
1.0

real 0m0.004s
user 0m0.000s
sys 0m0.000s

... which isn't really significant, though: 1000 steps is far too little. With

main :: IO ()
main = do
    let y' t y = -y
        h = 1e-7
        y0 = 1.0
        t0 = 0
        results = iterateRK y' h t0 y0
    print . head $ drop 10000000 results

you get

$ time RK-hs +RTS -K100M
0.36787944117145965

real 0m0.653s
user 0m0.572s
sys 0m0.076s

while the equivalent in C has

$ time RK-c
Segmentation fault (core dumped)

Oh great... but as you see, I had to increase the stack size for the Haskell program as well. Omitting the storage of the results in a stack-allocated array, we have

$ time RK-c
0.3678794412

real 0m0.152s
user 0m0.148s
sys 0m0.000s

So this is indeed faster than the Haskell version, and now significantly so.

When even C has memory problems storing a large number of intermediate results (if you put them on the stack), the situation is worse in Haskell: each list node has to be heap-allocated separately, and while allocation is much faster in Haskell's garbage-collected heap than in C's heap, it's still slow.

3 answers

Answer 1: score 13, accepted, 2013-09-02 18:51:27

Answer 2: score 5, 2013-09-02 18:18:04

Answer 3: score 1, 2013-09-02 18:06:17
