为什么Haskell在执行类似C的代码时表现如此糟糕？（至少在这种情况下）

Question

I am trying to figure out some performance issues I have been having with Haskell. 我试图弄清楚我在Haskell中遇到的一些性能问题。 As part of that, I have written a small comparison program to compare C and Haskell. 作为其中的一部分，我编写了一个小比较程序来比较C和Haskell。 Specifically, I translated the C program to Haskell with as few changes as I could. 具体来说，我尽可能少地将C程序翻译成Haskell。 The speed-measured part of the Haskell program is then written in a very imperative style. 然后，Haskell程序的速度测量部分以非常强制的方式编写。

The program makes two list of random numbers in some range, and then calculates the integral of the graph formed by simply connecting those points, with one list being x-values and one list being y-values. 程序在一定范围内生成两个随机数列表，然后通过简单地连接这些点来计算形成的图的积分，其中一个列表是x值，一个列表是y值。 Essentially, it is the trapezoidal rule . 从本质上讲，它是梯形规则。

Here are the two codes: 以下是两个代码：

main.c main.c中

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 5000000
#define maxY 1e5f/N
#define maxXgap 1

int main(){
    int i;
    float *y, *x;
    float xaccum, area;
    clock_t begin, end;
    double time_spent;

    y = (float*)malloc(sizeof(float)*N);
    x = (float*)malloc(sizeof(float)*N);

    srand(50546345); // change seed for different numbers

    //populate y and x fields with random points
    for(i = 0; i < N; i++){
        y[i] = ((float)rand())/((float)RAND_MAX)*maxY;
    }
    xaccum = 0;
    for(i = 0; i < N; i++){
        x[i] = xaccum;
        xaccum += ((float)rand())/((float)RAND_MAX)*maxXgap;
    }
    begin = clock();
    //perform a trapezoidal integration using the x y coordinates
    area = 0;
    for(i = 0; i < N-1; i++){
        area += (y[i+1]+y[i])/2*(x[i+1]-x[i]);
    }
    end = clock();
    time_spent = (double)(end - begin) / CLOCKS_PER_SEC * 1000;
    printf("%i points\n%f area\n%f ms\n", N, area, time_spent);
}

Main.hs Main.hs

{-# LANGUAGE BangPatterns #-}
module Main where

import Data.Array.Unboxed
import Data.Array.IO
import Data.List
import System.Random
import System.CPUTime
import Text.Printf
import Control.Exception

main :: IO ()
main = do
          (x,y) <- initArrays
          area <- time $ integrate x y
          print area

n :: Int
n = 5000000

maxY :: Float
maxY = 100000.0/(fromIntegral n)

maxXgap :: Float
maxXgap = 1

--initialize arrays with random floats
--this part is not measured in the running time (very slow)
initArrays :: IO (IOUArray Int Float, IOUArray Int Float)
initArrays = do
                y <- newListArray (0,n-1) (randomList maxY n (mkStdGen 23432))
                x <- newListArray (0,n-1) (scanl1 (+) $ randomList maxXgap n (mkStdGen 5462))
                return (x,y)

randomList :: Float -> Int -> StdGen -> [Float]
randomList max n gen = map (abs . ((*) max)) (take n . unfoldr (Just . random) $ gen)

integrate :: IOUArray Int Float -> IOUArray Int Float -> IO Float
integrate x y = iterative x y 0 0

iterative :: IOUArray Int Float -> IOUArray Int Float -> Int -> Float -> IO Float
iterative x y !i !accum = do if i == n-1
                              then return accum
                              else do x1 <- readArray x i
                                      x2 <- readArray x (i+1)
                                      y1 <- readArray y i
                                      y2 <- readArray y (i+1)
                                      iterative x y (i+1) (accum + (y2+y1)/2*(x2-x1))

time :: IO t -> IO t
time a = do
            start <- getCPUTime
            v <- a
            end <- getCPUTime
            let diff = (fromIntegral (end-start)) / (10^9)
            printf "Computation time %0.5f ms\n" (diff :: Double)
            return v

The C integration runs in about 7 ms and the Haskell integration in about 60 ms on my system. C集成在大约7毫秒内运行，Haskell集成在我的系统上大约60毫秒。 Of course the Haskell version will be slower, but I am wondering why it is this much slower. 当然Haskell版本会慢一些，但我想知道为什么它会慢得多。 Obviously there is a lot of inefficiency in the Haskell code. 显然，Haskell代码中存在很多低效率。

Why is the Haskell code so much slower? 为什么Haskell代码这么慢？ How could one fix it? 怎么能解决它？

Thanks for any answers. 谢谢你的回答。

Answer 1

Out of curiosity, I ran this with llvm: 出于好奇，我用llvm运行了这个：

ghc Test.hs -O2 -XBangPatterns -fllvm -optlo-O3 ghc Test.hs -O2 -XBangPatterns -fllvm -optlo-O3

and it took it down from 60ms to 24ms. 它从60ms降到了24ms。 Still not ideal. 仍然不理想。

So, one of the first things I'll do when I want to know why a benchmark like this is so slow, is dump the prepared core. 所以，当我想知道为什么像这样的基准这么慢时，我要做的第一件事就是抛弃准备好的核心。 That is, the core after optimisations. 也就是说，优化后的核心。

ghc Test.hs -O2 -ddump-prep -dsuppress-all -XBangPatterns > Test.hscore ghc Test.hs -O2 -ddump-prep -dsuppress-all -XBangPatterns> Test.hscore

After looking through the core, I eventually found $wa, where the iterative loop is defined. 在查看了核心之后，我最终找到了$ wa，其中定义了迭代循环。 It turns out it's making surprisingly many index bound checks. 事实证明，这令人惊讶地进行了许多索引限制检查。 See, I usually use Data.Vector.Unboxed, which has "unsafeRead" and "unsafeIndex" functions, to remove bounds checks. 请参阅，我通常使用具有“unsafeRead”和“unsafeIndex”函数的Data.Vector.Unboxed来删除边界检查。 These would be useful here. 这些在这里很有用。 Personally, I think the vector package is superior. 就个人而言，我认为矢量包是优越的。

If you look at $wa, you'll notice it's unboxing the arguments at the start: 如果你看看$ wa，你会注意到它在开始时将参数拆箱：

case w_s3o9 of _ { STUArray l_s3of u_s3oi ds1_s3ol ds2_s3oH ->
case l_s3of of wild2_s3os { I# m_s3oo ->
case u_s3oi of wild3_s3ot { I# n1_s3ov ->
case ds1_s3ol of wild4_s3oC { I# y1_s3oE ->

this looks bad, but it turns out in the recursive call it's using a specialised version, integrate_$s$wa, with unboxed integers etc. This is good. 这看起来很糟糕，但事实证明它在递归调用中使用的是专用版本，integration_ $ s $ wa，带有未装箱的整数等。这很好。

In summary, I think you should get a good improvement by using vector with unsafe indexing. 总之，我认为你应该通过使用带有不安全索引的向量来获得良好的改进。

Edit: here is a modified version with Data.Vector. 编辑：这是一个带Data.Vector的修改版本。 It runs in about 7ms. 它运行大约7毫秒。 For good vector code, I think the only slowness compared to C should be due to incomplete alias analysis. 对于良好的矢量代码，我认为与C相比唯一的缓慢应该是由于不完整的别名分析。 https://gist.github.com/amosr/6026995 https://gist.github.com/amosr/6026995

Answer 2

First, I tried your code to reproduce your findings (using GHC 7.6.3 -O2 -fllvm and gcc 4.7.2 and -O3) 首先，我尝试使用您的代码重现您的发现（使用GHC 7.6.3 -O2 -fllvm和gcc 4.7.2和-O3）

$ ./theHaskellVersion-rev1
Computation time 24.00000 ms
25008.195
[tommd@Vodka Test]$ ./theCVersion
5000000 points
25013.105469 area
10.000000 ms

So we are aiming for 10ms if the goal is to perform on par (a 60% reduction in runtime). 因此，如果目标是以平均值执行（运行时减少60％），我们的目标是10毫秒。 Looking at your code I see: 看看你的代码，我看到：

Array s are used, which are ancient and cludgy. Array ，这是古老和cludgy。 I switched to Vector . 我切换到了Vector 。
There is no worker/wrapper transformation on iterative . iterative上没有工人/包装器转换。 The change is just to make an auxillary function in a where clause that doesn't require x and y as parameters. 更改只是在不需要x和y作为参数的where子句中创建辅助函数。
Float is used even though Double often performs surprisingly better (this probably doesn't matter here). 尽管Double经常表现得更好（这可能与此无关），但仍使用Float 。

The end result is on-par with what you posted in C: 最终结果与您在C中发布的结果相符：

$ ghc -O2 so.hs -hide-package random && ./so
Computation time 11.00000 ms
24999.048783785303

为什么Haskell在执行类似C的代码时表现如此糟糕？（至少在这种情况下）

问题描述

2 个解决方案

解决方案1
11 已采纳 2013-07-18 05:46:27

解决方案2
7 2013-07-18 06:07:25

为什么Haskell在执行类似C的代码时表现如此糟糕？ （至少在这种情况下）

问题描述

2 个解决方案

解决方案1 11 已采纳 2013-07-18 05:46:27

解决方案2 7 2013-07-18 06:07:25

为什么Haskell在执行类似C的代码时表现如此糟糕？（至少在这种情况下）

解决方案1
11 已采纳 2013-07-18 05:46:27

解决方案2
7 2013-07-18 06:07:25