
What are possible Haskell optimization keys?

I found a benchmark that solves a really simple task in different languages: https://github.com/starius/lang-bench. Here's the code for Haskell:

cmpsum i j k =
    if i + j == k then 1 else 0

main = print (sum([cmpsum i j k |
    i <- [1..1000], j <- [1..1000], k <- [1..1000]]))

This code runs very slowly, as you can see in the benchmark, and I found this very strange. I tried to inline the function cmpsum and compile with the following flags:

ghc -c -O2 main.hs

but it really didn't help. I am not asking about optimizing the algorithm, since it is the same for all languages, but about possible compiler or code optimizations that can make this code run faster.

Not a complete answer, sorry. Compiling with GHC 7.10 on my machine I get ~12s for your version.

I'd suggest always compiling with -Wall, which shows us that our numbers are being defaulted to the infinite-precision Integer type. Fixing that:

module Main where

cmpsum :: Int -> Int -> Int -> Int
cmpsum i j k =
    if i + j == k then 1 else 0

main :: IO ()
main = print (sum([cmpsum i j k |
    i <- [1..1000], j <- [1..1000], k <- [1..1000]]))

This runs in ~5s for me. Running with +RTS -s seems to show we have a loop in constant memory:

          87,180 bytes allocated in the heap
           1,704 bytes copied during GC
          42,580 bytes maximum residency (1 sample(s))
          18,860 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0         0 colls,     0 par    0.000s   0.000s     0.0000s    0.0000s
  Gen  1         1 colls,     0 par    0.000s   0.000s     0.0001s    0.0001s

  INIT    time    0.000s  (  0.001s elapsed)
  MUT     time    4.920s  (  4.919s elapsed)
  GC      time    0.000s  (  0.000s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time    4.920s  (  4.921s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    17,719 bytes per MUT second

  Productivity 100.0% of total user, 100.0% of total elapsed

-fllvm shaves off another second or so. Maybe someone else can look into it further.

Edit: Just digging into this a little further. It doesn't look like fusion is happening, even if I change sum to foldr (+) 0, which is an explicit "good producer/good consumer" pair.
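For concreteness, here is a minimal standalone version of that variant (a sketch assembled from the earlier code, with sum replaced by foldr (+) 0):

module Main where

cmpsum :: Int -> Int -> Int -> Int
cmpsum i j k = if i + j == k then 1 else 0

main :: IO ()
main = print (foldr (+) 0 [cmpsum i j k |
    i <- [1..1000], j <- [1..1000], k <- [1..1000]])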

Rec {
$wgo [InlPrag=[0], Occ=LoopBreaker] :: Int# -> Int#
[GblId, Arity=1, Str=DmdType <S,U>]
$wgo =
  \ (w :: Int#) ->
    let {
      $j :: Int# -> Int#
      [LclId, Arity=1, Str=DmdType]
      $j =
        \ (ww [OS=OneShot] :: Int#) ->
          letrec {
            $wgo1 [InlPrag=[0], Occ=LoopBreaker] :: [Int] -> Int#
            [LclId, Arity=1, Str=DmdType <S,1*U>]
            $wgo1 =
              \ (w1 :: [Int]) ->
                case w1 of _ [Occ=Dead] {
                  [] -> ww;
                  : y ys ->
                    case $wgo1 ys of ww1 { __DEFAULT ->
                    case lvl of _ [Occ=Dead] {
                      [] -> ww1;
                      : y1 ys1 ->
                        case y of _ [Occ=Dead] { I# y2 ->
                        case y1 of _ [Occ=Dead] { I# y3 ->
                        case tagToEnum# @ Bool (==# (+# w y2) y3) of _ [Occ=Dead] {
                          False ->
                            letrec {
                              $wgo2 [InlPrag=[0], Occ=LoopBreaker] :: [Int] -> Int#
                              [LclId, Arity=1, Str=DmdType <S,1*U>]
                              $wgo2 =
                                \ (w2 :: [Int]) ->
                                  case w2 of _ [Occ=Dead] {
                                    [] -> ww1;
                                    : y4 ys2 ->
                                      case y4 of _ [Occ=Dead] { I# y5 ->
                                      case tagToEnum# @ Bool (==# (+# w y2) y5) of _ [Occ=Dead] {
                                        False -> $wgo2 ys2;
                                        True -> case $wgo2 ys2 of ww2 { __DEFAULT -> +# 1 ww2 }
                                      }
                                      }
                                  }; } in
                            $wgo2 ys1;
                          True ->
                            letrec {
                              $wgo2 [InlPrag=[0], Occ=LoopBreaker] :: [Int] -> Int#
                              [LclId, Arity=1, Str=DmdType <S,1*U>]
                              $wgo2 =
                                \ (w2 :: [Int]) ->
                                  case w2 of _ [Occ=Dead] {
                                    [] -> ww1;
                                    : y4 ys2 ->
                                      case y4 of _ [Occ=Dead] { I# y5 ->
                                      case tagToEnum# @ Bool (==# (+# w y2) y5) of _ [Occ=Dead] {
                                        False -> $wgo2 ys2;
                                        True -> case $wgo2 ys2 of ww2 { __DEFAULT -> +# 1 ww2 }
                                      }
                                      }
                                  }; } in
                            case $wgo2 ys1 of ww2 { __DEFAULT -> +# 1 ww2 }
                        }
                        }
                        }
                    }
                    }
                }; } in
          $wgo1 lvl } in
    case w of wild {
      __DEFAULT -> case $wgo (+# wild 1) of ww { __DEFAULT -> $j ww };
      1000 -> $j 0
    }
end Rec }

In fact, looking at the core for print $ foldr (+) (0 :: Int) $ [ i+j | i <- [0..10000], j <- [0..10000] ], it seems as though only the first layer of the list comprehension is fused. Is that a bug?
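For anyone who wants to reproduce that observation, here is a minimal module containing just that expression (the ghc invocation in the comment is a suggestion, not something from the original post):

module Main where

-- Compile with, e.g., ghc -O2 -ddump-simpl Main.hs and inspect the generated Core.
main :: IO ()
main = print $ foldr (+) (0 :: Int) [ i + j | i <- [0..10000], j <- [0..10000] ]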

This code gets the job done in 1 second with no extra allocation in GHC 7.10 with -O2 (see the bottom for profiling output):

cmpsum :: Int -> Int -> Int -> Int
cmpsum i j k = fromEnum (i+j==k)

main = print $ sum [cmpsum i j k | i <- [1..1000],
                                   j <- [1..const 1000 i],
                                   k <- [1..const 1000 j]]

In GHC 7.8, you can get almost the same results in this case (1.4 seconds) if you add the following at the beginning:

import Prelude hiding (sum)

-- A strict left fold written via foldr so that it can still take part in list fusion.
sum xs = foldr (\x r a -> a `seq` r (a+x)) id xs 0

There are three issues here:

  1. Specializing the code to Int instead of letting it default to Integer is crucial.

  2. GHC 7.10 offers list fusion for sum that GHC 7.8 does not. This is because the new definition of sum, based on a new definition of foldl, can be very bad in some cases without the "call arity" analysis Joachim Breitner created for GHC 7.10.

  3. GHC performs a limited "full laziness" pass very early in compilation, before any inlining occurs. As a result, the constant [1..1000] terms for j and k, which are used multiple times in the loop, get hoisted out of the loop (a rough source-level picture of this is sketched after this list). This would be good if they were actually expensive to calculate, but in this context it's much cheaper to do the additions over and over again than to save the results. What the code above does is trick GHC. Since const isn't inlined until a little bit later, this first full-laziness pass doesn't see that the lists are constant, so it doesn't hoist them out. I wrote it this way because it's nice and short, but it is, admittedly, a little on the fragile side. To make it more robust, use phased inlining:

     main = print $ sum [cmpsum i j k | i <- [1..1000],
                                        j <- [1..konst 1000 i],
                                        k <- [1..konst 1000 j]]

     {-# INLINE [1] konst #-}
     konst = const

    This guarantees that konst will be inlined in simplifier phase 1, but no earlier. Phase 1 occurs after list fusion is complete, so it's perfectly safe to let GHC see everything then.
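For item 3, here is a rough source-level picture of what that early hoisting amounts to (an illustration only, not actual GHC output; the top-level names js and ks are made up):

cmpsum :: Int -> Int -> Int -> Int
cmpsum i j k = fromEnum (i + j == k)

-- The inner [1..1000] ranges for j and k are floated out and shared across
-- iterations; as explained above, recomputing them would actually be cheaper here.
js, ks :: [Int]
js = [1..1000]
ks = [1..1000]

main :: IO ()
main = print $ sum [cmpsum i j k | i <- [1..1000], j <- js, k <- ks]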

          51,472 bytes allocated in the heap
           3,408 bytes copied during GC
          44,312 bytes maximum residency (1 sample(s))
          17,128 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0         0 colls,     0 par    0.000s   0.000s     0.0000s    0.0000s
  Gen  1         1 colls,     0 par    0.000s   0.000s     0.0002s    0.0002s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    1.071s  (  1.076s elapsed)
  GC      time    0.000s  (  0.000s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time    1.073s  (  1.077s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    48,059 bytes per MUT second

  Productivity  99.9% of total user, 99.6% of total elapsed

You are comparing looping over a single statement to counting by generating an intermediate structure (a list) and folding over it. I don't know how great the performance in Java would be if you created a linked list with a billion elements and iterated over it.

Here is Haskell code which is (approximately) equivalent to your Java code.

{-# LANGUAGE BangPatterns #-}

main = print (loop3 1 1 1 0) 

loop1 :: Int -> Int -> Int -> Int -> Int
loop1 !i !j !k !cc | k <= 1000 = loop1 i j (k+1) (cc + fromEnum (i + j == k))
                   | otherwise = cc 

loop2 :: Int -> Int -> Int -> Int -> Int
loop2 !i !j !k !cc | j <= 1000 = loop2 i (j+1) k (loop1 i j k cc)
                   | otherwise = cc 

loop3 :: Int -> Int -> Int -> Int -> Int
loop3 !i !j !k !cc | i <= 1000 = loop3 (i+1) j k (loop2 i j k cc)
                   | otherwise = cc 

And the execution on my machine (test2 is your Haskell code):

$ ghc --make -O2 test1.hs && ghc --make -O2 test2.hs && javac test3.java
$ time ./test1.exe && time ./test2.exe && time java test3
499500

real    0m1.614s
user    0m0.000s
sys     0m0.000s
499500

real    0m35.922s
user    0m0.000s
sys     0m0.000s
499500

real    0m1.589s
user    0m0.000s
sys     0m0.015s
