
What are possible Haskell optimization keys?

I found a benchmark that solves a really simple task in different languages: https://github.com/starius/lang-bench . Here's the code for Haskell:

cmpsum i j k =
    if i + j == k then 1 else 0

main = print (sum([cmpsum i j k |
    i <- [1..1000], j <- [1..1000], k <- [1..1000]]))

This code runs very slowly, as you can see in the benchmark, and I found this very strange. I tried to inline the function cmpsum and compile with the following flags:

ghc -c -O2 main.hs

but it really didn't help. I am not asking about optimizing the algorithm, because it's the same for all languages, but about possible compiler or code optimizations that can make this code run faster.

Not a complete answer, sorry. Compiling with GHC 7.10 on my machine I get ~12s for your version.

I'd suggest always compiling with -Wall, which here warns that our numbers are being defaulted to the arbitrary-precision Integer type. Fixing that:

module Main where

cmpsum :: Int -> Int -> Int -> Int
cmpsum i j k =
    if i + j == k then 1 else 0

main :: IO ()
main = print (sum([cmpsum i j k |
    i <- [1..1000], j <- [1..1000], k <- [1..1000]]))

This runs in ~5s for me. Running with +RTS -s seems to show we have a loop in constant memory:

          87,180 bytes allocated in the heap
           1,704 bytes copied during GC
          42,580 bytes maximum residency (1 sample(s))
          18,860 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0         0 colls,     0 par    0.000s   0.000s     0.0000s    0.0000s
  Gen  1         1 colls,     0 par    0.000s   0.000s     0.0001s    0.0001s

  INIT    time    0.000s  (  0.001s elapsed)
  MUT     time    4.920s  (  4.919s elapsed)
  GC      time    0.000s  (  0.000s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time    4.920s  (  4.921s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    17,719 bytes per MUT second

  Productivity 100.0% of total user, 100.0% of total elapsed

-fllvm shaves off another second or so. Maybe someone else can look into it further.
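
For reference, the commands I'm describing look roughly like this (assuming the module is in main.hs and an LLVM toolchain is installed for -fllvm; -rtsopts is needed so the program accepts +RTS -s):

$ ghc -O2 -fllvm -rtsopts main.hs
$ ./main +RTS -s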

Edit: Just digging into this a little further. It doesn't look like fusion is happening, even when I change sum to foldr (+) 0, which makes an explicit "good producer/good consumer" pair.
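
For concreteness, this is the variant I mean (a sketch; only sum is replaced by foldr (+) 0, everything else is as above):

cmpsum :: Int -> Int -> Int -> Int
cmpsum i j k =
    if i + j == k then 1 else 0

main :: IO ()
main = print (foldr (+) 0 [cmpsum i j k |
    i <- [1..1000], j <- [1..1000], k <- [1..1000]])

The simplified core below still traverses explicit lists for the inner generators (lvl appears to be the hoisted [1..1000]) rather than fusing them into plain counters: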

Rec {
$wgo [InlPrag=[0], Occ=LoopBreaker] :: Int# -> Int#
[GblId, Arity=1, Str=DmdType <S,U>]
$wgo =
  \ (w :: Int#) ->
    let {
      $j :: Int# -> Int#
      [LclId, Arity=1, Str=DmdType]
      $j =
        \ (ww [OS=OneShot] :: Int#) ->
          letrec {
            $wgo1 [InlPrag=[0], Occ=LoopBreaker] :: [Int] -> Int#
            [LclId, Arity=1, Str=DmdType <S,1*U>]
            $wgo1 =
              \ (w1 :: [Int]) ->
                case w1 of _ [Occ=Dead] {
                  [] -> ww;
                  : y ys ->
                    case $wgo1 ys of ww1 { __DEFAULT ->
                    case lvl of _ [Occ=Dead] {
                      [] -> ww1;
                      : y1 ys1 ->
                        case y of _ [Occ=Dead] { I# y2 ->
                        case y1 of _ [Occ=Dead] { I# y3 ->
                        case tagToEnum# @ Bool (==# (+# w y2) y3) of _ [Occ=Dead] {
                          False ->
                            letrec {
                              $wgo2 [InlPrag=[0], Occ=LoopBreaker] :: [Int] -> Int#
                              [LclId, Arity=1, Str=DmdType <S,1*U>]
                              $wgo2 =
                                \ (w2 :: [Int]) ->
                                  case w2 of _ [Occ=Dead] {
                                    [] -> ww1;
                                    : y4 ys2 ->
                                      case y4 of _ [Occ=Dead] { I# y5 ->
                                      case tagToEnum# @ Bool (==# (+# w y2) y5) of _ [Occ=Dead] {
                                        False -> $wgo2 ys2;
                                        True -> case $wgo2 ys2 of ww2 { __DEFAULT -> +# 1 ww2 }
                                      }
                                      }
                                  }; } in
                            $wgo2 ys1;
                          True ->
                            letrec {
                              $wgo2 [InlPrag=[0], Occ=LoopBreaker] :: [Int] -> Int#
                              [LclId, Arity=1, Str=DmdType <S,1*U>]
                              $wgo2 =
                                \ (w2 :: [Int]) ->
                                  case w2 of _ [Occ=Dead] {
                                    [] -> ww1;
                                    : y4 ys2 ->
                                      case y4 of _ [Occ=Dead] { I# y5 ->
                                      case tagToEnum# @ Bool (==# (+# w y2) y5) of _ [Occ=Dead] {
                                        False -> $wgo2 ys2;
                                        True -> case $wgo2 ys2 of ww2 { __DEFAULT -> +# 1 ww2 }
                                      }
                                      }
                                  }; } in
                            case $wgo2 ys1 of ww2 { __DEFAULT -> +# 1 ww2 }
                        }
                        }
                        }
                    }
                    }
                }; } in
          $wgo1 lvl } in
    case w of wild {
      __DEFAULT -> case $wgo (+# wild 1) of ww { __DEFAULT -> $j ww };
      1000 -> $j 0
    }
end Rec }

In fact, looking at the core for print $ foldr (+) (0 :: Int) $ [ i+j | i <- [0..10000], j <- [0..10000]], it seems as though only the first layer of the list comprehension is fused. Is that a bug?
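
For anyone who wants to inspect that core themselves, a self-contained version of that two-generator test is just the following (compile with -O2 and -ddump-simpl to see the core):

module Main where

main :: IO ()
main = print $ foldr (+) (0 :: Int) [i + j | i <- [0..10000], j <- [0..10000]]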

This code gets the job done in 1 second, with no extra allocation, in GHC 7.10 with -O2 (see the bottom for profiling output):

cmpsum :: Int -> Int -> Int -> Int
cmpsum i j k = fromEnum (i+j==k)

main = print $ sum [cmpsum i j k | i <- [1..1000],
                                   j <- [1..const 1000 i],
                                   k <- [1..const 1000 j]]

In GHC 7.8, you can get almost the same results in this case (1.4 seconds) if you add the following at the beginning:

import Prelude hiding (sum)

sum xs = foldr (\x r a -> a `seq` r (a+x)) id xs 0
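
To see what this definition does, it can help to unfold it by hand on a tiny input (an illustrative expansion, not part of the original answer). It is a strict left fold written in terms of foldr, so it can take part in foldr/build fusion while still forcing the accumulator at every step:

-- With f x r a = a `seq` r (a + x):
--
--   sum [1,2,3]
--     = foldr f id [1,2,3] 0
--     = (f 1 (f 2 (f 3 id))) 0
--     = (f 2 (f 3 id)) (0 + 1)        -- accumulator forced, then passed on
--     = (f 3 id) ((0 + 1) + 2)
--     = id (((0 + 1) + 2) + 3)
--     = 6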

There are three issues here:

  1. Specializing the code to Int instead of letting it default to Integer is crucial.

  2. GHC 7.10 offers list fusion for sum that GHC 7.8 does not. This is because the new definition of sum, based on a new definition of foldl, can be very bad in some cases without the "call arity" analysis Joachim Breitner created for GHC 7.10 (see the sketch after this list).

  3. GHC performs a limited "full laziness" pass very early in compilation, before any inlining occurs. As a result, the constant [1..1000] terms for j and k , which are used multiple times in the loop, get hoisted out of the loop. This would be good if these were actually expensive to calculate, but in this context it's much cheaper to do the additions over and over and over instead of saving the results. What the code above does is trick GHC. Since const isn't inlined until a little bit later, this first full laziness pass doesn't see that the lists are constant, so it doesn't hoist them out. I wrote it this way because it's nice and short, but it is, admittedly, a little on the fragile side. To make it more robust, use phased inlining:

     main = print $ sum [cmpsum i j k | i <- [1..1000],
                                        j <- [1..konst 1000 i],
                                        k <- [1..konst 1000 j]]

     {-# INLINE [1] konst #-}
     konst = const

    This guarantees that konst will be inlined in simplifier phase 1, but no earlier. Phase 1 occurs after list fusion is complete, so it's perfectly safe to let GHC see everything then.
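
Regarding point 2 above, the shape of the relevant definitions in GHC 7.10's base is roughly the following (a simplified sketch, not the exact library source; the real versions are Foldable-polymorphic and use a oneShot annotation):

import Prelude hiding (sum, foldl)

-- Roughly how GHC 7.10's base defines sum and foldl (simplified sketch):
sum :: Num a => [a] -> a
sum = foldl (+) 0

foldl :: (b -> a -> b) -> b -> [a] -> b
foldl k z0 xs = foldr (\v fn z -> fn (k z v)) id xs z0

main :: IO ()
main = print (sum [1..1000 :: Int])  -- 500500

Because foldl is expressed with foldr, sum becomes a good consumer and can fuse with the list comprehension. The flip side is that, without the call-arity analysis, this definition tends to build a chain of closures rather than a tight accumulator loop, which is why it only became the default in GHC 7.10 and why GHC 7.8 needs the hand-written sum above. Here is the profiling output mentioned at the top of this answer: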

          51,472 bytes allocated in the heap
           3,408 bytes copied during GC
          44,312 bytes maximum residency (1 sample(s))
          17,128 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0         0 colls,     0 par    0.000s   0.000s     0.0000s    0.0000s
  Gen  1         1 colls,     0 par    0.000s   0.000s     0.0002s    0.0002s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    1.071s  (  1.076s elapsed)
  GC      time    0.000s  (  0.000s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time    1.073s  (  1.077s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    48,059 bytes per MUT second

  Productivity  99.9% of total user, 99.6% of total elapsed

You are comparing looping over a single statement to counting by generating an intermediate structure (a list) and folding over it. I don't know how great the performance in Java would be if you created a linked list with a billion elements and iterated over it.

Here is Haskell code which is (approximately) equivalent to your Java code.

{-# LANGUAGE BangPatterns #-}

main = print (loop3 1 1 1 0) 

loop1 :: Int -> Int -> Int -> Int -> Int
loop1 !i !j !k !cc | k <= 1000 = loop1 i j (k+1) (cc + fromEnum (i + j == k))
                   | otherwise = cc 

loop2 :: Int -> Int -> Int -> Int -> Int
loop2 !i !j !k !cc | j <= 1000 = loop2 i (j+1) k (loop1 i j k cc)
                   | otherwise = cc 

loop3 :: Int -> Int -> Int -> Int -> Int
loop3 !i !j !k !cc | i <= 1000 = loop3 (i+1) j k (loop2 i j k cc)
                   | otherwise = cc 

And the execution on my machine (test2 is your Haskell code):

$ ghc --make -O2 test1.hs && ghc --make -O2 test2.hs && javac test3.java
$ time ./test1.exe && time ./test2.exe && time java test3
499500

real    0m1.614s
user    0m0.000s
sys     0m0.000s
499500

real    0m35.922s
user    0m0.000s
sys     0m0.000s
499500

real    0m1.589s
user    0m0.000s
sys     0m0.015s
