GHC 7.10生成的代码比旧版本慢

Question

我意识到最新版本的GHC（7.10.3）产生的代码比旧版本慢得多。 我目前的版本：

$ ghc --version
The Glorious Glasgow Haskell Compilation System, version 7.10.3

我还在本地计算机上安装了另外两个旧版本。

我的测试代码来自这里（ collatz1.hs代码）：

import Data.Word
import Data.List
import System.Environment

collatzNext :: Word32 -> Word32
collatzNext a = (if even a then a else 3*a+1) `div` 2

-- new code
collatzLen :: Word32 -> Int
collatzLen a0 = lenIterWhile collatzNext (/= 1) a0

lenIterWhile :: (a -> a) -> (a -> Bool) -> a -> Int
lenIterWhile next notDone start = len start 0 where
    len n m = if notDone n
                then len (next n) (m+1)
                else m
-- End of new code

main = do
    [a0] <- getArgs
    let max_a0 = (read a0)::Word32
    print $ maximum $ map (\a0 -> (collatzLen a0, a0)) [1..max_a0]

使用GHC 7.4,7.6和7.10进行编译会产生以下时间：

$ ~/Tools/ghc-7.4.2/bin/ghc -O2 Test.hs 
[1 of 1] Compiling Main             ( Test.hs, Test.o )
Linking Test ...

$ time ./Test 1000000
(329,837799)

real    0m1.879s
user    0m1.876s
sys     0m0.000s

$ ~/Tools/ghc-7.6.1/bin/ghc -O2 Test.hs 
[1 of 1] Compiling Main             ( Test.hs, Test.o )
Linking Test ...

$ time ./Test 1000000
(329,837799)

real    0m1.901s
user    0m1.896s
sys     0m0.000s

$ ~/Tools/ghc/bin/ghc -O2 Test.hs 
[1 of 1] Compiling Main             ( Test.hs, Test.o )
Linking Test ...

$ time ./Test 1000000
(329,837799)

real    0m10.562s
user    0m10.528s
sys     0m0.036s

我们可以告诉毫无疑问，最新版本的GHC产生的代码比旧的两个版本更差。 我不能重现与博客相同的效率，虽然可能是因为我没有LLVM而且我没有作者使用的确切版本。 但是，我相信结论是显而易见的。

总的来说，我的问题是为什么会发生这种情况？ 不知何故GHC变得比以前更糟糕了。 具体来说，如果我想调查一下，我应该如何开始自己？

Answer 1

这是两个配置文件的比较（ diff Test-GHC-7-8-4.prof Test-GHC-7-10-3.prof ）

1c1                               
<       Fri Mar 11 19:58 2016 Time and Allocation Profiling Report  (Final)
---                               
>       Fri Mar 11 19:59 2016 Time and Allocation Profiling Report  (Final)
5,6c5,6                               
<       total time  =        2.40 secs   (2400 ticks @ 1000 us, 1 processor)
<       total alloc = 256,066,744 bytes  (excludes profiling overheads)
---                               
>       total time  =       10.89 secs   (10895 ticks @ 1000 us, 1 processor)
>       total alloc = 15,713,590,808 bytes  (excludes profiling overheads)
10,13c10,12                               
< lenIterWhile.len Main     93.8   0.0                    
< collatzMax       Main      2.2   93.7
< collatzNext      Main      2.0    0.0
< lenIterWhile     Main      1.5    6.2
---                                
> collatzNext      Main     79.6   89.4
> lenIterWhile.len Main     18.9    8.8
> collatzMax       Main      0.8    1.5

发生了一些非常奇怪的事情。 虽然在GHC lenIterWhile.len大部分时间lenIterWhile.len在服用，但collatzNext现在是罪魁祸首。 我们来看看转储核心：

-- GHC 7.8.4
Rec {
Main.$wlen [Occ=LoopBreaker]
  :: GHC.Prim.Word# -> GHC.Prim.Int# -> GHC.Prim.Int#
[GblId, Arity=2, Caf=NoCafRefs, Str=DmdType <S,1*U><L,U>]
Main.$wlen =
  \ (ww_s4Mn :: GHC.Prim.Word#) (ww1_s4Mr :: GHC.Prim.Int#) ->
    case ww_s4Mn of wild_XQ {
      __DEFAULT ->
        case GHC.Prim.remWord# wild_XQ (__word 2) of _ [Occ=Dead] {
          __DEFAULT ->
            Main.$wlen
              (GHC.Prim.quotWord#
                 (GHC.Prim.narrow32Word#
                    (GHC.Prim.plusWord#
                       (GHC.Prim.narrow32Word# (GHC.Prim.timesWord# (__word 3) wild_XQ))
                       (__word 1)))
                 (__word 2))
              (GHC.Prim.+# ww1_s4Mr 1);
          __word 0 ->
            Main.$wlen
              (GHC.Prim.quotWord# wild_XQ (__word 2)) (GHC.Prim.+# ww1_s4Mr 1)
        };
      __word 1 -> ww1_s4Mr
    }
end Rec }

似乎或多或少合理。 现在关于GHC 7.10.3：

Rec {
$wlen_r6Sy :: GHC.Prim.Word# -> GHC.Prim.Int# -> GHC.Prim.Int#
[GblId, Arity=2, Str=DmdType <S,U><L,U>]
$wlen_r6Sy =
  \ (ww_s60s :: GHC.Prim.Word#) (ww1_s60w :: GHC.Prim.Int#) ->
    case ww_s60s of wild_X1Z {
      __DEFAULT ->
        
          False ->
            $wlen_r6Sy
              (GHC.Prim.quotWord#
                 (GHC.Prim.narrow32Word#
                    (GHC.Prim.plusWord#
                       (GHC.Prim.narrow32Word# (GHC.Prim.timesWord# (__word 3) wild_X1Z))
                       (__word 1)))
                 (__word 2))
              (GHC.Prim.+# ww1_s60w 1);
          True ->
            $wlen_r6Sy
              (GHC.Prim.quotWord# wild_X1Z (__word 2)) (GHC.Prim.+# ww1_s60w 1)
        };
      __word 1 -> ww1_s60w
    }
end Rec }

好吧，好像是一样的。 除了even的召唤。 让我们来代替even用的直列变种一个Integral ，比如x rem 2 == 0 ：

import Data.Word
import Data.List
import System.Environment

collatzNext :: Word32 -> Word32
collatzNext a = (if a `rem` 2 == 0 then a else 3*a+1) `div` 2

-- rest of code the same

让我们通过分析和检查再次编译它：

$ stack --resolver=ghc-7.10 ghc -- Test.hs -O2 -fforce-recomp -prof -fprof-auto -auto-all
$ ./Test +RTS -s -p -RTS 
(329,837799)
     416,119,240 bytes allocated in the heap
          69,760 bytes copied during GC
          59,368 bytes maximum residency (2 sample(s))
          21,912 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0       800 colls,     0 par    0.000s   0.002s     0.0000s    0.0001s
  Gen  1         2 colls,     0 par    0.000s   0.000s     0.0002s    0.0003s

  INIT    time    0.000s  (  0.019s elapsed)
  MUT     time    2.500s  (  2.546s elapsed)
  GC      time    0.000s  (  0.003s elapsed)
  RP      time    0.000s  (  0.000s elapsed)
  PROF    time    0.000s  (  0.000s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time    2.500s  (  2.567s elapsed)

  %GC     time       0.0%  (0.1% elapsed)

  Alloc rate    166,447,696 bytes per MUT second

  Productivity 100.0% of total user, 97.4% of total elapsed

$ cat Test.prof
        Fri Mar 11 20:22 2016 Time and Allocation Profiling Report  (Final)

           Test.exe +RTS -s -p -RTS 1000000

        total time  =        2.54 secs   (2535 ticks @ 1000 us, 1 processor)
        total alloc = 256,066,984 bytes  (excludes profiling overheads)

COST CENTRE      MODULE  %time %alloc

lenIterWhile.len Main     94.4    0.0
main             Main      1.9   93.7
collatzNext      Main      1.8    0.0
lenIterWhile     Main      1.3    6.2

                                                                   individual     inherited
COST CENTRE           MODULE                     no.     entries  %time %alloc   %time %alloc

MAIN                  MAIN                        44           0    0.0    0.0   100.0  100.0
 main                 Main                        89           0    1.9   93.7   100.0  100.0
  main.\              Main                        92     1000000    0.4    0.0    98.1    6.2
   collatzLen         Main                        93     1000000    0.2    0.0    97.8    6.2
    lenIterWhile      Main                        94     1000000    1.3    6.2    97.5    6.2
     lenIterWhile.len Main                        95    88826840   94.4    0.0    96.2    0.0
      collatzNext     Main                        96    87826840    1.8    0.0     1.8    0.0
  main.max_a0         Main                        90           1    0.0    0.0     0.0    0.0
 CAF                  GHC.IO.Encoding.CodePage    73           0    0.0    0.0     0.0    0.0
 CAF                  System.Environment          64           0    0.0    0.0     0.0    0.0
 CAF                  GHC.IO.Handle.Text          62           0    0.0    0.0     0.0    0.0
 CAF                  GHC.IO.Encoding             61           0    0.0    0.0     0.0    0.0

似乎修复它。 所以问题是GHC-7.8内联even ，而GHC-7.10则没有。 这是因为为Int和Integer添加了{-# SPECIALISE even :: x -> x -> Bool #-}规则，这些规则不允许内联。

正如问题的讨论文件制作even和odd {-# INLINEABLE ... #-}将解决此问题。 请注意，出于性能原因，添加了专门化本身。

GHC 7.10生成的代码比旧版本慢

问题描述

1 个解决方案

解决方案1
18 已采纳 2016-03-11 19:28:23

GHC 7.10生成的代码比旧版本慢

问题描述

1 个解决方案

解决方案1 18 已采纳 2016-03-11 19:28:23

解决方案1
18 已采纳 2016-03-11 19:28:23