Maximizing Haskell loop performance with GHC

Question

In order to compare performance against the slow lists in a GHC bug, I am trying to make the following loop as fast as possible:

{-# LANGUAGE BangPatterns #-}

module Main (main) where

import Control.Monad
import Data.Word


main :: IO ()
main = do
  loop (maxBound :: Word32) $ \i -> do
    when (i `rem` 100000000 == 0) $
      print (fromIntegral i / fromIntegral (maxBound :: Word32))


loop :: Word32 -> (Word32 -> IO ()) -> IO ()
loop n f = go 0
  where
    go !i | i == n = return ()
    go !i          = f i >> go (i + 1)

Compiled with ghc -O loop.hs.

However, running it takes 50 seconds on my computer, 10 times slower than the equivalent C program:

#include <limits.h>
#include <stdint.h>
#include <stdio.h>

int main(int argc, char const *argv[])
{
  for (uint32_t i = 0; i < UINT_MAX; ++i)
  {
    if (i % 100000000 == 0) printf("%f\n", (float) i / (float) UINT_MAX );
  }
  return 0;
}

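Compiled with gcc -O2 -std=c99 -o testc test.c.

Using the newly released GHC 7.8 or compiling with -O2 does not improve the performance.

However, compiling with the -fllvm flag (on either GHC version) gives a 10x speedup, bringing the performance on par with C.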

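Questions:

Why is GHC's native code generator so much slower for my loop?
Is there a way to improve my loop so that it is also fast without -fllvm, or is this already the fastest IO loop over Word32 that can be achieved?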

Answer 1

Let's examine the assembly. I modified the main function slightly so that the output is cleaner (the performance stays the same). I used GHC 7.8.2 with -O2.

main :: IO ()
main = do
  loop (maxBound :: Word32) $ \i -> do
    when (i `rem` 100000000 == 0) $
      putStrLn "foo"

There is a lot of clutter, so I tried to include only the interesting parts:

Native Codegen

Main_zdwa_info:
.Lc3JD: /* check if there's enough space for stack growth */
    leaq -16(%rbp),%rax
    cmpq %r15,%rax
    jb .Lc3JO /* this jumps to some GC code that grows the stack, then
                 reenters the main closure */
.Lc3JP:
    movl $4294967295,%eax /* issue: loading the bound on every iteration */
    cmpq %rax,%r14
    jne .Lc3JB
.Lc3JC:
   /* Return from main. Code omitted */
.Lc3JB: /* test the index for modulus */
    movl $100000000,%eax /* issue: unnecessary moves */
    movq %rax,%rbx      
    movq %r14,%rax
    xorq %rdx,%rdx
    divq %rbx /* issue: doing the division (llvm and gcc avoid this) */
    testq %rdx,%rdx
    jne .Lc3JU
.Lc3JV: 
   /* do the printing. Code omitted. */
.Lc3JN:
   /* increment index and (I guess) restore registers messed up by the printing */
    movq 8(%rbp),%rax
    incq %rax  
    movl %eax,%r14d
    addq $16,%rbp
    jmp Main_zdwa_info
.Lc3JU:
    leaq 1(%r14),%rax   /*issue: why not just increment r14? */
    movl %eax,%r14d     
    jmp Main_zdwa_info

LLVM

 Main_zdwa_info:
/* code omitted: the same stack-checking stuff as in native */
.LBB1_1:
    movl    $4294967295, %esi /* load the bound */
    movabsq $-6067343680855748867, %rdi /*load a magic number for the modulus */
    jmp .LBB1_2
.LBB1_4:              
    incl    %ecx
.LBB1_2:  
    cmpq    %rsi, %rcx
    je  .LBB1_6 /* check bound */

    /* do the modulus with two multiplications, a shift and a magic number */
    /* note : gcc does the same reduction */ 
    movq    %rcx, %rax
    mulq    %rdi
    shrq    $26, %rdx
    imulq   $100000000, %rdx, %rax  
    cmpq    %rax, %rcx
    jne .LBB1_4 
    /* Code omitted: print, then return to loop beginning */
.LBB1_6:                       
    /* Code omitted: return from main */

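Observations

There is no IO overhead in either assembly listing. The zero-byte RealWorld state token is evidently gone.
The native codegen does little strength reduction compared to LLVM, which readily converts the modulus into multiplications, a shift and a magic number.
The native codegen redoes the stack space check on every iteration, while LLVM does not. This does not seem to be a significant overhead, though.
The native codegen is quite bad at loops and register allocation. It shuffles registers around and loads the bound on every iteration. LLVM emits code that is about as tidy as handwritten assembly.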
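To make the "magic number" reduction concrete, here is a minimal Haskell sketch of the same idea. It is not the code LLVM emits: it uses an assumed 32-bit variant of the constant (shift 57, magic = ceil(2^57 / 100000000)) so that the product fits in a Word64, whereas the listing above uses a 64-bit constant with the high half of a 128-bit product. The names fastRem, magic and shiftBits are illustrative only.

```haskell
import Data.Bits (shiftR)
import Data.Word (Word32, Word64)

-- Divisor from the question's loop.
divisor :: Word64
divisor = 100000000

-- Assumed constants for 32-bit inputs (not the 64-bit values in the LLVM listing).
shiftBits :: Int
shiftBits = 57

-- ceil(2^57 / divisor); the division is never exact, so floor + 1 is the ceiling.
magic :: Word64
magic = (2 ^ shiftBits) `div` divisor + 1

-- Remainder via multiply, shift, multiply and subtract; no division instruction.
fastRem :: Word32 -> Word32
fastRem i = fromIntegral (n - q * divisor)
  where
    n = fromIntegral i :: Word64
    q = (n * magic) `shiftR` shiftBits  -- equals n `div` divisor for all n < 2^32

main :: IO ()
main = do
  let samples = [0, 1, 99999999, 100000000, 100000001, 4200000000, maxBound]
  -- Compare against the built-in rem on a few edge values.
  print (all (\i -> fastRem i == i `rem` 100000000) samples)
```

The product n * magic stays below 2^63 for every Word32 input, so nothing overflows. Whether writing this by hand helps the native code generator in practice is not something this sketch measures.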

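As for your question:

"Is there a way to improve my loop so that it is also fast without -fllvm, or is this already the fastest IO loop over Word32 that can be achieved?"

I think the best you can do is manual strength reduction, although I personally find that option unacceptable. Even after doing so, your code will still be significantly slower. I also ran the following simple loop, which runs twice as fast with LLVM as with the native code generator: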

import Data.Word
main = go 0 where
    go :: Word32 -> IO ()
    go i | i == maxBound = return ()
    go i = go (i + 1)

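The culprit is again the needless register shuffling and the reloading of the bound. Apart from switching to LLVM, there is no real way to fix these low-level issues.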
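For reference, one form the manual strength reduction mentioned above could take is to replace the per-iteration modulus with a second counter. This is only a sketch under that assumption, not the answerer's code, and it has not been benchmarked here; loopEvery and untilNext are hypothetical names, and step is assumed to be at least 1.

```haskell
{-# LANGUAGE BangPatterns #-}

import Data.Word (Word32)

-- Calls f on every multiple of step in [0, n). Instead of computing
-- i `rem` step on each iteration, a second counter tracks how many
-- iterations remain until the next multiple. Assumes step >= 1.
loopEvery :: Word32 -> Word32 -> (Word32 -> IO ()) -> IO ()
loopEvery step n f = go 0 0
  where
    go !i !untilNext
      | i == n         = return ()
      | untilNext == 0 = f i >> go (i + 1) (step - 1)
      | otherwise      = go (i + 1) (untilNext - 1)

main :: IO ()
main =
  loopEvery 100000000 maxBound $ \i ->
    print (fromIntegral i / fromIntegral (maxBound :: Word32) :: Double)
```

This removes the division from the loop body, but, as the answer notes, the native code generator's register shuffling remains, so a gap to LLVM should still be expected.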

Answer 2

One simple optimization is to use Float division instead of the default Double. Just write a convenience function to replace fromIntegral:

w2f :: Word32 -> Float
w2f = fromIntegral

However, it is much faster to compute the loop like this:

import Control.Monad (forM_)
import Data.Word (Word32)

main :: IO ()
main = forM_ [0, 100000000 .. mb] $ \i ->
    print (fromIntegral i / fromIntegral mb :: Float)
  where mb = maxBound :: Word32
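The enumeration visits the same values the question's loop prints only because maxBound is not a multiple of 100000000: the original loop stops before the bound, while the enumeration includes it. A small sketch with assumed small bounds shows the difference; viaFilter and viaEnum are illustrative names, and step is assumed to be at least 1.

```haskell
import Data.Word (Word32)

-- Values the question's loop acts on: multiples of step in [0, n).
viaFilter :: Word32 -> Word32 -> [Word32]
viaFilter step n = filter (\i -> i `rem` step == 0) (takeWhile (/= n) [0 ..])

-- Values the enumeration visits: multiples of step in [0, n].
viaEnum :: Word32 -> Word32 -> [Word32]
viaEnum step n = [0, step .. n]

main :: IO ()
main = do
  print (viaFilter 10 95 == viaEnum 10 95)    -- bound is not a multiple of step
  print (viaFilter 10 100 == viaEnum 10 100)  -- bound is a multiple of step
```

The first comparison holds and the second does not, since only the enumeration includes 100. Note also that the enumerated version no longer walks through every Word32 value, so it measures something different from the original loop.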
