Maximizing Haskell loop performance with GHC

Question

To compare against the slow list performance reported in this GHC bug, I'm trying to make the following loop as fast as possible:

{-# LANGUAGE BangPatterns #-}

module Main (main) where

import Control.Monad
import Data.Word


main :: IO ()
main = do
  loop (maxBound :: Word32) $ \i -> do
    when (i `rem` 100000000 == 0) $
      print (fromIntegral i / fromIntegral (maxBound :: Word32))


loop :: Word32 -> (Word32 -> IO ()) -> IO ()
loop n f = go 0
  where
    go !i | i == n = return ()
    go !i          = f i >> go (i + 1)

compiled with ghc -O loop.hs.

However, running this takes 50 seconds on my computer, which is 10 times slower than the equivalent C program:

#include "limits.h"
#include "stdint.h"
#include "stdio.h"

int main(int argc, char const *argv[])
{
  for (uint32_t i = 0; i < UINT_MAX; ++i)
  {
    if (i % 100000000 == 0) printf("%f\n", (float) i / (float) UINT_MAX );
  }
  return 0;
}

compiled with gcc -O2 -std=c99 -o testc test.c.

Neither the freshly released GHC 7.8 nor -O2 improved the performance.

However, compiling with the -fllvm flag (on either GHC version) brought a 10x speed improvement, bringing the performance on par with C.

Questions:

Why is GHC's native codegen so much slower for my loop?
Is there a way to improve my loop so that it is also fast without -fllvm, or is this already the fastest IO loop over Word32 one can achieve?

Answer 1

Let's inspect the assembly. I modified the main function slightly to make the output clearer (the performance remains identical). I used GHC 7.8.2 with -O2.

main :: IO ()
main = do
  loop (maxBound :: Word32) $ \i -> do
    when (i `rem` 100000000 == 0) $
      putStrLn "foo"

There is a lot of clutter, so I try to only include the interesting parts:

Native Codegen

Main_zdwa_info:
.Lc3JD: /* check if there's enough space for stack growth */
    leaq -16(%rbp),%rax
    cmpq %r15,%rax
    jb .Lc3JO /* this jumps to some GC code that grows the stack, then
                 reenters the main closure */
.Lc3JP:
    movl $4294967295,%eax /* issue: loading the bound on every iteration */
    cmpq %rax,%r14
    jne .Lc3JB
.Lc3JC:
   /* Return from main. Code omitted */
.Lc3JB: /* test the index for modulus */
    movl $100000000,%eax /* issue: unnecessary moves */
    movq %rax,%rbx      
    movq %r14,%rax
    xorq %rdx,%rdx
    divq %rbx /* issue: doing the division (llvm and gcc avoid this) */
    testq %rdx,%rdx
    jne .Lc3JU
.Lc3JV: 
   /* do the printing. Code omitted. */
.Lc3JN:
   /* increment index and (I guess) restore registers messed up by the printing */
    movq 8(%rbp),%rax
    incq %rax  
    movl %eax,%r14d
    addq $16,%rbp
    jmp Main_zdwa_info
.Lc3JU:
    leaq 1(%r14),%rax   /*issue: why not just increment r14? */
    movl %eax,%r14d     
    jmp Main_zdwa_info

LLVM

 Main_zdwa_info:
/* code omitted: the same stack-checking stuff as in native */
.LBB1_1:
    movl    $4294967295, %esi /* load the bound */
    movabsq $-6067343680855748867, %rdi /*load a magic number for the modulus */
    jmp .LBB1_2
.LBB1_4:              
    incl    %ecx
.LBB1_2:  
    cmpq    %rsi, %rcx
    je  .LBB1_6 /* check bound */

    /* do the modulus with two multiplications, a shift and a magic number */
    /* note : gcc does the same reduction */ 
    movq    %rcx, %rax
    mulq    %rdi
    shrq    $26, %rdx
    imulq   $100000000, %rdx, %rax  
    cmpq    %rax, %rcx
    jne .LBB1_4 
    /* Code omitted: print, then return to loop beginning */
.LBB1_6:                       
    /* Code omitted: return from main */

Observations

IO overhead is nonexistent in both assemblies. The zero-byte RealWorld state token is conspicuously absent.
Native codegen doesn't do much strength reduction, in contrast to LLVM, which readily converts the modulus into multiplication, shift and magic numbers.
Native codegen redoes the stack space checking on each iteration, while LLVM doesn't. It doesn't seem to be a significant overhead, however.
Native codegen is just plain bad here at looping and register allocation. It shuffles around registers and loads the bound on each iteration. LLVM emits code comparable to hand-written code in tidiness.
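
To make the "magic number" in the LLVM listing less mysterious, here is a small sketch of the arithmetic behind it. It assumes the constant is ceiling (2^90 / 10^8) and that the total shift is 90 bits (mulq leaves the high 64 bits of the product in %rdx, then shrq shifts by another 26). The names magic and remByMagic are made up for illustration, and Integer stands in for the 128-bit product:

```haskell
import Data.Bits (shiftR)
import Data.Int (Int64)
import Data.Word (Word32)

-- | Assumed derivation of the constant: ceiling (2^90 / 10^8).
magic :: Integer
magic = (2 ^ (90 :: Int) + 99999999) `div` 100000000

-- | Remainder modulo 100000000 computed without a division instruction.
remByMagic :: Word32 -> Word32
remByMagic n = n - q * 100000000
  where
    -- quotient: multiply by the magic constant, then shift right by 64 + 26 bits
    q = fromInteger ((toInteger n * magic) `shiftR` 90) :: Word32

main :: IO ()
main = do
  -- the constant wraps to the signed 64-bit value shown in the assembly
  print (fromInteger magic == (-6067343680855748867 :: Int64))
  -- the result agrees with rem on a few sample values, including the edges
  let samples = [0, 1, 99999999, 100000000, 100000001, 4200000000, maxBound]
  print (all (\n -> remByMagic n == n `rem` 100000000) samples)
```

Both checks should print True. The final comparison in the listing (cmpq %rax, %rcx) corresponds to testing whether q * 100000000 equals the index, which is the same as testing for a zero remainder.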

As to your question:

Is there a way to improve my loop so that it is fast also without -fllvm, or is this already the fastest IO loop over Word32 one can achieve?

The best you can do here is manual strength reduction, I think, though I personally find that option unacceptable. Even after doing that, your code will still be significantly slower than with LLVM. I also ran the following trivial loop, and it's twice as fast with LLVM as with native:

import Data.Word
main = go 0 where
    go :: Word32 -> IO ()
    go i | i == maxBound = return ()
    go i = go (i + 1)

The culprit is again unnecessary register-shuffling and bound-loading. There isn't really any way to remedy these kinds of low-level issues, aside from switching to LLVM.
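
For reference, one possible form of the manual strength reduction mentioned above is to replace the rem test with a second counter that counts down to the next multiple. This is only a sketch: loopEvery is a hypothetical helper, not something from the original post, and no timings are claimed for it. The demo uses a tiny bound so that it finishes instantly:

```haskell
{-# LANGUAGE BangPatterns #-}

import Data.Word (Word32)

-- | Hypothetical helper: run @f i@ for every i in [0, n) that is a
-- multiple of @step@, without using rem inside the loop.
-- Assumes step >= 1.
loopEvery :: Word32 -> Word32 -> (Word32 -> IO ()) -> IO ()
loopEvery n step f = go 0 0
  where
    -- c counts the iterations left until the next multiple of step
    go !i !c
      | i == n    = return ()
      | c == 0    = f i >> go (i + 1) (step - 1)
      | otherwise = go (i + 1) (c - 1)

main :: IO ()
main = loopEvery 10 3 print  -- prints 0, 3, 6 and 9, one per line
```

In the setting of the question this would be called as loopEvery maxBound 100000000 with the printing action as the last argument. It removes the division, but not the register shuffling and bound loading described above.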

Answer 2

An easy optimization would be to use Float division instead of the default Double. Just write a convenience function to replace fromIntegral:

w2f :: Word32 -> Float
w2f = fromIntegral

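However, it is much faster to compute the loop like this:

import Control.Monad (forM_)
import Data.Word (Word32)

-- Visit only the multiples of 100000000 instead of testing every index.
main :: IO ()
main = forM_ [0, 100000000 .. mb] $ \i ->
    print (fromIntegral i / fromIntegral mb :: Float)
  where mb = maxBound :: Word32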
