
Why does this kind of tail call of Fibonacci run faster than pure tree recursion in Haskell?

I'm trying to understand tail-call recursion. I converted the pure tree-recursion Fibonacci function:

fib 0 = 0
fib 1 = 1
fib n = fib (n-1) + fib (n-2)

to a tail call version:

fib' 0 a = a
fib' 1 a = 1 + a
fib' n a = fib' (n-1) (fib' (n-2) a)

When I try these two versions, it seems that the second one is faster than the first, tree-recursive version, even though I tried to use seq to force strict evaluation in the second one!
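
For example, the kind of seq variant I mean looks roughly like this (a sketch, not necessarily the exact code I benchmarked):

fibSeq 0 a = a
fibSeq 1 a = 1 + a
fibSeq n a = let a' = fibSeq (n-2) a in a' `seq` fibSeq (n-1) a'   -- force a' before recursing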

How does GHC treat such tail calls? Thanks!

Performance of code tested at the GHCi interactive prompt can be quite misleading, so when benchmarking GHC code, it's a good idea to test it in a standalone executable compiled with ghc -O2. Adding explicit type signatures and making sure -Wall doesn't report any warnings about "defaulting" types is helpful, too. Otherwise, GHC may choose default numeric types that you didn't intend. Finally, it's also a good idea to use the criterion benchmarking library, since it does a good job generating reliable and reproducible timing results.

Benchmarking your two fib versions this way with the program:

import Criterion.Main

fib :: Integer -> Integer
fib 0 = 0
fib 1 = 1
fib n = fib (n-1) + fib (n-2)

fib' :: Integer -> Integer -> Integer
fib' 0 a = a
fib' 1 a = 1 + a
fib' n a = fib' (n-1) (fib' (n-2) a)

main :: IO ()
main = defaultMain
  [ bench "fib" $ whnf fib 30
  , bench "fib'" $ whnf (fib' 30) 0
  ]

compiled with GHC 8.6.5 using ghc -O2 -Wall Fib2.hs, I get:

$ ./Fib2
benchmarking fib
time                 40.22 ms   (39.91 ms .. 40.45 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 39.91 ms   (39.51 ms .. 40.11 ms)
std dev              581.2 μs   (319.5 μs .. 906.9 μs)

benchmarking fib'
time                 38.88 ms   (38.69 ms .. 39.06 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 38.57 ms   (38.49 ms .. 38.67 ms)
std dev              188.7 μs   (139.6 μs .. 268.3 μs)

The difference here is quite small, but it can be consistently reproduced. The fib' version is about 3-5% faster than the fib version.

At this point, it's maybe worth pointing out that my type signatures used Integer. This is also the default that GHC would have selected without explicit type signatures. Replacing these with Int results in a massive performance improvement:

benchmarking fib
time                 4.877 ms   (4.850 ms .. 4.908 ms)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 4.766 ms   (4.730 ms .. 4.808 ms)
std dev              122.2 μs   (98.16 μs .. 162.4 μs)

benchmarking fib'
time                 3.295 ms   (3.260 ms .. 3.332 ms)
                     0.999 R²   (0.998 R² .. 1.000 R²)
mean                 3.218 ms   (3.202 ms .. 3.240 ms)
std dev              62.51 μs   (44.57 μs .. 88.39 μs)

That's why I recommend including explicit type signatures and making sure there are no warnings about default types. Otherwise, you can spend a lot of time chasing tiny improvements when the real problem is a loop index that uses Integer when it could have used Int. For this example, of course, there's the additional issue that the algorithm is all wrong, since the naive algorithm takes exponential time, and a linear implementation is possible, like the usual "clever Haskell" solution:

-- fib'' 30 runs about 100 times faster than fib 30
fib'' :: Int -> Int
fib'' n = fibs !! n
  where fibs = scanl (+) 0 (1:fibs)

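For completeness, here's a sketch of a strict, linear version that uses an accumulating tail call directly (the name fibIter and the exact formulation are just illustrative); with Int arguments and -O2, GHC's strictness analysis keeps the accumulators evaluated:

-- carry the current and next Fibonacci numbers and count n down to 0
fibIter :: Int -> Int
fibIter n = go n 0 1
  where
    go 0 a _ = a
    go k a b = go (k-1) b (a+b)

Like fib'', this does O(n) work instead of the exponential work of the tree recursion.
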
Anyway, let's switch back to fib and fib' using Integer for the rest of this answer...

The GHC compiler produces an intermediate form of a program called the STG (spineless, tagless, G-machine). It's the highest-level representation that faithfully represents how the program will actually be run. The best documentation of STG and how it's actually translated into heap allocations and stack frames is the paper Making a fast curry: push/enter versus eval/apply for higher-order languages. When reading this paper, Figure 1 is the STG language (though the syntax differs from what GHC produces with -ddump-stg) and Figure 2's first and third panels show how STG is evaluated using an eval/apply approach (which matches current GHC-generated code). There's also an older paper, Implementing lazy functional languages on stock hardware: the Spineless Tagless G-machine, that provides a lot more detail (probably too much), but it's a little out-of-date.

Anyway, to see the difference between fib and fib', we can look at the generated STG using:

ghc -O2 -ddump-stg -dsuppress-all -fforce-recomp Fib2.hs

Taking the STG output and substantially cleaning it up to look more like "regular Haskell", I get the following definitions:

fib = \n ->                          fib' = \n a ->
  case (==) n 0 of                     case (==) n 0 of
    True -> 0                            True -> a;
    _ ->                                 _ ->
      case (==) n 1 of                     case (==) n 1 of
        True -> 1                            True -> (+) 1 a;                 -- (6)
        _ ->                                 _ ->
          case (-) n 2 of                      case (-) n 2 of
            n_minus_2 ->                         n_minus_2 ->
              case fib n_minus_2 of                case fib' n_minus_2 a of
                y ->                                 y ->
                  case (-) n 1 of                      case (-) n 1 of
                    n_minus_1 ->                         n_minus_1 ->
                      case fib n_minus_1 of                fib' n_minus_1 y   -- (14)
                        x -> (+) x y

Here, strictness analysis has already made the entire computation strict. There are no thunks created here. (In STG, only let blocks create thunks, and there are no let blocks in this STG.) So, the (minimal) performance difference between these two implementations has nothing to do with strict versus lazy evaluation.
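
For contrast, here's a sketch of code that would leave a let (and therefore a thunk) in the STG; the name pairWithLazyFib is just illustrative, and the exact dump depends on the GHC version. Because the second component of the pair may never be demanded, GHC has to allocate it lazily:

-- In the -ddump-stg output, the pair's second field shows up as a
-- let-bound closure, i.e. a heap-allocated thunk.
pairWithLazyFib :: Integer -> (Integer, Integer)
pairWithLazyFib n = (n, fib (n + 1))

Nothing like that appears in the STG for fib and fib' above.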

Ignoring the extra argument to fib', note that these two implementations are essentially structurally identical, except for the addition operation at line (6) in fib' and the case statement with an addition operation at line (14) in fib.

To understand the difference between these two implementations, you first need to understand that a function call f a b is compiled to the pseudocode:

lbl_f:  load args a,b
        jump to f_entry

Note that all function calls, whether or not they are tail calls, are compiled to jumps like this. When the code in f_entry completes, it will jump to whatever continuation frame is at the top of the stack, so if the caller wants to do something with the result of a function call, it should push a continuation frame before jumping.

For example, the block of code:

case f a b of
    True -> body1
    _    -> body2

wants to do something with the return value of f a b, so it compiles to the following (unoptimized) pseudocode:

        push 16-byte case continuation frame <lbl0,copy_of_arg1> onto the stack
lbl_f:  -- code block for f a b, as above:
        load args a,b
        jump to f_entry   -- f_entry will jump to lbl0 when done
lbl0:   restore copy_of_arg1, pop case continuation frame
        if return_value == True jump to lbl1 else lbl2
lbl1:   block for body1
lbl2:   block for body2

Knowing this, the difference at line (6) between the two implementations is the pseudocode:

-- True -> 1                              -- True -> (+) 1 a
load 1 as return value                    load args 1,a
jump to next continuation                 jump to "+"
                                          -- Note: "+" will jump to next continuation

and the difference at line (14) between the two implementations is:

-- case fib n_minus_1 of ...              -- fib' n_minus_1 y
        push case continuation <lbl_a>    load args n_minus_1,y
        load arg n_minus_1                jump to fib'
        jump to fib
lbl_a:  pop case continuation
        load args returned_val,y
        jump to "+"

There's actually hardly any performance difference between these once they're optimized. The assembly code generated for these blocks is:

-- True -> 1                              -- True -> (+) 1 a
                                          movq 16(%rbp),%rsi
movl $lvl_r83q_closure+1,%ebx             movl $lvl_r83q_closure+1,%r14d
addq $16,%rbp                             addq $24,%rbp
jmp *(%rbp)                               jmp plusInteger_info

-- case fib n_minus_1 of ...              -- fib' n_minus_1 y
movq $block_c89A_info,(%rbp)              movq 8(%rbp),%rax
movq %rbx,%r14                            addq $16,%rbp
jmp fib_info                              movq %rax,%rsi
movq 8(%rbp),%rsi                         movq %rbx,%r14
movq %rbx,%r14                            // fall through to start of fib'
addq $16,%rbp
jmp plusInteger_info

The difference here is a few instructions. A few more instructions are saved because the fall-through in fib' n_minus_1 y skips the overhead of a stack size check.

In the version using Int, the additions and comparisons are all single instructions, and the difference between the two assemblies is, by my count, five instructions out of about 30 instructions total. Because of the tight loop, that's enough to account for the 33% performance difference.
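
If you want to inspect this kind of output yourself, GHC can dump the assembly produced by the native code generator (the exact listing varies by GHC version and platform):

ghc -O2 -ddump-asm -fforce-recomp Fib2.hs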

So, the bottom line is that there's no fundamental structural reason that fib' is faster than fib, and the small performance improvement comes down to micro-optimizations on the order of a handful of instructions that the tail call allows.

In other situations, reorganizing a function to introduce a tail call like this may or may not improve performance. This situation was probably unusual in that the reorganization of the function had very limited effect on the STG, so the net improvement of a few instructions wasn't swamped by other factors.
