
Why does this kind of tail call of fibonacci run faster than pure tree recursion in Haskell?

I'm trying to understand tail-call recursion. I converted the pure tree-recursion Fibonacci function:

fib 0 = 0
fib 1 = 1
fib n = fib (n-1) + fib (n-2)

to a tail-call version:

fib' 0 a = a
fib' 1 a = 1 + a
fib' n a = fib' (n-1) (fib' (n-2) a)

When I try these two versions, it seems that the second one is faster than the first, tree-recursive one, even though I tried to use seq to force strict evaluation in the second one!

How does Haskell treat such tail calls inside GHC? Thanks!

Performance of code tested at the GHCi interactive prompt can be quite misleading, so when benchmarking GHC code, it's a good idea to test it in a standalone executable compiled with ghc -O2. Adding explicit type signatures and making sure -Wall doesn't report any warnings about "defaulting" types is helpful, too. Otherwise, GHC may choose default numeric types that you didn't intend. Finally, it's also a good idea to use the criterion benchmarking library, since it does a good job generating reliable and reproducible timing results.

Benchmarking your two fib versions this way with the program:

import Criterion.Main

fib :: Integer -> Integer
fib 0 = 0
fib 1 = 1
fib n = fib (n-1) + fib (n-2)

fib' :: Integer -> Integer -> Integer
fib' 0 a = a
fib' 1 a = 1 + a
fib' n a = fib' (n-1) (fib' (n-2) a)

main :: IO ()
main = defaultMain
  [ bench "fib" $ whnf fib 30
  , bench "fib'" $ whnf (fib' 30) 0
  ]

compiled with GHC 8.6.5 using ghc -O2 -Wall Fib2.hs, I get:

$ ./Fib2
benchmarking fib
time                 40.22 ms   (39.91 ms .. 40.45 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 39.91 ms   (39.51 ms .. 40.11 ms)
std dev              581.2 μs   (319.5 μs .. 906.9 μs)

benchmarking fib'
time                 38.88 ms   (38.69 ms .. 39.06 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 38.57 ms   (38.49 ms .. 38.67 ms)
std dev              188.7 μs   (139.6 μs .. 268.3 μs)

The difference here is quite small, but can be consistently reproduced. The fib' version is about 3-5% faster than the fib version.

At this point, it's maybe worth pointing out that my type signatures used Integer. This is also the default that GHC would have selected without explicit type signatures. Replacing these with Int results in a massive performance improvement:

benchmarking fib
time                 4.877 ms   (4.850 ms .. 4.908 ms)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 4.766 ms   (4.730 ms .. 4.808 ms)
std dev              122.2 μs   (98.16 μs .. 162.4 μs)

benchmarking fib'
time                 3.295 ms   (3.260 ms .. 3.332 ms)
                     0.999 R²   (0.998 R² .. 1.000 R²)
mean                 3.218 ms   (3.202 ms .. 3.240 ms)
std dev              62.51 μs   (44.57 μs .. 88.39 μs)
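For reference, the Int version is exactly the benchmark program above with nothing changed except the two type signatures:

fib :: Int -> Int
fib 0 = 0
fib 1 = 1
fib n = fib (n-1) + fib (n-2)

fib' :: Int -> Int -> Int
fib' 0 a = a
fib' 1 a = 1 + a
fib' n a = fib' (n-1) (fib' (n-2) a)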

That's why I recommend including explicit type signatures and making sure there are no warnings about default types. Otherwise, you can spend a lot of time chasing tiny improvements when the real problem is a loop index that uses Integer when it could have used Int. For this example, of course, there's the additional issue that the algorithm itself is all wrong: it's exponential in n, while a linear implementation is possible, like the usual "clever Haskell" solution:

-- fib'' 30 runs about 100 times faster than fib 30
fib'' :: Int -> Int
fib'' n = fibs !! n
  where fibs = scanl (+) 0 (1:fibs)
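An accumulator-style tail recursion, closer in spirit to the question's fib', can also run in linear time. Here's a minimal sketch of that standard technique (fibIter and its helper go are my names for illustration; they aren't part of the benchmarks above):

-- Linear-time Fibonacci via a tail-recursive loop with two accumulators.
-- The seq calls keep the accumulators evaluated, so no thunk chain builds up.
fibIter :: Int -> Int
fibIter n = go n 0 1
  where
    go 0 a _ = a
    go k a b = a `seq` b `seq` go (k - 1) b (a + b)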

Anyway, let's switch back to fib and fib' using Integer for the rest of this answer...

The GHC compiler produces an intermediate form of a program called STG (for Spineless Tagless G-machine). It's the highest-level representation that faithfully represents how the program will actually be run. The best documentation of STG and how it's actually translated into heap allocations and stack frames is the paper Making a fast curry: push/enter versus eval/apply for higher-order languages. When reading this paper, Figure 1 is the STG language (though the syntax differs from what GHC produces with -ddump-stg) and Figure 2's first and third panels show how STG is evaluated using an eval/apply approach (which matches current GHC-generated code). There's also an older paper Implementing lazy functional languages on stock hardware: the Spineless Tagless G-machine that provides a lot more detail (probably too much), but it's a little out of date.

Anyway, to see the difference between fib and fib', we can look at the generated STG using:

ghc -O2 -ddump-stg -dsuppress-all -fforce-recomp Fib2.hs

Taking the STG output and substantially cleaning it up to look more like "regular Haskell", I get the following definitions:

fib = \n ->                          fib' = \n a ->
  case (==) n 0 of                     case (==) n 0 of
    True -> 0                            True -> a;
    _ ->                                 _ ->
      case (==) n 1 of                     case (==) n 1 of
        True -> 1                            True -> (+) 1 a;                 -- (6)
        _ ->                                 _ ->
          case (-) n 2 of                      case (-) n 2 of
            n_minus_2 ->                         n_minus_2 ->
              case fib n_minus_2 of                case fib' n_minus_2 a of
                y ->                                 y ->
                  case (-) n 1 of                      case (-) n 1 of
                    n_minus_1 ->                         n_minus_1 ->
                      case fib n_minus_1 of                fib' n_minus_1 y   -- (14)
                        x -> (+) x y

Here, strictness analysis has already made the entire computation strict. There are no thunks created here. (In STG, only let blocks create thunks, and there are no let blocks in this STG.) So, the (minimal) performance difference between these two implementations has nothing to do with strict versus lazy.
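(As a source-level illustration of that let-versus-case point, and not something taken from the STG dump: a lazy let binding corresponds to an allocated-but-unevaluated thunk, while forcing it, which is what STG's case does, evaluates it immediately.)

-- Illustration only: the error thunk in lazyBinding is allocated but never
-- demanded, so the function returns normally; strictBinding forces its
-- thunk with seq (a source-level analogue of an STG case) and would crash.
lazyBinding :: Int -> Int
lazyBinding n = let _x = error "never forced" :: Int in n + 1

strictBinding :: Int -> Int
strictBinding n = let x = error "forced" :: Int in x `seq` (n + 1)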

Ignoring the extra argument to fib', note that these two implementations are structurally almost identical. The differences are at line (6), where fib' performs an addition ((+) 1 a) while fib just returns 1, and at line (14), where fib' makes a direct tail call while fib wraps the recursive call in a case statement followed by an addition.

To understand the difference between these two implementations, you first need to understand that a function call f a b is compiled to the pseudocode:

lbl_f:  load args a,b
        jump to f_entry

Note that all function calls, whether or not they are tail calls, are compiled to jumps like this. When the code in f_entry completes, it will jump to whatever continuation frame is at the top of the stack, so if the caller wants to do something with the result of a function call, it should push a continuation frame before jumping.

For example, the block of code:

case f a b of
    True -> body1
    _    -> body2

wants to do something with the return value of f a b, so it compiles to the following (unoptimized) pseudocode:

        push 16-byte case continuation frame <lbl0,copy_of_arg1> onto the stack
lbl_f:  -- code block for f a b, as above:
        load args a,b
        jump to f_entry   -- f_entry will jump to lbl0 when done
lbl0:   restore copy_of_arg1, pop case continuation frame
        if return_value == True jump to lbl1 else lbl2
lbl1:   block for body1
lbl2:   block for body2

Knowing this, the difference in line (6) between the two implementations is the pseudocode:

-- True -> 1                              -- True -> (+) 1 a
load 1 as return value                    load args 1,a
jump to next continuation                 jump to "+"
                                          -- Note: "+" will jump to next continuation

and the difference in line (14) between the two implementations is:

-- case fib n_minus_1 of ...              -- fib' n_minus_1 y
        push case continuation <lbl_a>    load args n_minus_1,y
        load arg n_minus_1                jump to fib'
        jump to fib
lbl_a:  pop case continuation
        load args returned_val,y
        jump to "+"

There's actually hardly any performance difference between these once they're optimized. The assembly code generated for these blocks is:

-- True -> 1                              -- True -> (+) 1 a
                                          movq 16(%rbp),%rsi
movl $lvl_r83q_closure+1,%ebx             movl $lvl_r83q_closure+1,%r14d
addq $16,%rbp                             addq $24,%rbp
jmp *(%rbp)                               jmp plusInteger_info

-- case fib n_minus_1 of ...              -- fib' n_minus_1 y
movq $block_c89A_info,(%rbp)              movq 8(%rbp),%rax
movq %rbx,%r14                            addq $16,%rbp
jmp fib_info                              movq %rax,%rsi
movq 8(%rbp),%rsi                         movq %rbx,%r14
movq %rbx,%r14                            // fall through to start of fib'
addq $16,%rbp
jmp plusInteger_info

The difference here is a few instructions. A few more instructions are saved because the fall-through in fib' n_minus_1 y skips the overhead of a stack size check.

In the version using Int, the additions and comparisons are all single instructions, and the difference between the two assemblies is -- by my count -- five instructions out of about 30 instructions total. Because of the tight loop, that's enough to account for the 33% performance difference.

So, the bottom line is that there's no fundamental structural reason that fib' is faster than fib, and the small performance improvement comes down to micro-optimizations on the order of a handful of instructions that the tail call allows.

In other situations, reorganizing a function to introduce a tail call like this may or may not improve performance. This situation was probably unusual in that the reorganization of the function had very limited effect on the STG and so the net improvement of a few instructions wasn't swamped by other factors.
