Performance characteristics of length vs. fold vs. explicit recursion

Question

I've written six versions of the length function. Some of the performance differences make sense, but others don't agree at all with the articles I've read (e.g. this one and this one).

-- len1 and lenFold1 should have equivalent performance, right?

len1 :: [a] -> Integer
len1 [] = 0
len1 (x:xs) = len1 xs + 1

lenFold1 :: [a] -> Integer
lenFold1 = foldr (\_ a -> a + 1) 0


-- len2 and lenFold2 should have equivalent performance, right?

len2 :: [a] -> Integer
len2 xs = go xs 0 where
  go [] acc = acc
  go (x:xs) acc = go xs (1 + acc)

lenFold2 :: [a] -> Integer
lenFold2 = foldl (\a _ -> a + 1) 0


-- len3 and lenFold3 should have equivalent performance, right?
-- And len3 should outperform len1 and len2, right?

len3 :: [a] -> Integer
len3 xs = go xs 0 where
  go [] acc = acc
  go (x:xs) acc = go xs $! (1 + acc)

lenFold3 :: [a] -> Integer
lenFold3 = foldl' (\a _ -> a + 1) 0

The actual performance on my machine is puzzling.

*Main Lib> :set +m +s
*Main Lib> xs = [1..10000000]
(0.01 secs, 351,256 bytes)
*Main Lib> len1 xs
10000000
(5.47 secs, 2,345,244,016 bytes)
*Main Lib> lenFold1 xs
10000000
(2.74 secs, 1,696,750,840 bytes)
*Main Lib> len2 xs
10000000
(6.02 secs, 2,980,997,432 bytes)
*Main Lib> lenFold2 xs
10000000
(3.97 secs, 1,776,750,816 bytes)
*Main Lib> len3 xs
10000000
(5.24 secs, 3,520,354,616 bytes)
*Main Lib> lenFold3 xs
10000000
(1.24 secs, 1,040,354,528 bytes)
*Main Lib> length xs
10000000
(0.21 secs, 720,354,480 bytes)

My questions:

1. Why does the fold version of each function consistently outperform the version using explicit recursion?
2. None of these implementations ever hits a stack overflow on my machine, despite the warnings of this article. Why not?
3. Why doesn't len3 perform better than len1 or len2?
4. Why does the Prelude's length perform so much better than any of these implementations?

EDIT:

Thanks to Carl's suggestion, my first and second questions are addressed by the fact that GHCi interprets code by default. Running it again with -fobject-code accounts for the performance difference between the explicit recursion and the fold. The new measurements:

Prelude Lib Main> xs = [1..10000000]
(0.00 secs, 354,136 bytes)
Prelude Lib Main> len1 xs
10000000
(1.62 secs, 1,612,661,544 bytes)
Prelude Lib Main> lenFold1 xs
10000000
(1.62 secs, 1,692,661,552 bytes)
Prelude Lib Main> len2 xs
10000000
(2.46 secs, 1,855,662,888 bytes)
Prelude Lib Main> lenFold2 xs
10000000
(2.53 secs, 1,772,661,528 bytes)
Prelude Lib Main> len3 xs
10000000
(0.48 secs, 1,680,361,272 bytes)
Prelude Lib Main> lenFold3 xs
10000000
(0.31 secs, 1,040,361,240 bytes)
Prelude Lib Main> length xs
10000000
(0.18 secs, 720,361,272 bytes)

I still have a few questions about this.

1. Why does lenFold3 outperform len3? I ran this a few times.
2. How does length still outperform all of these implementations?

Answer 1

I don't think you can properly test performance from GHCi, no matter what flags you use.

In general, the best way to do performance testing of Haskell code is to use the Criterion benchmarking library and compile with ghc -O2. Converted to a Criterion benchmark, your program looks like this:

import Criterion.Main
import GHC.List
import Prelude hiding (foldr, foldl, foldl', length)

len1 :: [a] -> Integer
len1 [] = 0
len1 (x:xs) = len1 xs + 1

lenFold1 :: [a] -> Integer
lenFold1 = foldr (\_ a -> a + 1) 0

len2 :: [a] -> Integer
len2 xs = go xs 0 where
  go [] acc = acc
  go (x:xs) acc = go xs (1 + acc)

lenFold2 :: [a] -> Integer
lenFold2 = foldl (\a _ -> a + 1) 0

len3 :: [a] -> Integer
len3 xs = go xs 0 where
  go [] acc = acc
  go (x:xs) acc = go xs $! (1 + acc)

lenFold3 :: [a] -> Integer
lenFold3 = foldl' (\a _ -> a + 1) 0

testLength :: ([Int] -> Integer) -> Integer
testLength f = f [1..10000000]

main = defaultMain
  [ bench "lenFold1" $ whnf testLength lenFold1
  , bench "len1" $ whnf testLength len1
  , bench "len2" $ whnf testLength len2
  , bench "lenFold2" $ whnf testLength lenFold2
  , bench "len3" $ whnf testLength len3
  , bench "lenFold3" $ whnf testLength lenFold3
  , bench "length" $ whnf testLength (fromIntegral . length)
  ]

and the abbreviated results on my machine are:

len1                 190.9 ms   (136.8 ms .. 238.6 ms)
lenFold1             207.8 ms   (151.6 ms .. 248.6 ms)
len2                 69.96 ms   (69.09 ms .. 71.63 ms)
lenFold2             1.191 s    (917.1 ms .. 1.454 s)
len3                 69.26 ms   (69.20 ms .. 69.35 ms)
lenFold3             87.14 ms   (86.95 ms .. 87.35 ms)
length               26.78 ms   (26.50 ms .. 27.08 ms)

Note that these results are quite different from the performance you observed running these tests from GHCi, in both absolute and relative terms, with or without -fobject-code. Why? Beats me.

Anyway, based on this proper benchmark, len1 and lenFold1 have nearly identical performance. In fact, the Core generated for lenFold1 is:

lenFold1 = len1

so they are identical functions. The apparent difference in my benchmarks is real (it is reproducible), though it appears to be the result of some cache/alignment issue rather than of the code itself. If I reorder len1 and lenFold1 in main, the performance difference flips around (so that len1 is the "slow one").

len2 and len3 also have identical performance because they are identical functions. (In fact, the generated code for len3 is len3 = len2.) GHC's strictness analyser determines that the expression 1 + acc can be evaluated strictly, even without the explicit $! operator.
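
If you would rather not depend on the strictness analyser (which, as far as I know, only runs when optimization is enabled, so interpreted GHCi code does not benefit from it), the accumulator can be forced explicitly. The following is a minimal sketch, not code from the question: lenBang is a hypothetical name, and it uses a bang pattern where len3 uses $!.

```haskell
{-# LANGUAGE BangPatterns #-}

-- Hypothetical variant of len3: same accumulator loop, but the
-- accumulator is forced by a bang pattern instead of ($!).
lenBang :: [a] -> Integer
lenBang = go 0
  where
    go :: Integer -> [b] -> Integer
    go !acc []     = acc              -- acc is already evaluated here
    go !acc (_:xs) = go (acc + 1) xs  -- acc + 1 is forced on entry to the next call
```

Loaded into GHCi, lenBang "abc" evaluates to 3. With ghc -O2 I would expect this to compile to the same loop as len2 and len3, but I have not checked the Core for this particular variant.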

lenFold3 is slightly slower because foldl' isn't inlined, so the combining function needs an explicit call on every iteration. This is arguably a bug that's been reported here. We can work around it by changing the definition of lenFold3 to explicitly provide three arguments to foldl':

lenFold3 xs = foldl' (\a _ -> a + 1) 0 xs

and then it performs just as well as len2 and len3:

lenFold3             66.99 ms   (66.76 ms .. 67.30 ms)

The abysmal performance of lenFold2 is a manifestation of the same problem. Without inlining, GHC can't perform proper optimization. If we change the definition to:

lenFold2 xs = foldl (\a _ -> a + 1) 0 xs

it performs just as well as the others:

lenFold2             66.64 ms   (66.58 ms .. 66.68 ms)

To be clear, after making these two changes to lenFold2 and lenFold3, the functions len2, len3, lenFold2, and lenFold3 are all identical, except that lenFold2 and lenFold3 pass the operands of + in a different order (a + 1 rather than 1 + acc). If we use the definitions:

lenFold2 xs = foldl (\a _ -> 1 + a) 0 xs
lenFold3 xs = foldl' (\a _ -> 1 + a) 0 xs

then the generated Core (which you can view with ghc -O2 -ddump-simpl -dsuppress-all -dsuppress-uniques -fforce-recomp) is actually:

len2 = ...actual definition...
lenFold2 = len2
len3 = len2
lenFold3 = len2

so they're all precisely identical.

They are genuinely different from len1 (or equivalently lenFold1) because len1 builds up a large set of stack frames that it then needs to process when it gets to the end of the list and "discovers" that an empty list has length zero. The reason there's no stack overflow is that a lot of the blog posts about Haskell stack overflows appear to be either obsolete or based on GHCi tests. In code compiled with modern GHC versions, the maximum stack size defaults to 80% of physical memory, so you can use gigabytes of stack without really noticing. In this case, some quick profiling with +RTS -hT shows that the stack grows to about 60-70 megabytes for a single len1 [1..10000000], not nearly enough to overflow anything. In contrast, the len2 family doesn't accumulate any appreciable stack.

Finally, the reason length blows them all away is that it calculates the length using an Int instead of an Integer. If I change the type signatures to:
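
To make the difference in stack use concrete, here is a small sketch with hand-written reduction steps in the comments. The names lenStack and lenAcc are hypothetical stand-ins for len1 and len3, and the traces are an informal illustration of evaluation order, not output from GHC.

```haskell
-- len1-style: the "+ 1" cannot run until the recursive call returns,
-- so every element leaves one pending addition on the stack.
--   lenStack [a,b,c]
--     = lenStack [b,c] + 1
--     = (lenStack [c] + 1) + 1
--     = ((lenStack [] + 1) + 1) + 1
--     = ((0 + 1) + 1) + 1
lenStack :: [a] -> Integer
lenStack []     = 0
lenStack (_:xs) = lenStack xs + 1

-- len3-style: the recursive call is the last thing go does, and the
-- accumulator is forced at each step, so nothing is left pending.
--   lenAcc [a,b,c]
--     = go [a,b,c] 0
--     = go [b,c] 1
--     = go [c] 2
--     = go [] 3
--     = 3
lenAcc :: [a] -> Integer
lenAcc xs0 = go xs0 0
  where
    go []     acc = acc
    go (_:xs) acc = go xs $! (acc + 1)
```

Both functions return the same result; only the amount of pending work differs, which is what the profiling figures above reflect.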

len1 :: [a] -> Int
len2 :: [a] -> Int

then I get:

len1                 144.7 ms   (121.8 ms .. 157.9 ms)
len2                 27.38 ms   (27.31 ms .. 27.44 ms)
length               27.50 ms   (27.45 ms .. 27.54 ms)

and len2 (and so lenFold2, len3, and lenFold3) are all as fast as length.
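
If an Integer result is still wanted at the call site, one option is to count in Int and convert once at the end, which is what the fromIntegral . length benchmark entry above does. This is a sketch only: lenInt and lenInteger are hypothetical names, and it assumes the list length fits in an Int.

```haskell
import Data.List (foldl')

-- Count in machine-sized Int, as the answer suggests for len2.
-- Written with all three arguments to foldl', following the
-- workaround described above.
lenInt :: [a] -> Int
lenInt xs = foldl' (\acc _ -> acc + 1) 0 xs

-- Convert to Integer once, after counting, instead of doing
-- Integer arithmetic on every element.
lenInteger :: [a] -> Integer
lenInteger = toInteger . lenInt
```

In GHCi, lenInteger "abc" evaluates to 3. I have not benchmarked this version, so treat any speed-up as something to confirm with Criterion.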

Answer 1: 12, accepted, 2020-09-11 21:27:44