Using GHC's profiling stats/charts to identify trouble-areas / improve performance of Haskell code
TL;DR: Based on the Haskell code and its associated profiling data below, what conclusions can we draw that let us modify/improve it so we can narrow the performance gap vs. the same algorithm written in imperative languages (namely C++ / Python / C#, but the specific language isn't important)?
I wrote the following piece of code as an answer to a question on a popular site which contains many questions of a programming and/or mathematical nature. (You've probably heard of this site, whose name is pronounced "oiler" by some, "yoolurr" by others.) Since the code below is a solution to one of the problems, I'm intentionally avoiding any mention of the site's name or any specific terms in the problem. That said, I'm talking about problem one hundred and three.
(In fact, I've seen many solutions in the site's forums from resident Haskell wizards :P)
This was the first problem (on said site) in which I encountered a difference in performance (as measured by execution time) between Haskell code and C++/Python/C# code (when both use a similar algorithm). In fact, for all of the problems so far (I've done ~100 problems, but not sequentially), optimized Haskell code was pretty much neck-and-neck with the fastest C++ solutions, ceteris paribus for the algorithm, of course.
However, the posts in the forum for this particular problem indicate that the same algorithm in these other languages typically requires at most one or two seconds, with the longest taking 10-15 sec (assuming the same starting parameters; I'm ignoring the very naive algorithms that take 2-3 min+). In contrast, the Haskell code below required ~50 sec on my (decent) computer (with profiling disabled; with profiling enabled, it takes ~2 min, as you can see below; note: the exec time was identical when compiling with -fllvm). Specs: i5 2.4 GHz laptop, 8 GB RAM.
In an effort to learn Haskell in a way that it can become a viable substitute for the imperative languages, one of my aims in solving these problems is learning to write code that, to the extent possible, has performance that's on par with those imperative languages. In that context, I still consider the problem as yet unsolved by me (since there's roughly a 25x difference in performance!)
In addition to the obvious step of streamlining the code itself (to the best of my ability), I've also performed the standard profiling exercises that are recommended in "Real World Haskell".
But I'm having a hard time drawing conclusions that tell me which pieces need to be modified. That's where I'm hoping folks might be able to help provide some guidance.
I'd refer you to the page for problem one hundred and three on the aforementioned site, but here's a brief summary: the goal is to find a group of seven numbers such that any two disjoint subgroups (of that group) satisfy the following two properties (I'm trying to avoid using the 'set' word for reasons mentioned above...):
1. no two subgroups sum to the same amount;
2. the subgroup with more elements has the larger sum.
In particular, we are trying to find the group of seven numbers with the smallest sum.
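As a concrete reading of the two properties, here is a minimal brute-force sketch that checks every pair of disjoint non-empty subgroups. It assumes rule 1 means "disjoint subgroups never have equal sums" and rule 2 means "the subgroup with more elements has the larger sum", which is how the code further down uses them. It is not the DP-based check from that code, and the names isSpecial and subgroups are made up for illustration.

```haskell
import Data.List (subsequences)

-- All non-empty subgroups of a group (hypothetical helper name).
subgroups :: [Int] -> [[Int]]
subgroups = filter (not . null) . subsequences

-- True when every pair of disjoint non-empty subgroups satisfies both rules.
-- Assumes the elements of the group are distinct.
isSpecial :: [Int] -> Bool
isSpecial xs = and
  [ sum b /= sum c && sizeRule b c   -- rule 1, then rule 2
  | b <- subgroups xs
  , c <- subgroups xs
  , disjoint b c
  ]
  where
    disjoint b c = all (`notElem` c) b
    -- Rule 2: the subgroup with more elements must have the larger sum.
    sizeRule b c
      | length b > length c = sum b > sum c
      | length b < length c = sum b < sum c
      | otherwise           = True

main :: IO ()
main = do
  print (isSpecial [2, 3, 4])  -- True
  print (isSpecial [1, 2, 3])  -- False: 1 + 2 == 3 breaks rule 1
```

This enumerates every pair of subgroups, so it is only useful as a reference check on small inputs, not as a replacement for the pruning done in construct.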
A warning: some of these comments may well be totally wrong, but I wanted to at least take a stab at interpreting the profiling data based on what I read in Real World Haskell and other profiling-related posts on SO.
Most of the time is spent in three places: (i) 41.6% in the value sub-function, which determines values to fill in the dynamic programming ("DP") table, (ii) 29.1% in the table function, which generates the DP table and (iii) 12.4% in the rule1 function, which checks the resulting DP table to make sure that a given sum can only be calculated in one way (i.e., from one subgroup).
I'm surprised that so much more time is spent in the value function relative to the table and rule1 functions, given that it's the only one of the three which doesn't construct an array or filter through a large number of elements (it's really only performing O(1) lookups and making comparisons between Int types, which you'd think would be relatively quick). So this is a potential problem area.
It seems unlikely that the value function is driving the high heap allocation.
Frankly, I'm not sure what to make of the three charts.
Heap profile chart (i.e., the first chart below):
I'm not sure what the red area marked Pinned represents. It makes sense that the dynamic function has a "spiky" memory allocation because it's called every time the construct function generates a tuple that meets the first three criteria and, each time it's called, it creates a decently large DP array. Also, I'd think that the allocation of memory to store the tuples (generated by construct) wouldn't be flat over the course of the program.
Allocation by type and allocation by constructor:
I suspect that ARR_WORDS (which represents a ByteString or unboxed Array according to the GHC docs) represents the low-level execution of the construction of the DP array (in the table function). But I'm not 100% sure.
I'm not sure what the FROZEN and STATIC pointer categories correspond to.
Without further ado, here's the code with comments explaining my algorithm. I've tried to make sure the code doesn't run off the right side of the code box, but some of the comments do require scrolling (sorry).
{-# LANGUAGE NoImplicitPrelude #-}
{-# OPTIONS_GHC -Wall #-}

import CorePrelude
import Data.Array
import Data.List
import Data.Bool.HT ((?:))
import Control.Monad (guard)

main = print (minimum construct)

cap = 55 :: Int
flr = 20 :: Int
step = 1 :: Int

--we enumerate tuples that are potentially valid and then
--filter for valid ones; we perform the most computationally
--expensive step (i.e., rule 1) at the very end
construct :: [[Int]]
construct = {-# SCC "construct" #-} do
    a <- [flr..cap]         --1st: we construct potentially valid tuples while applying a
    b <- [a+step..cap]      --constraint on the upper bound of any element as implied by rule 2
    c <- [b+step..a+b-1]
    d <- [c+step..a+b-1]
    e <- [d+step..a+b-1]
    f <- [e+step..a+b-1]
    g <- [f+step..a+b-1]
    guard (a + b + c + d - e - f - g > 0)   --2nd: we screen for tuples that completely conform to rule 2
    let nn = [g,f,e,d,c,b,a]
    guard (sum nn < 285)    --3rd: we screen for tuples of a certain size (a guess to speed things up)
    guard (rule1 nn)        --4th: we screen for tuples that conform to rule 1
    return nn

rule1 :: [Int] -> Bool
rule1 nn = {-# SCC "rule1" #-}
    null . filter ((>1) . snd)                --confirm that there's only one subgroup that sums to any given sum
         . filter ((length nn==) . snd . fst) --the last column tells us how many subgroups sum to a given sum
         . assocs                             --run the dynamic programming algorithm and generate a table
         $ dynamic nn

dynamic :: [Int] -> Array (Int,Int) Int
dynamic ns = {-# SCC "dynamic" #-} table
  where
    (len, maxSum) = (length &&& sum) ns
    table = array ((0,0),(maxSum,len))
        [ ((s,i),x) | s <- [0..maxSum], i <- [0..len], let x = value (s,i) ]
    elements = listArray (0,len) (0:ns)
    value (s,i)
        | i == 0 || s == 0 = 0
        | s == m = table ! (s,i-1) + 1
        | s > m = s <= sum (take i ns) ?:
            (table ! (s,i-1) + table ! ((s-m),i-1), 0)
        | otherwise = 0
        where
            m = elements ! i
Stats on heap allocation, garbage collection and time elapsed:
% ghc -O2 --make 103_specialsubset2.hs -rtsopts -prof -auto-all -caf-all -fforce-recomp
[1 of 1] Compiling Main ( 103_specialsubset2.hs, 103_specialsubset2.o )
Linking 103_specialsubset2 ...
% time ./103_specialsubset2.hs +RTS -p -sstderr
zsh: permission denied: ./103_specialsubset2.hs
./103_specialsubset2.hs +RTS -p -sstderr 0.00s user 0.00s system 86% cpu 0.002 total
% time ./103_specialsubset2 +RTS -p -sstderr
SOLUTION REDACTED
 172,449,596,840 bytes allocated in the heap
  21,738,677,624 bytes copied during GC
         261,128 bytes maximum residency (74 sample(s))
          55,464 bytes maximum slop
               2 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0    327548 colls,     0 par   27.34s   41.64s     0.0001s    0.0092s
  Gen  1        74 colls,     0 par    0.02s    0.02s     0.0003s    0.0013s

  INIT    time    0.00s  (  0.01s elapsed)
  MUT     time   53.91s  ( 70.60s elapsed)
  GC      time   27.35s  ( 41.66s elapsed)
  RP      time    0.00s  (  0.00s elapsed)
  PROF    time    0.00s  (  0.00s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time   81.26s  (112.27s elapsed)

  %GC     time      33.7%  (37.1% elapsed)

  Alloc rate    3,199,123,974 bytes per MUT second

  Productivity  66.3% of total user, 48.0% of total elapsed

./103_specialsubset2 +RTS -p -sstderr  81.26s user 30.90s system 99% cpu 1:52.29 total
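To make the summary lines easier to read, the derived figures can be recomputed from the raw ones. The sketch below does only that arithmetic; the input numbers are copied from the output above, and the variable names are mine rather than anything GHC defines.

```haskell
-- Raw figures copied from the +RTS -s output above.
mutSeconds, gcSeconds, totalSeconds, bytesAllocated :: Double
mutSeconds     = 53.91
gcSeconds      = 27.35
totalSeconds   = 81.26
bytesAllocated = 172449596840

-- Share of total CPU time spent in the garbage collector, in percent.
gcPercent :: Double
gcPercent = 100 * gcSeconds / totalSeconds

-- Share of total CPU time spent running the program itself (the mutator).
productivity :: Double
productivity = 100 * mutSeconds / totalSeconds

-- Allocation rate in bytes per second of mutator time.
allocRate :: Double
allocRate = bytesAllocated / mutSeconds

main :: IO ()
main = do
  putStrLn ("%GC time:     " ++ show gcPercent)
  putStrLn ("Productivity: " ++ show productivity)
  putStrLn ("Alloc rate:   " ++ show allocRate)
```

The results agree with the reported 33.7%, 66.3% and roughly 3.2 GB per MUT second, up to rounding of the printed times. In other words, about a third of the CPU time goes to collecting short-lived garbage while maximum residency stays tiny.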
Stats on time spent per cost-centre:
Wed Dec 17 23:21 2014 Time and Allocation Profiling Report  (Final)

   103_specialsubset2 +RTS -p -sstderr -RTS

total time  =       15.56 secs   (15565 ticks @ 1000 us, 1 processor)
total alloc = 118,221,354,488 bytes  (excludes profiling overheads)

COST CENTRE        MODULE  %time %alloc
dynamic.value      Main     41.6   17.7
dynamic.table      Main     29.1   37.8
construct          Main     12.9   37.4
rule1              Main     12.4    7.0
dynamic.table.x    Main      1.9    0.0

                                                           individual     inherited
COST CENTRE        MODULE                  no.    entries  %time %alloc   %time %alloc
MAIN               MAIN                     55          0    0.0    0.0   100.0  100.0
main               Main                    111          0    0.0    0.0     0.0    0.0
CAF:main1          Main                    108          0    0.0    0.0     0.0    0.0
main               Main                    110          1    0.0    0.0     0.0    0.0
CAF:main2          Main                    107          0    0.0    0.0     0.0    0.0
main               Main                    112          0    0.0    0.0     0.0    0.0
CAF:main3          Main                    106          0    0.0    0.0     0.0    0.0
main               Main                    113          0    0.0    0.0     0.0    0.0
CAF:construct      Main                    105          0    0.0    0.0   100.0  100.0
construct          Main                    114          1    0.6    0.0   100.0  100.0
construct          Main                    115          1   12.9   37.4    99.4  100.0
rule1              Main                    123     282235    0.6    0.0    86.5   62.6
rule1              Main                    124     282235   12.4    7.0    85.9   62.6
dynamic            Main                    125     282235    0.2    0.0    73.5   55.6
dynamic.elements   Main                    133     282235    0.3    0.1     0.3    0.1
dynamic.len        Main                    129     282235    0.0    0.0     0.0    0.0
dynamic.table      Main                    128     282235   29.1   37.8    72.9   55.5
dynamic.table.x    Main                    130  133204473    1.9    0.0    43.8   17.7
dynamic.value      Main                    131  133204473   41.6   17.7    41.9   17.7
dynamic.value.m    Main                    132  132640003    0.3    0.0     0.3    0.0
dynamic.maxSum     Main                    127     282235    0.0    0.0     0.0    0.0
dynamic.(...)      Main                    126     282235    0.1    0.0     0.1    0.0
dynamic            Main                    122     282235    0.0    0.0     0.0    0.0
construct.nn       Main                    121   12683926    0.0    0.0     0.0    0.0
CAF:main4          Main                    102          0    0.0    0.0     0.0    0.0
construct          Main                    116          0    0.0    0.0     0.0    0.0
construct          Main                    117          0    0.0    0.0     0.0    0.0
CAF:cap            Main                    101          0    0.0    0.0     0.0    0.0
cap                Main                    119          1    0.0    0.0     0.0    0.0
CAF:flr            Main                    100          0    0.0    0.0     0.0    0.0
flr                Main                    118          1    0.0    0.0     0.0    0.0
CAF:step_r1dD      Main                     99          0    0.0    0.0     0.0    0.0
step               Main                    120          1    0.0    0.0     0.0    0.0
CAF                GHC.IO.Handle.FD         96          0    0.0    0.0     0.0    0.0
CAF                GHC.Conc.Signal          93          0    0.0    0.0     0.0    0.0
CAF                GHC.IO.Encoding          91          0    0.0    0.0     0.0    0.0
CAF                GHC.IO.Encoding.Iconv    82          0    0.0    0.0     0.0    0.0
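The entries column can also be turned into a couple of ratios that help put the percentages in context. This is only arithmetic on numbers copied from the report above, under the assumption that "entries" counts how many times each cost centre was entered; the names are mine.

```haskell
-- Entry counts copied from the call-tree report above.
valueEntries, dynamicEntries, nnEntries :: Double
valueEntries   = 133204473  -- dynamic.value
dynamicEntries = 282235     -- dynamic (same count as rule1)
nnEntries      = 12683926   -- construct.nn

-- Average number of times value is entered per call to dynamic.
valuesPerTable :: Double
valuesPerTable = valueEntries / dynamicEntries

-- Percentage of construct.nn entries that go on to reach rule1/dynamic.
reachRule1Percent :: Double
reachRule1Percent = 100 * dynamicEntries / nnEntries

main :: IO ()
main = do
  putStrLn ("value entries per call to dynamic: " ++ show valuesPerTable)
  putStrLn ("% of nn entries reaching rule1:    " ++ show reachRule1Percent)
```

This works out to roughly 472 entries of value per DP table and about 2.2% of the candidate lists reaching rule1. So the 41.6% attributed to value is spread over some 133 million cheap calls rather than a few expensive ones.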
Heap profile:
Allocation by type:
Allocation by constructors:
There is a lot that can be said. In this answer I'll just comment on the nested list comprehensions in the construct function.
To get an idea of what's going on in construct, we'll isolate it and compare it to a nested loop version that you would write in an imperative language. We've removed the rule1 guard to test only the generation of lists.
-- List.hs -- using list comprehensions

import Control.Monad

cap = 55 :: Int
flr = 20 :: Int
step = 1 :: Int

construct :: [[Int]]
construct = do
    a <- [flr..cap]
    b <- [a+step..cap]
    c <- [b+step..a+b-1]
    d <- [c+step..a+b-1]
    e <- [d+step..a+b-1]
    f <- [e+step..a+b-1]
    g <- [f+step..a+b-1]
    guard (a + b + c + d - e - f - g > 0)
    guard (a + b + c + d + e + f + g < 285)
    return [g,f,e,d,c,b,a]
    -- guard (rule1 nn)

main = do
    forM_ construct print
-- Loops.hs -- using imperative looping

import Control.Monad

loop a b f = go a
  where go i | i > b = return ()
             | otherwise = do f i; go (i+1)

cap = 55 :: Int
flr = 20 :: Int
step = 1 :: Int

main =
    loop flr cap $ \a ->
    loop (a+step) cap $ \b ->
    loop (b+step) (a+b-1) $ \c ->
    loop (c+step) (a+b-1) $ \d ->
    loop (d+step) (a+b-1) $ \e ->
    loop (e+step) (a+b-1) $ \f ->
    loop (f+step) (a+b-1) $ \g ->
      if (a+b+c+d-e-f-g > 0) && (a+b+c+d+e+f+g < 285)
        then print [g,f,e,d,c,b,a]
        else return ()
Both programs were compiled with ghc -O2 -rtsopts and run with prog +RTS -s > out.
Here is a summary of the results:

                   Lists.hs    Loops.hs
Heap allocation   44,913 MB    2,740 MB
Max. Residency       44,312      44,312
%GC                   5.8 %       1.7 %
Total Time        9.48 secs   1.43 secs
As you can see, the loop version, which is the way you would write this in a language like C, wins on heap allocation, GC share and total time (maximum residency is identical).
The list comprehension version is cleaner and more composable but also less performant than direct iteration.
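The loop helper in Loops.hs runs an IO action per index, which suits printing. When the body has to compute a result instead, one possible variation is a pure loop that threads a strict accumulator. This is a sketch of my own, not something measured above; the names foldLoop and countPairs are hypothetical, and the example counts pairs over a two-level loop rather than reproducing the full seven-level search.

```haskell
{-# LANGUAGE BangPatterns #-}

-- Pure counterpart of the loop helper: visit every i in [a..b] while
-- threading an accumulator that is forced at each step.
foldLoop :: Int -> Int -> r -> (r -> Int -> r) -> r
foldLoop a b z f = go a z
  where
    go !i !acc
      | i > b     = acc
      | otherwise = go (i + 1) (f acc i)

-- Count pairs (x, y) with lo <= x < y <= hi using two nested loops.
countPairs :: Int -> Int -> Int
countPairs lo hi =
  foldLoop lo hi 0 $ \n x ->
    foldLoop (x + 1) hi n $ \m _ -> m + 1

main :: IO ()
main = print (countPairs 20 55)  -- prints 630
```

The same shape could carry a running minimum instead of a count. Whether it closes the gap for the full problem would need to be measured.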