Using GHC's profiling stats/charts to identify trouble-areas / improve performance of Haskell code

TL;DR: Based on the Haskell code and its associated profiling data below, what conclusions can we draw that let us modify/improve it so we can narrow the performance gap vs. the same algorithm written in imperative languages (namely C++ / Python / C# but the specific language isn't important)?

Background

I wrote the following piece of code as an answer to a question on a popular site which contains many questions of a programming and/or mathematical nature. (You've probably heard of this site, whose name is pronounced "oiler" by some, "yoolurr" by others.) Since the code below is a solution to one of the problems, I'm intentionally avoiding any mention of the site's name or any specific terms in the problem. That said, I'm talking about problem one hundred and three.

(In fact, I've seen many solutions in the site's forums from resident Haskell wizards :P)

Why did I choose to profile this code?

This was the first problem (on said site) in which I encountered a difference in performance (as measured by execution time) between Haskell code and C++/Python/C# code (when both use a similar algorithm). In fact, for all of the other problems so far (I've done ~100 problems, though not sequentially), optimized Haskell code was pretty much neck-and-neck with the fastest C++ solutions, ceteris paribus for the algorithm, of course.

However, the posts in the forum for this particular problem would indicate that the same algorithm in these other languages typically requires at most one or two seconds, with the longest taking 10-15 sec (assuming the same starting parameters; I'm ignoring the very naive algorithms that take 2-3 min+). In contrast, the Haskell code below required ~50 sec on my (decent) computer (with profiling disabled; with profiling enabled, it takes ~2 min, as you can see below; note: the exec time was identical when compiling with -fllvm). Specs: i5 2.4 GHz laptop, 8 GB RAM.

In an effort to learn Haskell in a way that it can become a viable substitute for the imperative languages, one of my aims in solving these problems is learning to write code that, to the extent possible, has performance that's on par with those imperative languages. In that context, I still consider the problem as yet unsolved by me (since there's roughly a 25x difference in performance!).

What have I done so far?

In addition to the obvious step of streamlining the code itself (to the best of my ability), I've also performed the standard profiling exercises that are recommended in "Real World Haskell".

But I'm having a hard time drawing conclusions that tell me which pieces need to be modified. That's where I'm hoping folks might be able to help provide some guidance.

Description of the problem:

I'd refer you to the page for problem one hundred and three on the aforementioned site but here's a brief summary: the goal is to find a group of seven numbers such that any two disjoint subgroups (of that group) satisfy the following two properties (I'm trying to avoid using the 'set' word for reasons mentioned above...):

  • no two subgroups sum to the same amount
  • the subgroup with more elements has a larger sum (in other words, the sum of the smallest four elements must be greater than the sum of the largest three elements).

In particular, we are trying to find the group of seven numbers with the smallest sum.
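To make the two rules concrete, here is a minimal brute-force sketch of them. It is not the algorithm used in this question, only a reference check that is practical for very small groups. The names disjointPairs, rule1Brute, rule2Brute and isSpecial are hypothetical, and the input is assumed to hold distinct positive integers.

```haskell
import Data.List (subsequences)

-- | All pairs of non-empty, disjoint subgroups of the input.
-- Assumption: the input elements are distinct.
disjointPairs :: [Int] -> [([Int], [Int])]
disjointPairs xs =
  [ (b, c)
  | b <- nonEmpty xs
  , c <- nonEmpty (filter (`notElem` b) xs)
  ]
  where nonEmpty = filter (not . null) . subsequences

-- | Rule 1: no two disjoint subgroups have the same sum.
rule1Brute :: [Int] -> Bool
rule1Brute xs = and [ sum b /= sum c | (b, c) <- disjointPairs xs ]

-- | Rule 2: the subgroup with more elements has the larger sum.
rule2Brute :: [Int] -> Bool
rule2Brute xs =
  and [ sum b > sum c | (b, c) <- disjointPairs xs, length b > length c ]

-- | A group is acceptable when both rules hold.
isSpecial :: [Int] -> Bool
isSpecial xs = rule1Brute xs && rule2Brute xs

main :: IO ()
main = do
  print (isSpecial [2, 3, 4])  -- both rules hold
  print (isSpecial [1, 2, 3])  -- rule 1 fails: 1 + 2 == 3
  print (isSpecial [1, 2, 4])  -- rule 2 fails: 1 + 2 < 4
```

Enumerating every pair of subgroups grows exponentially with the group size, which is why the code in this question filters on rule 2 first and leaves the expensive rule 1 check for last.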

My (admittedly weak) observations

A warning: some of these comments may well be totally wrong but I wanted to at least take a stab at interpreting the profiling data based on what I read in Real World Haskell and other profiling-related posts on SO.

  • There does indeed seem to be an efficiency issue seeing as how one-third of the time is spent doing garbage collection (37.1%). The first table of figures shows that ~172 GB is allocated in the heap, which seems horrible... (Maybe there's a better structure / function to use for implementing the dynamic programming solution?)
  • Not surprisingly, the vast majority (83.1%) of time is spent checking rule 1: (i) 41.6% in the value sub-function, which determines values to fill in the dynamic programming ("DP") table, (ii) 29.1% in the table function, which generates the DP table and (iii) 12.4% in the rule1 function, which checks the resulting DP table to make sure that a given sum can only be calculated in one way (i.e., from one subgroup).
  • However, I did find it surprising that more time was spent in the value function relative to the table and rule1 functions given that it's the only one of the three which doesn't construct an array or filter through a large number of elements (it's really only performing O(1) lookups and making comparisons between Int types, which you'd think would be relatively quick). So this is a potential problem area. That said, it's unlikely that the value function is driving the high heap-allocation.
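On the question of a different structure for the rule 1 check, one option is to track reachable sums in a bitset instead of a two-dimensional table. The sketch below is a swapped-in technique, not the DP used in this question, and it has not been benchmarked against it. It relies on the fact that, for distinct positive integers, "no two disjoint subgroups share a sum" is equivalent to "all subgroup sums are distinct" (remove the common elements from any two subgroups with equal sums). The name distinctSums is hypothetical.

```haskell
import Data.Bits (shiftL, (.&.), (.|.))

-- | True when every subgroup of the input has a different sum.
-- Bit s of the accumulator is set when some subgroup seen so far sums to s.
-- Assumption: all elements are strictly positive.
distinctSums :: [Int] -> Bool
distinctSums = go 1              -- bit 0: the empty subgroup sums to 0
  where
    go :: Integer -> [Int] -> Bool
    go _    []     = True
    go bits (x:xs)
      | bits .&. shifted /= 0 = False   -- some sum is reachable both with and without x
      | otherwise             = go (bits .|. shifted) xs
      where shifted = bits `shiftL` x   -- every existing sum, plus x

main :: IO ()
main = do
  print (distinctSums [2, 3, 4])  -- all subgroup sums differ
  print (distinctSums [1, 2, 3])  -- 1 + 2 == 3
```

Each element costs one shift, one AND and one OR on an Integer whose width is about the total of the group, so no per-cell list or boxed array is built. Whether that is enough to close the gap described above would need to be measured.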

Frankly, I'm not sure what to make of the three charts.

Heap profile chart (i.e., the first chart below):

  • I'm honestly not sure what is represented by the red area marked as Pinned. It makes sense that the dynamic function has a "spiky" memory allocation because it's called every time the construct function generates a tuple that meets the first three criteria and, each time it's called, it creates a decently large DP array. Also, I'd think that the allocation of memory to store the tuples (generated by construct) wouldn't be flat over the course of the program.
  • Pending clarification of the "Pinned" red area, I'm not sure this one tells us anything useful.

Allocation by type and allocation by constructor:

  • I suspect that the ARR_WORDS (which represents a ByteString or unboxed Array according to the GHC docs) represents the low-level execution of the construction of the DP array (in the table function). But I'm not 100% sure.
  • I'm not sure what the FROZEN and STATIC pointer categories correspond to.
  • Like I said, I'm really not sure how to interpret the charts as nothing jumps out (to me) as unexpected.

The code and the profiling results

Without further ado, here's the code with comments explaining my algorithm. I've tried to make sure the code doesn't run off the right side of the code box, but some of the comments do require scrolling (sorry).

{-# LANGUAGE NoImplicitPrelude #-}
{-# OPTIONS_GHC -Wall #-}

import CorePrelude
import Data.Array
import Data.List
import Data.Bool.HT ((?:))
import Control.Monad (guard)

main = print (minimum construct)

cap = 55 :: Int
flr = 20 :: Int
step = 1 :: Int

--we enumerate tuples that are potentially valid and then
--filter for valid ones; we perform the most computationally
--expensive step (i.e., rule 1) at the very end
construct :: [[Int]]
construct = {-# SCC "construct" #-} do
  a <- [flr..cap]                         --1st: we construct potentially valid tuples while applying a
  b <- [a+step..cap]                      --constraint on the upper bound of any element as implied by rule 2
  c <- [b+step..a+b-1]
  d <- [c+step..a+b-1]
  e <- [d+step..a+b-1]
  f <- [e+step..a+b-1]
  g <- [f+step..a+b-1]
  guard (a + b + c + d - e - f - g > 0)   --2nd: we screen for tuples that completely conform to rule 2
  let nn = [g,f,e,d,c,b,a]
  guard (sum nn < 285)                    --3rd: we screen for tuples of a certain size (a guess to speed things up)
  guard (rule1 nn)                        --4th: we screen for tuples that conform to rule 1
  return nn

rule1 :: [Int] -> Bool
rule1 nn = {-# SCC "rule1" #-} 
    null . filter ((>1) . snd)           --confirm that there's only one subgroup that sums to any given sum
  . filter ((length nn==) . snd . fst)   --the last column tells us how many subgroups sum to a given sum
  . assocs                               --run the dynamic programming algorithm and generate a table
  $ dynamic nn

dynamic :: [Int] -> Array (Int,Int) Int
dynamic ns = {-# SCC "dynamic" #-} table
  where
    (len, maxSum) = (length &&& sum) ns
    table = array ((0,0),(maxSum,len)) 
      [ ((s,i),x) | s <- [0..maxSum], i <- [0..len], let x = value (s,i) ]
    elements = listArray (0,len) (0:ns)
    value (s,i)
      | i == 0 || s == 0 = 0
      | s ==  m = table ! (s,i-1) + 1
      | s > m = s <= sum (take i ns) ?: 
          (table ! (s,i-1) + table ! ((s-m),i-1), 0)
      | otherwise = 0
      where
        m = elements ! i
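One detail in the value function above is that `sum (take i ns)` is recomputed for every table cell with s > m. A minimal sketch of computing those prefix sums once per call to dynamic is shown here; the name prefixSums is hypothetical, and whether this changes the profile noticeably has not been measured.

```haskell
import Data.Array (Array, listArray, (!))

-- | prefixSums ns ! i is the sum of the first i elements of ns,
-- i.e. the same value as sum (take i ns), computed once up front.
prefixSums :: [Int] -> Array Int Int
prefixSums ns = listArray (0, length ns) (scanl (+) 0 ns)

main :: IO ()
main = do
  let ns = [45, 42, 40]   -- arbitrary illustrative values
      ps = prefixSums ns
  -- The two lists printed below are equal; the first uses O(1) lookups.
  print [ ps ! i | i <- [0 .. length ns] ]
  print [ sum (take i ns) | i <- [0 .. length ns] ]
```

Inside dynamic, such an array could sit next to elements in the where clause, so that the guard becomes a lookup rather than a fresh traversal of ns.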

Stats on heap allocation, garbage collection and time elapsed:

% ghc -O2 --make 103_specialsubset2.hs -rtsopts -prof -auto-all -caf-all -fforce-recomp
[1 of 1] Compiling Main             ( 103_specialsubset2.hs, 103_specialsubset2.o )
Linking 103_specialsubset2 ...
% time ./103_specialsubset2.hs +RTS -p -sstderr
zsh: permission denied: ./103_specialsubset2.hs
./103_specialsubset2.hs +RTS -p -sstderr  0.00s user 0.00s system 86% cpu 0.002 total
% time ./103_specialsubset2 +RTS -p -sstderr
SOLUTION REDACTED
 172,449,596,840 bytes allocated in the heap
  21,738,677,624 bytes copied during GC
         261,128 bytes maximum residency (74 sample(s))
          55,464 bytes maximum slop
               2 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0     327548 colls,     0 par   27.34s   41.64s     0.0001s    0.0092s
  Gen  1        74 colls,     0 par    0.02s    0.02s     0.0003s    0.0013s

  INIT    time    0.00s  (  0.01s elapsed)
  MUT     time   53.91s  ( 70.60s elapsed)
  GC      time   27.35s  ( 41.66s elapsed)
  RP      time    0.00s  (  0.00s elapsed)
  PROF    time    0.00s  (  0.00s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time   81.26s  (112.27s elapsed)

  %GC     time      33.7%  (37.1% elapsed)

  Alloc rate    3,199,123,974 bytes per MUT second

  Productivity  66.3% of total user, 48.0% of total elapsed

./103_specialsubset2 +RTS -p -sstderr  81.26s user 30.90s system 99% cpu 1:52.29 total

Stats on time spent per cost-centre:

    Wed Dec 17 23:21 2014 Time and Allocation Profiling Report  (Final)

       103_specialsubset2 +RTS -p -sstderr -RTS

    total time  =       15.56 secs   (15565 ticks @ 1000 us, 1 processor)
    total alloc = 118,221,354,488 bytes  (excludes profiling overheads)

COST CENTRE     MODULE  %time %alloc

dynamic.value   Main     41.6   17.7
dynamic.table   Main     29.1   37.8
construct       Main     12.9   37.4
rule1           Main     12.4    7.0
dynamic.table.x Main      1.9    0.0


                                                                    individual     inherited
COST CENTRE               MODULE                  no.     entries  %time %alloc   %time %alloc

MAIN                      MAIN                     55           0    0.0    0.0   100.0  100.0
 main                     Main                    111           0    0.0    0.0     0.0    0.0
 CAF:main1                Main                    108           0    0.0    0.0     0.0    0.0
  main                    Main                    110           1    0.0    0.0     0.0    0.0
 CAF:main2                Main                    107           0    0.0    0.0     0.0    0.0
  main                    Main                    112           0    0.0    0.0     0.0    0.0
 CAF:main3                Main                    106           0    0.0    0.0     0.0    0.0
  main                    Main                    113           0    0.0    0.0     0.0    0.0
 CAF:construct            Main                    105           0    0.0    0.0   100.0  100.0
  construct               Main                    114           1    0.6    0.0   100.0  100.0
   construct              Main                    115           1   12.9   37.4    99.4  100.0
    rule1                 Main                    123      282235    0.6    0.0    86.5   62.6
     rule1                Main                    124      282235   12.4    7.0    85.9   62.6
      dynamic             Main                    125      282235    0.2    0.0    73.5   55.6
       dynamic.elements   Main                    133      282235    0.3    0.1     0.3    0.1
       dynamic.len        Main                    129      282235    0.0    0.0     0.0    0.0
       dynamic.table      Main                    128      282235   29.1   37.8    72.9   55.5
        dynamic.table.x   Main                    130   133204473    1.9    0.0    43.8   17.7
         dynamic.value    Main                    131   133204473   41.6   17.7    41.9   17.7
          dynamic.value.m Main                    132   132640003    0.3    0.0     0.3    0.0
       dynamic.maxSum     Main                    127      282235    0.0    0.0     0.0    0.0
       dynamic.(...)      Main                    126      282235    0.1    0.0     0.1    0.0
    dynamic               Main                    122      282235    0.0    0.0     0.0    0.0
    construct.nn          Main                    121    12683926    0.0    0.0     0.0    0.0
 CAF:main4                Main                    102           0    0.0    0.0     0.0    0.0
  construct               Main                    116           0    0.0    0.0     0.0    0.0
   construct              Main                    117           0    0.0    0.0     0.0    0.0
 CAF:cap                  Main                    101           0    0.0    0.0     0.0    0.0
  cap                     Main                    119           1    0.0    0.0     0.0    0.0
 CAF:flr                  Main                    100           0    0.0    0.0     0.0    0.0
  flr                     Main                    118           1    0.0    0.0     0.0    0.0
 CAF:step_r1dD            Main                     99           0    0.0    0.0     0.0    0.0
  step                    Main                    120           1    0.0    0.0     0.0    0.0
 CAF                      GHC.IO.Handle.FD         96           0    0.0    0.0     0.0    0.0
 CAF                      GHC.Conc.Signal          93           0    0.0    0.0     0.0    0.0
 CAF                      GHC.IO.Encoding          91           0    0.0    0.0     0.0    0.0
 CAF                      GHC.IO.Encoding.Iconv    82           0    0.0    0.0     0.0    0.0

Heap profile:

[Figure: heap profile chart]

Allocation by type:

[Figure: allocation-by-type chart]

Allocation by constructors:

TBD

There is a lot that can be said. In this answer I'll just comment on the nested list comprehensions in the construct function.

To get an idea of what's going on in construct we'll isolate it and compare it to a nested loop version that you would write in an imperative language. We've removed the rule1 guard to test only the generation of lists.

-- List.hs -- using list comprehensions

import Control.Monad

cap = 55 :: Int
flr = 20 :: Int
step = 1 :: Int

construct :: [[Int]]
construct =  do
  a <- [flr..cap]                         
  b <- [a+step..cap]                      
  c <- [b+step..a+b-1]
  d <- [c+step..a+b-1]
  e <- [d+step..a+b-1]
  f <- [e+step..a+b-1]
  g <- [f+step..a+b-1]
  guard (a + b + c + d - e - f - g > 0)
  guard (a + b + c + d + e + f + g < 285)
  return  [g,f,e,d,c,b,a]
  -- guard (rule1 nn)

main = do
  forM_ construct print


-- Loops.hs -- using imperative looping

import Control.Monad

loop a b f = go a
  where go i | i > b     = return ()
             | otherwise = do f i; go (i+1)

cap = 55 :: Int
flr = 20 :: Int
step = 1 :: Int

main =
  loop flr cap $ \a ->
  loop (a+step) cap $ \b ->
  loop (b+step) (a+b-1) $ \c ->
  loop (c+step) (a+b-1) $ \d ->
  loop (d+step) (a+b-1) $ \e ->
  loop (e+step) (a+b-1) $ \f ->
  loop (f+step) (a+b-1) $ \g ->
    if (a+b+c+d-e-f-g > 0) && (a+b+c+d+e+f+g < 285)
      then print [g,f,e,d,c,b,a]
      else return ()

Both programs were compiled with ghc -O2 -rtsopts and run with prog +RTS -s > out.

Here is a summary of the results:

                           List.hs      Loops.hs
  Heap allocation        44,913 MB      2,740 MB
  Max. residency      44,312 bytes  44,312 bytes
  %GC                        5.8 %         1.7 %
  Total time             9.48 secs     1.43 secs

As you can see, the loop version, which is the way you would write this in a language like C, wins in every category except maximum residency, where the two are identical.

The list comprehension version is cleaner and more composable but also less performant than direct iteration.
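Direct iteration does not have to live in IO. As a minimal sketch, the same looping pattern can thread a strict accumulator through nested ranges, which is one way to compute a result (a count, a minimum) instead of printing each candidate. The names foldRange and countPairs are hypothetical, and this variant was not part of the measurements above.

```haskell
-- | Strict left fold over the inclusive range [lo..hi]; a pure
-- counterpart of the monadic loop helper (hypothetical name).
foldRange :: Int -> Int -> (acc -> Int -> acc) -> acc -> acc
foldRange lo hi f = go lo
  where
    go i acc
      | i > hi    = acc
      | otherwise = let acc' = f acc i
                    in acc' `seq` go (i + 1) acc'   -- force before recursing

-- | Count the pairs (a, b) with lo <= a < b <= hi using two nested
-- folds, without building any intermediate list.
countPairs :: Int -> Int -> Int
countPairs lo hi =
  foldRange lo hi
    (\acc a -> foldRange (a + 1) hi (\acc' _ -> acc' + 1) acc)
    0

main :: IO ()
main = do
  -- Both lines print the same number; the second builds a list first.
  print (countPairs 20 55)
  print (length [ () | a <- [20 .. 55 :: Int], _ <- [a + 1 .. 55] ])
```

The inner fold starts from the outer accumulator, so the nesting mirrors the nested loop calls in Loops.hs while staying a pure function.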
