堆在Haskell溢出

Question

I'm getting Heap exhausted message when running the following short Haskell program on a big enough dataset. 在足够大的数据集上运行以下简短的Haskell程序时，我收到了Heap exhausted消息。 For example, the program fails (with heap overflow) on 20 Mb input file with around 900k lines. 例如，程序在20 Mb输入文件上失败（堆溢出），大约900k行。 The heap size was set (through -with-rtsopts ) to 1 Gb. 堆大小已设置（通过-with-rtsopts ）为1 Gb。 It runs ok if longestCommonSubstrB is defined as something simpler, eg commonPrefix . 如果longestCommonSubstrB被定义为更简单的东西，例如commonPrefix ，它运行正常。 I need to process files in the order of 100 Mb. 我需要处理大约100 Mb的文件。

I compiled the program with the following command line (GHC 7.8.3): 我使用以下命令行编译程序（GHC 7.8.3）：

ghc -Wall -O2 -prof -fprof-auto "-with-rtsopts=-M512M -p -s -h -i0.1" SampleB.hs

I would appreciate any help in making this thing run in a reasonable amount of space (in the order of the input file size), but I would especially appreciate the thought process of finding where the bottleneck is and where and how to force the strictness. 我希望任何帮助使这个东西在合理的空间（按输入文件大小的顺序）运行，但我特别欣赏找到瓶颈所在的思维过程以及在何处以及如何强制严格。

My guess is that somehow forcing longestCommonSubstrB function to evaluate strictly would solve the problem, but I don't know how to do that. 我的猜测是，以某种方式迫使longestCommonSubstrB函数严格评估会解决问题，但我不知道该怎么做。

{-# LANGUAGE BangPatterns #-}
module Main where 
import System.Environment (getArgs)
import qualified Data.ByteString.Lazy.Char8 as B
import Data.List (maximumBy, sort)
import Data.Function (on)
import Data.Char (isSpace)

-- | Returns a list of lexicon items, i.e. [[w1,w2,w3]]
readLexicon :: FilePath -> IO [[B.ByteString]]
readLexicon filename = do 
    text <- B.readFile filename
    return $ map (B.split '\t' . stripR) . B.lines $ text
    where
        stripR = B.reverse . B.dropWhile isSpace . B.reverse

transformOne :: [B.ByteString] -> B.ByteString
transformOne (w1:w2:w3:[]) = 
    B.intercalate (B.pack "|") [w1, longestCommonSubstrB w2 w1, w3]
transformOne a = error $ "transformOne: unexpected tuple " ++ show a

longestCommonSubstrB :: B.ByteString -> B.ByteString -> B.ByteString
longestCommonSubstrB xs ys = maximumBy (compare `on` B.length) . concat $ 
    [f xs' ys | xs' <- B.tails xs] ++
    [f xs ys' | ys' <- tail $ B.tails ys]
  where f xs' ys' = scanl g B.empty $ B.zip xs' ys'
        g z (x, y) = if x == y 
                then z `B.snoc` x
                else B.empty

main :: IO ()
main = do
    (input:output:_) <- getArgs
    lexicon <- readLexicon input
    let flattened = B.unlines . sort . map transformOne $ lexicon
    B.writeFile output flattened

This is the profile ouput for the test dataset (100k lines, heap size set to 1 GB, ie generateSample.exe 100000 , the resulting file size is 2.38 MB): 这是测试数据集的配置文件输出（100k行，堆大小设置为1 GB，即generateSample.exe 100000 ，生成的文件大小为2.38 MB）：

Heap profile over time: 随时间变化的堆配置文件：

Execution statistics: 执行统计：

   3,505,737,588 bytes allocated in the heap
     785,283,180 bytes copied during GC
      62,390,372 bytes maximum residency (44 sample(s))
         216,592 bytes maximum slop
              96 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0      6697 colls,     0 par    1.05s    1.03s     0.0002s    0.0013s
  Gen  1        44 colls,     0 par    4.14s    3.99s     0.0906s    0.1935s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    7.80s  (  9.17s elapsed)
  GC      time    3.75s  (  3.67s elapsed)
  RP      time    0.00s  (  0.00s elapsed)
  PROF    time    1.44s  (  1.35s elapsed)
  EXIT    time    0.02s  (  0.00s elapsed)
  Total   time   13.02s  ( 12.85s elapsed)

  %GC     time      28.8%  (28.6% elapsed)

  Alloc rate    449,633,678 bytes per MUT second

  Productivity  60.1% of total user, 60.9% of total elapsed

Time and Allocation Profiling Report: 时间和分配情况分析报告：

       SampleB.exe +RTS -M1G -p -s -h -i0.1 -RTS sample.txt sample_out.txt

    total time  =        3.97 secs   (3967 ticks @ 1000 us, 1 processor)
    total alloc = 2,321,595,564 bytes  (excludes profiling overheads)

COST CENTRE            MODULE  %time %alloc

longestCommonSubstrB   Main     43.3   33.1
longestCommonSubstrB.f Main     21.5   43.6
main.flattened         Main     17.5    5.1
main                   Main      6.6    5.8
longestCommonSubstrB.g Main      5.0    5.8
readLexicon            Main      2.5    2.8
transformOne           Main      1.8    1.7
readLexicon.stripR     Main      1.8    1.9


                                                                            individual     inherited
COST CENTRE                  MODULE                       no.     entries  %time %alloc   %time %alloc

MAIN                         MAIN                          45           0    0.1    0.0   100.0  100.0
 main                        Main                          91           0    6.6    5.8    99.9  100.0
  main.flattened             Main                          93           1   17.5    5.1    89.1   89.4
   transformOne              Main                          95      100000    1.8    1.7    71.6   84.3
    longestCommonSubstrB     Main                         100      100000   43.3   33.1    69.8   82.5
     longestCommonSubstrB.f  Main                         101     1400000   21.5   43.6    26.5   49.5
      longestCommonSubstrB.g Main                         104     4200000    5.0    5.8     5.0    5.8
  readLexicon                Main                          92           1    2.5    2.8     4.2    4.8
   readLexicon.stripR        Main                          98           0    1.8    1.9     1.8    1.9
 CAF                         GHC.IO.Encoding.CodePage      80           0    0.0    0.0     0.0    0.0
 CAF                         GHC.IO.Encoding               74           0    0.0    0.0     0.0    0.0
 CAF                         GHC.IO.FD                     70           0    0.0    0.0     0.0    0.0
 CAF                         GHC.IO.Handle.FD              66           0    0.0    0.0     0.0    0.0
 CAF                         System.Environment            65           0    0.0    0.0     0.0    0.0
 CAF                         Data.ByteString.Lazy.Char8    54           0    0.0    0.0     0.0    0.0
 CAF                         Main                          52           0    0.0    0.0     0.0    0.0
  transformOne               Main                          99           0    0.0    0.0     0.0    0.0
  readLexicon                Main                          96           0    0.0    0.0     0.0    0.0
   readLexicon.stripR        Main                          97           1    0.0    0.0     0.0    0.0
  main                       Main                          90           1    0.0    0.0     0.0    0.0

UPDATE: The following program can be used to generate sample data. 更新：以下程序可用于生成样本数据。 It expects one argument, the number of lines in the generated dataset. 它需要一个参数，即生成的数据集中的行数。 The generated data will be saved to the sample.txt file. 生成的数据将保存到sample.txt文件中。 When I generate 900k lines dataset with it (by running generateSample.exe 900000 ), the produced dataset makes the above program fail with heap overflow (the heap size was set to 1 GB). 当我用它生成900k行数据集时（通过运行generateSample.exe 900000 ），生成的数据集使上述程序失败并发生堆溢出（堆大小设置为1 GB）。 The resulting dataset is around 20 MB. 生成的数据集大约为20 MB。

module Main where 
import System.Environment (getArgs)
import Data.List (intercalate, permutations)

generate :: Int -> [(String,String,String)]
generate n = take n $ zip3 (f "banana") (f "ruanaba") (f "kikiriki")
    where
        f = cycle . permutations

main :: IO ()
main = do
    (n:_) <- getArgs
    let flattened = unlines . map f $ generate (read n :: Int)
    writeFile "sample.txt" flattened 
    where
        f (w1,w2,w3) = intercalate "\t" [w1, w2, w3]

Answer 1

It seems to me you've implemented a naive longest common substring, with terrible space complexity (at least O(n^2)). 在我看来，你已经实现了一个天真的最长公共子串，具有可怕的空间复杂性（至少O（n ^ 2））。 Strictness has nothing to do with it. 严格与它无关。

You'll want to implement a dynamic programming algo. 您将需要实现动态编程算法。 You may find inspiration in the string-similarity package, or in the lcs function in the guts of the Diff package. 您可以在字符串相似性包中找到灵感，或者在Diff包的内容中的lcs函数中找到灵感。

堆在Haskell溢出

问题描述

1 个解决方案

解决方案1
0 2016-03-19 05:00:44

堆在Haskell溢出

问题描述

1 个解决方案

解决方案1 0 2016-03-19 05:00:44

解决方案1
0 2016-03-19 05:00:44