[英]Heap overflow in Haskell
I'm getting Heap exhausted
message when running the following short Haskell program on a big enough dataset. 在足够大的数据集上运行以下简短的Haskell程序时,我收到了
Heap exhausted
消息。 For example, the program fails (with heap overflow) on 20 Mb input file with around 900k lines. 例如,程序在20 Mb输入文件上失败(堆溢出),大约900k行。 The heap size was set (through
-with-rtsopts
) to 1 Gb. 堆大小已设置(通过
-with-rtsopts
)为1 Gb。 It runs ok if longestCommonSubstrB
is defined as something simpler, eg commonPrefix
. 如果
longestCommonSubstrB
被定义为更简单的东西,例如commonPrefix
,它运行正常。 I need to process files in the order of 100 Mb. 我需要处理大约100 Mb的文件。
I compiled the program with the following command line (GHC 7.8.3): 我使用以下命令行编译程序(GHC 7.8.3):
ghc -Wall -O2 -prof -fprof-auto "-with-rtsopts=-M512M -p -s -h -i0.1" SampleB.hs
I would appreciate any help in making this thing run in a reasonable amount of space (in the order of the input file size), but I would especially appreciate the thought process of finding where the bottleneck is and where and how to force the strictness. 我希望任何帮助使这个东西在合理的空间(按输入文件大小的顺序)运行,但我特别欣赏找到瓶颈所在的思维过程以及在何处以及如何强制严格。
My guess is that somehow forcing longestCommonSubstrB
function to evaluate strictly would solve the problem, but I don't know how to do that. 我的猜测是,以某种方式迫使
longestCommonSubstrB
函数严格评估会解决问题,但我不知道该怎么做。
{-# LANGUAGE BangPatterns #-}
module Main where
import System.Environment (getArgs)
import qualified Data.ByteString.Lazy.Char8 as B
import Data.List (maximumBy, sort)
import Data.Function (on)
import Data.Char (isSpace)
-- | Returns a list of lexicon items, i.e. [[w1,w2,w3]]
readLexicon :: FilePath -> IO [[B.ByteString]]
readLexicon filename = do
text <- B.readFile filename
return $ map (B.split '\t' . stripR) . B.lines $ text
where
stripR = B.reverse . B.dropWhile isSpace . B.reverse
transformOne :: [B.ByteString] -> B.ByteString
transformOne (w1:w2:w3:[]) =
B.intercalate (B.pack "|") [w1, longestCommonSubstrB w2 w1, w3]
transformOne a = error $ "transformOne: unexpected tuple " ++ show a
longestCommonSubstrB :: B.ByteString -> B.ByteString -> B.ByteString
longestCommonSubstrB xs ys = maximumBy (compare `on` B.length) . concat $
[f xs' ys | xs' <- B.tails xs] ++
[f xs ys' | ys' <- tail $ B.tails ys]
where f xs' ys' = scanl g B.empty $ B.zip xs' ys'
g z (x, y) = if x == y
then z `B.snoc` x
else B.empty
main :: IO ()
main = do
(input:output:_) <- getArgs
lexicon <- readLexicon input
let flattened = B.unlines . sort . map transformOne $ lexicon
B.writeFile output flattened
This is the profile ouput for the test dataset (100k lines, heap size set to 1 GB, ie generateSample.exe 100000
, the resulting file size is 2.38 MB): 这是测试数据集的配置文件输出(100k行,堆大小设置为1 GB,即
generateSample.exe 100000
,生成的文件大小为2.38 MB):
Heap profile over time: 随时间变化的堆配置文件:
Execution statistics: 执行统计:
3,505,737,588 bytes allocated in the heap
785,283,180 bytes copied during GC
62,390,372 bytes maximum residency (44 sample(s))
216,592 bytes maximum slop
96 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 6697 colls, 0 par 1.05s 1.03s 0.0002s 0.0013s
Gen 1 44 colls, 0 par 4.14s 3.99s 0.0906s 0.1935s
INIT time 0.00s ( 0.00s elapsed)
MUT time 7.80s ( 9.17s elapsed)
GC time 3.75s ( 3.67s elapsed)
RP time 0.00s ( 0.00s elapsed)
PROF time 1.44s ( 1.35s elapsed)
EXIT time 0.02s ( 0.00s elapsed)
Total time 13.02s ( 12.85s elapsed)
%GC time 28.8% (28.6% elapsed)
Alloc rate 449,633,678 bytes per MUT second
Productivity 60.1% of total user, 60.9% of total elapsed
Time and Allocation Profiling Report: 时间和分配情况分析报告:
SampleB.exe +RTS -M1G -p -s -h -i0.1 -RTS sample.txt sample_out.txt
total time = 3.97 secs (3967 ticks @ 1000 us, 1 processor)
total alloc = 2,321,595,564 bytes (excludes profiling overheads)
COST CENTRE MODULE %time %alloc
longestCommonSubstrB Main 43.3 33.1
longestCommonSubstrB.f Main 21.5 43.6
main.flattened Main 17.5 5.1
main Main 6.6 5.8
longestCommonSubstrB.g Main 5.0 5.8
readLexicon Main 2.5 2.8
transformOne Main 1.8 1.7
readLexicon.stripR Main 1.8 1.9
individual inherited
COST CENTRE MODULE no. entries %time %alloc %time %alloc
MAIN MAIN 45 0 0.1 0.0 100.0 100.0
main Main 91 0 6.6 5.8 99.9 100.0
main.flattened Main 93 1 17.5 5.1 89.1 89.4
transformOne Main 95 100000 1.8 1.7 71.6 84.3
longestCommonSubstrB Main 100 100000 43.3 33.1 69.8 82.5
longestCommonSubstrB.f Main 101 1400000 21.5 43.6 26.5 49.5
longestCommonSubstrB.g Main 104 4200000 5.0 5.8 5.0 5.8
readLexicon Main 92 1 2.5 2.8 4.2 4.8
readLexicon.stripR Main 98 0 1.8 1.9 1.8 1.9
CAF GHC.IO.Encoding.CodePage 80 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Encoding 74 0 0.0 0.0 0.0 0.0
CAF GHC.IO.FD 70 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Handle.FD 66 0 0.0 0.0 0.0 0.0
CAF System.Environment 65 0 0.0 0.0 0.0 0.0
CAF Data.ByteString.Lazy.Char8 54 0 0.0 0.0 0.0 0.0
CAF Main 52 0 0.0 0.0 0.0 0.0
transformOne Main 99 0 0.0 0.0 0.0 0.0
readLexicon Main 96 0 0.0 0.0 0.0 0.0
readLexicon.stripR Main 97 1 0.0 0.0 0.0 0.0
main Main 90 1 0.0 0.0 0.0 0.0
UPDATE: The following program can be used to generate sample data. 更新:以下程序可用于生成样本数据。 It expects one argument, the number of lines in the generated dataset.
它需要一个参数,即生成的数据集中的行数。 The generated data will be saved to the
sample.txt
file. 生成的数据将保存到
sample.txt
文件中。 When I generate 900k lines dataset with it (by running generateSample.exe 900000
), the produced dataset makes the above program fail with heap overflow (the heap size was set to 1 GB). 当我用它生成900k行数据集时(通过运行
generateSample.exe 900000
),生成的数据集使上述程序失败并发生堆溢出(堆大小设置为1 GB)。 The resulting dataset is around 20 MB. 生成的数据集大约为20 MB。
module Main where
import System.Environment (getArgs)
import Data.List (intercalate, permutations)
generate :: Int -> [(String,String,String)]
generate n = take n $ zip3 (f "banana") (f "ruanaba") (f "kikiriki")
where
f = cycle . permutations
main :: IO ()
main = do
(n:_) <- getArgs
let flattened = unlines . map f $ generate (read n :: Int)
writeFile "sample.txt" flattened
where
f (w1,w2,w3) = intercalate "\t" [w1, w2, w3]
It seems to me you've implemented a naive longest common substring, with terrible space complexity (at least O(n^2)). 在我看来,你已经实现了一个天真的最长公共子串,具有可怕的空间复杂性(至少O(n ^ 2))。 Strictness has nothing to do with it.
严格与它无关。
You'll want to implement a dynamic programming algo. 您将需要实现动态编程算法。 You may find inspiration in the string-similarity package, or in the lcs function in the guts of the Diff package.
您可以在字符串相似性包中找到灵感,或者在Diff包的内容中的lcs函数中找到灵感。
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.