Reading large files in Haskell?

Question

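I have been trying to read a large file in Haskell.

I need to compress it using a custom algorithm for a university project. Everything worked fine until I started compressing big files.

I have extracted the part of my program that goes wrong, and I present it here as a "Hello, big file" example: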

import qualified System.Environment as System  -- getArgs; stands in for the legacy `System` module
import qualified Data.ByteString.Lazy as BL
import Data.Word

fold_tailrec :: (a -> b -> a) -> a -> [b] -> a
fold_tailrec _ acc [] =
    acc
fold_tailrec foldFun acc (x : xs) =
    fold_tailrec foldFun (foldFun acc x) xs

fold_tailrec' :: (a -> b -> a) -> a -> [b] -> a
fold_tailrec' _ acc [] =
    acc
fold_tailrec' foldFun acc (x : xs) =
    let forceEval = fold_tailrec' foldFun (foldFun acc x) xs in
    seq forceEval forceEval

main :: IO ()
main =
    do
        args <- System.getArgs
        let filename = head args
        byteString <- BL.readFile filename
        let wordsList = BL.unpack byteString
        -- wordsList is supposed to be lazy (bufferized)
        let bytesCount = fold_tailrec (\acc word -> acc + 1) 0 wordsList
        print ("Total bytes in " ++ filename ++ ": " 
               ++ (show bytesCount))

I named this file Test.hs and then ran the following:

$ ls -l toto
-rwxrwxrwx 1 root root 5455108 2011-03-23 19:08 toto
$ ghc --make -O Test.hs
[1 of 1] Compiling Main             ( Test.hs, Test.o )
Linking Test ...
$ ./Test toto
Stack space overflow: current size 8388608 bytes.
Use `+RTS -Ksize -RTS' to increase it.
$ ./Test toto +RTS -K50M -RTS
Stack space overflow: current size 50000000 bytes.
Use `+RTS -Ksize -RTS' to increase it.
$ ./Test toto +RTS -K500M -RTS
"Total bytes in toto: 5455108"
$ time ./Test toto +RTS -K500M -RTS
"Total bytes in toto: 5455108"

real    0m33.453s
user    0m8.917s
sys     0m10.433s

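Can anyone explain why I need 500 megabytes of RAM and 30 seconds of CPU time to walk through a measly 5-megabyte file? What am I doing wrong? Why isn't the [Word8] list buffered, as the ByteString documentation states? And how can I fix this?

I tried defining my own tail-recursive fold instead of using foldl, foldr, or foldl'. I tried forcing the thunks with seq. So far I have had no success.

Any help is appreciated, because I am stuck.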

Answer 1

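The construct "seq x x" is always useless. If y = seq x x and I force y, then this forces x and returns x. That is equivalent to y = x and forcing y. So "seq forceEval forceEval" does nothing more than "forceEval".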
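The point about "seq x x" can be checked with a tiny program. This is only an illustrative sketch: the names x and y and the error message are made up for the example. Binding y = seq x x evaluates nothing by itself, so as long as y is never demanded, x is never forced either.

```haskell
main :: IO ()
main = do
    -- x raises an error if it is ever evaluated.
    let x = error "x was forced" :: Int
        -- seq x x only says "when this value is demanded, evaluate x first,
        -- then return x", which is exactly what demanding x already does.
        y = seq x x
    -- const ignores its second argument, so y is never demanded
    -- and the error inside x is never triggered.
    putStrLn (const "y was never demanded, so x was never forced" y)
```

For seq to be useful, the value being forced and the value being returned need to be different expressions, as in seq a b.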

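As for the mistake in how you use the fold, it is a common one.

You are using a fold to count the bytes in the input. You should use a strict left fold for a sum like this, but your hand-written fold is a lazy left fold. The (acc + 1) is never evaluated, so it builds 5 million nested applications: ((...((0 + 1) + 1) ... + 1) + 1). The result is forced only when it is printed, and the evaluation then tries to descend through 5 million parentheses.

The pending stack therefore has one entry for each Word8. For short inputs it reaches the innermost expression and finds the 0. For long inputs it runs out of GHC's stack space, because the creators and most users of GHC consider an attempt to allocate 5 million stack frames to be, usually, a design error on the programmer's part.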
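The build-up described above can be traced on a three-element input. The sketch below is illustrative only: lazyCount is a made-up name for a counting fold written in the same lazy style as the question's fold_tailrec, and the trace in the comments shows unoptimised, by-hand evaluation (an optimising compiler may evaluate a function this simple more eagerly).

```haskell
-- Lazy counting fold: the accumulator is passed along unevaluated.
lazyCount :: [b] -> Int
lazyCount = go 0
  where
    go acc []       = acc
    go acc (_ : xs) = go (acc + 1) xs

-- Evaluating lazyCount "abc" by hand:
--   go 0 "abc"
--   go (0 + 1) "bc"
--   go ((0 + 1) + 1) "c"
--   go (((0 + 1) + 1) + 1) ""
--   ((0 + 1) + 1) + 1      -- one nested thunk per element, forced only now
--   3

main :: IO ()
main = print (lazyCount "abc")
```

With three elements the nested expression is harmless. With one element per byte of a 5 MB file, forcing it needs one stack frame per pending addition.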

I predict that you can fix this by using "seq":

fold_tailrec' foldFun acc (x : xs) =
    let acc' = foldFun acc x
    in seq acc' (fold_tailrec' foldFun acc' xs)
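Put into a complete program, the suggested fix might look like the sketch below. The names strictFold, countBytes and countBytesStd are illustrative and do not come from the question. The last of them uses the standard foldl' from Data.List, which is the library's strict left fold and should behave the same way as the hand-written one.

```haskell
import Data.List (foldl')
import qualified Data.ByteString.Lazy as BL
import System.Environment (getArgs)

-- Strict left fold: the new accumulator is forced (to weak head normal form)
-- before the recursive call, so no chain of pending (+ 1) thunks builds up.
strictFold :: (a -> b -> a) -> a -> [b] -> a
strictFold _ acc [] = acc
strictFold f acc (x : xs) =
    let acc' = f acc x
    in acc' `seq` strictFold f acc' xs

-- Count bytes with the hand-written strict fold.
countBytes :: BL.ByteString -> Int
countBytes = strictFold (\acc _ -> acc + 1) 0 . BL.unpack

-- The same count using the library's strict left fold.
countBytesStd :: BL.ByteString -> Int
countBytesStd = foldl' (\acc _ -> acc + 1) 0 . BL.unpack

main :: IO ()
main = do
    args <- getArgs
    case args of
        [filename] -> do
            bytes <- BL.readFile filename
            putStrLn ("Total bytes in " ++ filename ++ ": "
                      ++ show (countBytes bytes))
        _ -> putStrLn "usage: Test FILE"
```

If only the length is needed, BL.length returns it directly as an Int64. The fold is spelled out here because the real program presumably does more with each byte than count it.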
