Poor performance parsing binary file in Haskell
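I have a set of binary records packed into a file, and I am reading them with Data.ByteString.Lazy and Data.Binary.Get. With my current implementation, an 8 MB file takes 6 seconds to parse.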
import qualified Data.ByteString.Lazy as BL
import Data.Binary.Get

data Trade = Trade { timestamp :: Int, price :: Int, qty :: Int } deriving (Show)

getTrades = do
    empty <- isEmpty
    if empty
        then return []
        else do
            timestamp <- getWord32le
            price <- getWord32le
            qty <- getWord16le
            rest <- getTrades
            let trade = Trade (fromIntegral timestamp) (fromIntegral price) (fromIntegral qty)
            return (trade : rest)

main :: IO ()
main = do
    input <- BL.readFile "trades.bin"
    let trades = runGet getTrades input
    print $ length trades
What can I do to speed this up?
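Refactoring it slightly (essentially into a left fold) gives much better performance and lowers the GC overhead considerably when parsing an 8388600-byte file.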
{-# LANGUAGE BangPatterns #-}
module Main (main) where

import qualified Data.ByteString.Lazy as BL
import Data.Binary.Get

data Trade = Trade
    { timestamp :: {-# UNPACK #-} !Int
    , price     :: {-# UNPACK #-} !Int
    , qty       :: {-# UNPACK #-} !Int
    } deriving (Show)

getTrade :: Get Trade
getTrade = do
    timestamp <- getWord32le
    price <- getWord32le
    qty <- getWord16le
    return $! Trade (fromIntegral timestamp) (fromIntegral price) (fromIntegral qty)

countTrades :: BL.ByteString -> Int
countTrades input = stepper (0, input) where
    stepper (!count, !buffer)
        | BL.null buffer = count
        | otherwise =
            let (trade, rest, _) = runGetState getTrade buffer 0
            in stepper (count+1, rest)

main :: IO ()
main = do
    input <- BL.readFile "trades.bin"
    let trades = countTrades input
    print trades
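Here are the relevant runtime statistics. Even though the allocation numbers are close, GC activity and maximum heap size differ substantially between the revisions.
All examples here were built with GHC 7.4.1 and -O2.
Because of its excessive stack usage, the original source was run with +RTS -K1G -RTS: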
426,003,680 bytes allocated in the heap
443,141,672 bytes copied during GC
99,305,920 bytes maximum residency (9 sample(s))
203 MB total memory in use (0 MB lost due to fragmentation)
Total time 0.62s (0.81s elapsed)
%GC time 83.3% (86.4% elapsed)
Daniel's revision:
357,851,536 bytes allocated in the heap
220,009,088 bytes copied during GC
40,846,168 bytes maximum residency (8 sample(s))
85 MB total memory in use (0 MB lost due to fragmentation)
Total time 0.24s (0.28s elapsed)
%GC time 69.1% (71.4% elapsed)
This post:
290,725,952 bytes allocated in the heap
109,592 bytes copied during GC
78,704 bytes maximum residency (10 sample(s))
2 MB total memory in use (0 MB lost due to fragmentation)
Total time 0.06s (0.07s elapsed)
%GC time 5.0% (6.0% elapsed)
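The left-fold version above relies on runGetState, which later releases of the binary package mark as deprecated. The following is a minimal sketch of the same record-at-a-time counting loop written against runGetOrFail instead, assuming a binary version that exports it. It was not benchmarked for this thread, so the statistics above do not apply to it. encodeTrades is a hypothetical helper, not part of the original post, that builds sample input in the same 10-byte layout.

```haskell
{-# LANGUAGE BangPatterns #-}
import qualified Data.ByteString.Lazy as BL
import Data.Binary.Get (Get, getWord16le, getWord32le, runGetOrFail)
import Data.Binary.Put (putWord16le, putWord32le, runPut)

data Trade = Trade
    { timestamp :: {-# UNPACK #-} !Int
    , price     :: {-# UNPACK #-} !Int
    , qty       :: {-# UNPACK #-} !Int
    } deriving (Show, Eq)

-- One 10-byte record: two little-endian Word32 values, then one Word16.
getTrade :: Get Trade
getTrade = do
    t <- getWord32le
    p <- getWord32le
    q <- getWord16le
    return $! Trade (fromIntegral t) (fromIntegral p) (fromIntegral q)

-- Count records one at a time with a strict accumulator.
-- A truncated trailing record is reported as Left instead of crashing.
countTrades :: BL.ByteString -> Either String Int
countTrades = go 0
  where
    go !n buf
        | BL.null buf = Right n
        | otherwise   =
            case runGetOrFail getTrade buf of
                Left (_, off, err) ->
                    Left (err ++ " (record " ++ show n ++ ", offset " ++ show off ++ ")")
                Right (rest, _, _trade) -> go (n + 1) rest

-- Hypothetical helper (an assumption for illustration): encode
-- (timestamp, price, qty) triples in the layout that getTrade reads.
encodeTrades :: [(Int, Int, Int)] -> BL.ByteString
encodeTrades ts = runPut (mapM_ putOne ts)
  where
    putOne (t, p, q) = do
        putWord32le (fromIntegral t)
        putWord32le (fromIntegral p)
        putWord16le (fromIntegral q)
```

In GHCi, countTrades (encodeTrades [(1, 100, 5), (2, 101, 7)]) should evaluate to Right 2. Returning Either is a design choice of this sketch; the original countTrades returns a plain Int and would fail at runtime on a truncated file.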
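Your code decodes an 8 MB file in less than a second here (ghc-7.4.1); I compiled with -O2, of course.
However, it needs an excessive amount of stack space. You can reduce the time and space it needs by adding more strictness in the appropriate places and by using an accumulator to collect the trades parsed so far.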
{-# LANGUAGE BangPatterns #-}
module Main (main) where

import qualified Data.ByteString.Lazy as BL
import Data.Binary.Get

data Trade = Trade { timestamp :: {-# UNPACK #-} !Int
                   , price     :: {-# UNPACK #-} !Int
                   , qty       :: {-# UNPACK #-} !Int
                   } deriving (Show)

getTrades :: Get [Trade]
getTrades = go []
  where
    go !acc = do
      empty <- isEmpty
      if empty
        then return $! reverse acc
        else do
          !timestamp <- getWord32le
          !price <- getWord32le
          !qty <- getWord16le
          let !trade = Trade (fromIntegral timestamp) (fromIntegral price) (fromIntegral qty)
          go (trade : acc)

main :: IO ()
main = do
  input <- BL.readFile "trades.bin"
  let trades = runGet getTrades input
  print $ length trades
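The strictness and unpacking make sure that no work is left behind that could come back to bite you later by holding a reference to a part of the ByteString that should already have been forgotten.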
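If you need Trade to have lazy fields, you can still decode via a type with strict fields and map a conversion over the result list, so that you still benefit from the stricter decoding.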
However, the code still spends a lot of time in garbage collection, so further improvements may still be necessary.
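The suggestion about lazy fields can be sketched as follows. This is only an illustration under stated assumptions: StrictTrade, toTrade, getStrictTrades and decodeTrades are hypothetical names that do not appear in the original answer. Decoding goes through the strict, unpacked type, and the conversion to the lazy-field Trade is mapped over the finished list.

```haskell
{-# LANGUAGE BangPatterns #-}
import qualified Data.ByteString.Lazy as BL
import Data.Binary.Get (Get, getWord16le, getWord32le, isEmpty, runGet)

-- The record the rest of the program wants: ordinary lazy fields.
data Trade = Trade { timestamp :: Int, price :: Int, qty :: Int }
    deriving (Show, Eq)

-- Hypothetical decoding-only twin with strict, unpacked fields.
data StrictTrade = StrictTrade
    {-# UNPACK #-} !Int
    {-# UNPACK #-} !Int
    {-# UNPACK #-} !Int

-- Conversion that is mapped over the decoded list afterwards.
toTrade :: StrictTrade -> Trade
toTrade (StrictTrade t p q) = Trade t p q

-- Same accumulator loop as in the answer, but building StrictTrade values.
getStrictTrades :: Get [StrictTrade]
getStrictTrades = go []
  where
    go !acc = do
        done <- isEmpty
        if done
            then return $! reverse acc
            else do
                t <- getWord32le
                p <- getWord32le
                q <- getWord16le
                let !trade = StrictTrade (fromIntegral t) (fromIntegral p) (fromIntegral q)
                go (trade : acc)

-- Decode strictly, then hand out lazy-field records.
decodeTrades :: BL.ByteString -> [Trade]
decodeTrades = map toTrade . runGet getStrictTrades
```

Because map is lazy, each Trade is only built when it is consumed, while the numeric fields it copies were already fully evaluated during decoding.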