Poor performance parsing a binary file in Haskell

Question

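I have a set of binary records packed into a file, and I am reading them with Data.ByteString.Lazy and Data.Binary.Get. With my current implementation, an 8 MB file takes 6 seconds to parse.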

import qualified Data.ByteString.Lazy as BL
import Data.Binary.Get

data Trade = Trade { timestamp :: Int, price :: Int ,  qty :: Int } deriving (Show)

getTrades = do
  empty <- isEmpty
  if empty
    then return []
    else do
      timestamp <- getWord32le          
      price <- getWord32le
      qty <- getWord16le          
      rest <- getTrades
      let trade = Trade (fromIntegral timestamp) (fromIntegral price) (fromIntegral qty)
      return (trade : rest)

main :: IO()
main = do
  input <- BL.readFile "trades.bin" 
  let trades = runGet getTrades input
  print $ length trades

What can I do to speed this up?

Answer 1

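Refactoring it slightly (basically into a left fold) gives much better performance and lowers GC overhead quite a bit when parsing an 8388600-byte file.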

{-# LANGUAGE BangPatterns #-}
module Main (main) where

import qualified Data.ByteString.Lazy as BL
import Data.Binary.Get

data Trade = Trade
  { timestamp :: {-# UNPACK #-} !Int
  , price     :: {-# UNPACK #-} !Int 
  , qty       :: {-# UNPACK #-} !Int
  } deriving (Show)

getTrade :: Get Trade
getTrade = do
  timestamp <- getWord32le
  price     <- getWord32le
  qty       <- getWord16le
  return $! Trade (fromIntegral timestamp) (fromIntegral price) (fromIntegral qty)

countTrades :: BL.ByteString -> Int
countTrades input = stepper (0, input) where
  stepper (!count, !buffer)
    | BL.null buffer = count
    | otherwise      =
        let (trade, rest, _) = runGetState getTrade buffer 0
        in stepper (count+1, rest)

main :: IO()
main = do
  input <- BL.readFile "trades.bin"
  let trades = countTrades input
  print trades
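The code above uses runGetState, which newer releases of the binary package mark as deprecated. As a rough sketch only, and assuming a binary version that provides runGetOrFail, the same left fold could be written as below. The helper name foldTrades is made up for illustration and is not part of any library.

```haskell
{-# LANGUAGE BangPatterns #-}
module Main (main) where

import qualified Data.ByteString.Lazy as BL
import Data.Binary.Get (Get, getWord16le, getWord32le, runGetOrFail)

data Trade = Trade
  { timestamp :: {-# UNPACK #-} !Int
  , price     :: {-# UNPACK #-} !Int
  , qty       :: {-# UNPACK #-} !Int
  } deriving (Show)

-- Decode one 10-byte record (same layout as in the question).
getTrade :: Get Trade
getTrade = do
  t <- getWord32le
  p <- getWord32le
  q <- getWord16le
  return $! Trade (fromIntegral t) (fromIntegral p) (fromIntegral q)

-- Hypothetical helper: a strict left fold over every record in the input.
-- The accumulator is forced on each step, so no chain of thunks builds up.
foldTrades :: (a -> Trade -> a) -> a -> BL.ByteString -> Either String a
foldTrades step = go
  where
    go !acc buffer
      | BL.null buffer = Right acc
      | otherwise =
          case runGetOrFail getTrade buffer of
            Left (_, _, err)       -> Left err   -- e.g. a truncated final record
            Right (rest, _, trade) -> go (step acc trade) rest

-- Counting is one instance of the fold; a sum over a field would be another.
countTrades :: BL.ByteString -> Either String Int
countTrades = foldTrades (\n _ -> n + 1) 0

main :: IO ()
main = do
  input <- BL.readFile "trades.bin"
  either putStrLn print (countTrades input)
```

Unlike the version above, this sketch reports a truncated trailing record as a Left value instead of raising an exception.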

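Here are the associated runtime statistics. Even though the allocation numbers are close, GC time and maximum heap size differ considerably between the revisions.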

All examples here were built with GHC 7.4.1 and -O2.
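Each record in the question's layout is two 32-bit words plus one 16-bit word, i.e. 10 bytes, so an 8388600-byte file holds 838860 records. To reproduce measurements like these, a test file of that size could be generated roughly as follows. This is only a sketch: the field values are invented, and the original test data is not known.

```haskell
module Main (main) where

import qualified Data.ByteString.Lazy as BL
import Data.Binary.Put (Put, putWord16le, putWord32le, runPut)
import Data.Word (Word32)

-- Write one 10-byte little-endian record. All values are made up.
putTrade :: Word32 -> Put
putTrade i = do
  putWord32le i                             -- timestamp
  putWord32le (1000 + i `mod` 50)           -- price
  putWord16le (fromIntegral (i `mod` 100))  -- qty

-- Serialise n consecutive records into a lazy ByteString.
sampleFile :: Word32 -> BL.ByteString
sampleFile n = runPut (mapM_ putTrade [1 .. n])

main :: IO ()
main = BL.writeFile "trades.bin" (sampleFile 838860)
```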

The original source, run with +RTS -K1G -RTS because of its excessive stack space usage:

426,003,680 bytes allocated in the heap
     443,141,672 bytes copied during GC
      99,305,920 bytes maximum residency (9 sample(s))
             203 MB total memory in use (0 MB lost due to fragmentation)

  Total   time    0.62s  (  0.81s elapsed)

  %GC     time      83.3%  (86.4% elapsed)

Daniel's revision:

357,851,536 bytes allocated in the heap
     220,009,088 bytes copied during GC
      40,846,168 bytes maximum residency (8 sample(s))
              85 MB total memory in use (0 MB lost due to fragmentation)

  Total   time    0.24s  (  0.28s elapsed)

  %GC     time      69.1%  (71.4% elapsed)

This post:

290,725,952 bytes allocated in the heap
         109,592 bytes copied during GC
          78,704 bytes maximum residency (10 sample(s))
               2 MB total memory in use (0 MB lost due to fragmentation)

  Total   time    0.06s  (  0.07s elapsed)

  %GC     time       5.0%  (6.0% elapsed)

Answer 2

Your code decodes an 8 MB file in under a second here (ghc-7.4.1); I compiled with -O2, of course.

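However, it needs an excessive amount of stack space. You can reduce the time, stack space, and heap space needed by adding more strictness in the appropriate places and by using an accumulator to collect the trades parsed so far.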

{-# LANGUAGE BangPatterns #-}
module Main (main) where

import qualified Data.ByteString.Lazy as BL
import Data.Binary.Get

data Trade = Trade { timestamp :: {-# UNPACK #-} !Int
                   , price :: {-# UNPACK #-} !Int 
                   , qty :: {-# UNPACK #-} !Int
                   } deriving (Show)

getTrades :: Get [Trade]
getTrades = go []
  where
    go !acc = do
      empty <- isEmpty
      if empty
        then return $! reverse acc
        else do
          !timestamp <- getWord32le
          !price <- getWord32le
          !qty <- getWord16le
          let !trade = Trade (fromIntegral timestamp) (fromIntegral price) (fromIntegral qty)
          go (trade : acc)

main :: IO()
main = do
  input <- BL.readFile "trades.bin"
  let trades = runGet getTrades input
  print $ length trades

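The strictness and unpacking ensure that no work is left behind that could come back to bite you later by holding a reference to part of the ByteString that should already have been forgotten.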

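If you need Trade to have lazy fields, you can still decode via a type with strict fields and map a conversion over the resulting list, so that you still benefit from the stricter decoding.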
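A minimal sketch of that idea, assuming a decoding-only type named StrictTrade and a conversion function toTrade (both names are invented here for illustration):

```haskell
{-# LANGUAGE BangPatterns #-}
module Main (main) where

import qualified Data.ByteString.Lazy as BL
import Data.Binary.Get (Get, getWord16le, getWord32le, isEmpty, runGet)

-- Strict, unpacked type used only while decoding (hypothetical name).
data StrictTrade = StrictTrade {-# UNPACK #-} !Int {-# UNPACK #-} !Int {-# UNPACK #-} !Int

-- The type exposed to the rest of the program, with ordinary lazy fields.
data Trade = Trade { timestamp :: Int, price :: Int, qty :: Int } deriving (Show)

-- Conversion that is mapped over the decoded list.
toTrade :: StrictTrade -> Trade
toTrade (StrictTrade t p q) = Trade t p q

-- Same accumulator loop as above, but building StrictTrade values.
getStrictTrades :: Get [StrictTrade]
getStrictTrades = go []
  where
    go !acc = do
      empty <- isEmpty
      if empty
        then return $! reverse acc
        else do
          t <- getWord32le
          p <- getWord32le
          q <- getWord16le
          let !trade = StrictTrade (fromIntegral t) (fromIntegral p) (fromIntegral q)
          go (trade : acc)

-- Decode strictly, then convert to the lazy-field type.
decodeTrades :: BL.ByteString -> [Trade]
decodeTrades = map toTrade . runGet getStrictTrades

main :: IO ()
main = do
  input <- BL.readFile "trades.bin"
  print $ length (decodeTrades input)
```

The fields of each resulting Trade are already evaluated, because they are copied out of a fully evaluated StrictTrade, so they hold no reference to the input ByteString.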

However, the code still spends a lot of time in garbage collection, so further improvements may still be necessary.

Answer 1
20 2012-03-05 20:40:57

Answer 2
17 accepted 2012-03-05 14:54:22
