将stdin中的数字读入Data.Vector.Unboxed.Vector Int64

Question

Given is a text file (for piping) with many numbers divided by a space, like so: 给定是一个文本文件（用于管道），其中许多数字除以空格，如下所示：

234 456 345 ...

What is the best way to read them all into a Data.Vector.Unboxed.Vector Int64 ? 将它们全部读入Data.Vector.Unboxed.Vector Int64的最佳方法是什么？ My current code looks like this: 我当前的代码如下所示：

import Control.Applicative
import Control.Arrow
import Data.Int
import Data.Maybe
import qualified Data.ByteString.Char8 as B
import qualified Data.Vector.Unboxed as V

main :: IO ()
main = do
    v <- readInts <$> B.getContents
    print $ V.maximum v

-- splitted for profiling
readInts :: B.ByteString -> V.Vector Int64
readInts = a >>> b >>> c >>> d

a = B.split ' '
b = mapMaybe (B.readInt >>> liftA fst)
c = map fromIntegral
d = V.fromList

Here is the profiler output 这是分析器输出

    Thu Sep 18 16:08 2014 Time and Allocation Profiling Report  (Final)

       FastReadInts +RTS -p -K800M -RTS

    total time  =        0.51 secs   (505 ticks @ 1000 us, 1 processor)
    total alloc = 1,295,988,256 bytes  (excludes profiling overheads)

COST CENTRE MODULE  %time %alloc

d           Main     74.3    5.2
b           Main      9.9   35.6
a           Main      6.3   40.0
main        Main      4.8    0.0
c           Main      3.2   19.3


                                                        individual     inherited
COST CENTRE MODULE                    no.     entries  %time %alloc   %time %alloc

MAIN        MAIN                       60           0    0.4    0.0   100.0  100.0
 main       Main                      121           0    4.8    0.0    98.2  100.0
  readInts  Main                      123           0    0.0    0.0    93.5  100.0
   a        Main                      131           0    6.1   40.0     6.1   40.0
   b        Main                      129           0    9.9   35.6     9.9   35.6
   c        Main                      127           0    3.2   19.3     3.2   19.3
   d        Main                      125           0   74.3    5.2    74.3    5.2
 CAF        Main                      119           0    0.0    0.0     0.2    0.0
  a         Main                      130           1    0.2    0.0     0.2    0.0
  b         Main                      128           1    0.0    0.0     0.0    0.0
  c         Main                      126           1    0.0    0.0     0.0    0.0
  d         Main                      124           1    0.0    0.0     0.0    0.0
  readInts  Main                      122           1    0.0    0.0     0.0    0.0
  main      Main                      120           1    0.0    0.0     0.0    0.0
 CAF        GHC.IO.Handle.FD          103           0    0.6    0.0     0.6    0.0
 CAF        GHC.IO.Encoding            96           0    0.2    0.0     0.2    0.0
 CAF        GHC.IO.Handle.Internals    93           0    0.0    0.0     0.0    0.0
 CAF        GHC.Conc.Signal            83           0    0.2    0.0     0.2    0.0
 CAF        GHC.IO.Encoding.Iconv      81           0    0.2    0.0     0.2    0.0

The programm is compiled and run this way: 程序编译并以这种方式运行：

ghc -O2 -prof -auto-all -rtsopts FastReadInts.hs
./FastReadInts +RTS -p -K800M < many_numbers.txt

many_numbers.txt is about 14MB large. many_numbers.txt大约14MB。

How can this bottleneck, ie V.fromList , be removed? 如何去除这个瓶颈，即V.fromList ？

Answer 1

It is hard to answer questions like this without some expected level of performance or point of comparison. 没有一些预期的绩效水平或比较点，很难回答这样的问题。 By simply omitting the profiling your code runs in 100ms over an ASCii file of 21MB of random 64-bit numbers, this seems reasonable to me. 通过简单地省略你的代码在一个21MB的随机64位数字的ASCii文件上运行100ms的分析，这对我来说似乎是合理的。

$ time ./so < randoms.txt 
9223350746261547498

real    0m0.109s
user    0m0.094s
sys     0m0.013s

And the generation of the test data: 并生成测试数据：

import System.Random

main = do
    g <- newStdGen
    let rs = take (2^20) $ randomRs (0,2^64) g :: [Integer]
    writeFile "randoms.txt" $ unwords (map show rs)

EDIT: 编辑：

As requested: 按照要求：

import Data.Vector.Unboxed.Mutable as M
...
listToVector :: [Int64] -> V.Vector Int64
listToVector ls = unsafePerformIO $ do
        m <- M.unsafeNew (2^20)
        zipWithM_ (M.unsafeWrite m) [0..(2^20)-1] ls
        V.unsafeFreeze m

Answer 2

Just wanted to note that pre-allocating mutable vector does not impact performance too much. 只是想要注意，预分配可变向量不会对性能产生太大影响。 In most cases run time will be dominated by reading file. 在大多数情况下，运行时间将由读取文件占主导地位。

I have benchmarked both versions on 2^23 numbers and it seems that pre-allocated mutable array is even a bit slower. 我已经对2^23数字的两个版本进行了基准测试，看起来预先分配的可变数组甚至有点慢。

benchmarking V.fromList
time                 49.51 ms   (47.65 ms .. 51.07 ms)
                     0.998 R²   (0.995 R² .. 1.000 R²)
mean                 48.24 ms   (47.82 ms .. 49.01 ms)
std dev              971.5 μs   (329.1 μs .. 1.438 ms)

benchmarking listToVector
time                 109.9 ms   (106.2 ms .. 119.9 ms)
                     0.993 R²   (0.975 R² .. 1.000 R²)
mean                 109.3 ms   (107.6 ms .. 113.8 ms)
std dev              4.041 ms   (1.149 ms .. 6.129 ms)

And here is the code of the benchmark: 以下是基准测试的代码：

import Control.Applicative
import Control.Monad (zipWithM_)
import System.IO.Unsafe
import Data.Int
import qualified Data.ByteString.Char8 as B
import qualified Data.Vector.Unboxed as V
import qualified Data.Vector.Unboxed.Mutable as M

import Criterion.Main


main :: IO ()
main = do
  let readInt x = let Just (i,_) = B.readInt x in fromIntegral i
  nums <- map readInt . B.words <$> B.readFile "randoms.txt"

  defaultMain
    [bench "V.fromList"   $ whnf (V.maximum . V.fromList) nums
    ,bench "listToVector" $ whnf (V.maximum . listToVector) nums
    ]

listToVector :: [Int64] -> V.Vector Int64
listToVector ls = unsafePerformIO $ do
    m <- M.unsafeNew (2^23)
    zipWithM_ (M.unsafeWrite m) [0..(2^23)-1] ls
    V.unsafeFreeze m

将stdin中的数字读入Data.Vector.Unboxed.Vector Int64

问题描述

2 个解决方案

解决方案1
2 已采纳 2014-09-18 18:14:36

解决方案2
1 2014-10-10 08:09:24

将stdin中的数字读入Data.Vector.Unboxed.Vector Int64

问题描述

2 个解决方案

解决方案1 2 已采纳 2014-09-18 18:14:36

解决方案2 1 2014-10-10 08:09:24

解决方案1
2 已采纳 2014-09-18 18:14:36

解决方案2
1 2014-10-10 08:09:24