
Haskell lazy I/O and closing files

I've written a small Haskell program to print the MD5 checksums of all files in the current directory (searched recursively); basically a Haskell version of md5deep. All is fine and dandy except when the current directory has a very large number of files, in which case I get an error like:

<program>: <currentFile>: openBinaryFile: resource exhausted (Too many open files)

It seems Haskell's laziness is causing it not to close files, even after its corresponding line of output has been completed.

The relevant code is below. The function of interest is getList.

import Control.Monad (liftM)
import Data.Digest.MD5 (hash)  -- assumed: hash from the Crypto package
import Data.Word (Word8)
import Text.Printf (printf)
import qualified Data.ByteString.Lazy as BS

main :: IO ()
main = putStr . unlines =<< getList "."

getList :: FilePath -> IO [String]
getList p =
    let getFileLine path = liftM (\c -> (hex $ hash $ BS.unpack c) ++ " " ++ path) (BS.readFile path)
    in mapM getFileLine =<< getRecursiveContents p

hex :: [Word8] -> String
hex = concatMap (\x -> printf "%0.2x" (toInteger x))

getRecursiveContents :: FilePath -> IO [FilePath]
-- ^ Just gets the paths to all the files in the given directory.

Are there any ideas on how I could solve this problem?

The entire program is available here: http://haskell.pastebin.com/PAZm0Dcb

Edit: I have plenty of files that don't fit into RAM, so I am not looking for a solution that reads the entire file into memory at once.

You don't need to use any special way of doing IO, you just need to change the order in which you do things. So instead of opening all the files and then processing the content, you open one file and print one line of output at a time.

import Data.Digest.Pure.MD5 (md5)
import qualified Data.ByteString.Lazy as BS

main :: IO ()
main = mapM_ (\path -> putStrLn . fileLine path =<< BS.readFile path) 
   =<< getRecursiveContents "."

fileLine :: FilePath -> BS.ByteString -> String
fileLine path c = hash c ++ " " ++ path

hash :: BS.ByteString -> String 
hash = show . md5

BTW, I happen to be using a different md5 hash lib; the difference is not significant.

The main thing that is going on here is the line:

mapM_ (\path -> putStrLn . fileLine path =<< BS.readFile path)

It opens a single file, consumes the whole content of the file, and prints one line of output. It closes the file because it consumed the whole content. Previously you were delaying the point at which the file was consumed, which delayed the point at which the file was closed.

If you are not quite sure whether you are consuming all the input but want to make sure the file gets closed anyway, then you can use the withFile function from System.IO:

mapM_ (\path -> withFile path ReadMode $ \hnd -> do
                  c <- BS.hGetContents hnd
                  putStrLn (fileLine path c))

The withFile function opens the file and passes the file handle to the body function. It guarantees that the file gets closed when the body returns. This "withBlah" pattern is very common when dealing with expensive resources, and it is directly supported by bracket from Control.Exception.
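As a rough illustration (my own sketch, not the library's actual definition), a withFile-style wrapper can be built from bracket like this:

import Control.Exception (bracket)
import System.IO (Handle, IOMode, openFile, hClose)

-- Acquire the handle, run the body, and close the handle even if
-- the body throws an exception.
withFile' :: FilePath -> IOMode -> (Handle -> IO r) -> IO r
withFile' path mode = bracket (openFile path mode) hClose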

Lazy IO is very bug-prone.

As dons suggested, you should use strict IO.

You can use a tool such as Iteratee to help you structure strict IO code. My favorite tool for this job is monadic lists.

import Control.Monad (when) -- base
import Control.Monad.ListT (ListT) -- List
import Control.Monad.IO.Class (liftIO) -- transformers
import Data.Binary (encode) -- binary
import Data.Digest.Pure.MD5 -- pureMD5
import Data.List.Class (repeat, takeWhile, foldlL) -- List
import System.IO (IOMode(ReadMode), openFile, hClose)
import qualified Data.ByteString.Lazy as BS
import Prelude hiding (repeat, takeWhile)

hashFile :: FilePath -> IO BS.ByteString
hashFile =
    fmap (encode . md5Finalize) . foldlL md5Update md5InitialContext . strictReadFileChunks 1024

strictReadFileChunks :: Int -> FilePath -> ListT IO BS.ByteString
strictReadFileChunks chunkSize filename =
    takeWhile (not . BS.null) $ do
        handle <- liftIO $ openFile filename ReadMode
        repeat () -- this makes the lines below loop
        chunk <- liftIO $ BS.hGet handle chunkSize
        when (BS.null chunk) . liftIO $ hClose handle
        return chunk
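For completeness, a hypothetical driver for hashFile could look like the following; it assumes the question's hex helper and getRecursiveContents are in scope:

main :: IO ()
main = mapM_ report =<< getRecursiveContents "."
  where
    -- hashFile yields the binary-encoded digest, so unpack it to
    -- bytes and render it with the question's hex helper
    report path = do
      digest <- hashFile path
      putStrLn (hex (BS.unpack digest) ++ " " ++ path)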

I used the "pureMD5" package here because "Crypto" doesn't seem to offer a "streaming" md5 implementation. 我在这里使用“ pureMD5”包是因为“ Crypto”似乎没有提供“流式” md5实现。

Monadic lists / ListT come from the "List" package on hackage (transformers' and mtl's ListT are broken, and also don't come with useful functions like takeWhile).

NOTE: I've edited my code slightly to reflect the advice in Duncan Coutts's answer. Even after this edit his answer is obviously much better than mine, and doesn't seem to run out of memory in the same way.


Here's my quick attempt at an Iteratee-based version. When I run it on a directory with about 2,000 small (30-80K) files, it's about 30 times faster than your version here and seems to use a bit less memory.

For some reason it still seems to run out of memory on very large files; I don't really understand Iteratee well enough yet to be able to tell why easily.

module Main where

import Control.Monad.State
import Data.Digest.Pure.MD5
import Data.List (sort)
import Data.Word (Word8) 
import System.Directory 
import System.FilePath ((</>))
import qualified Data.ByteString.Lazy as BS

import qualified Data.Iteratee as I
import qualified Data.Iteratee.WrappedByteString as IW

evalIteratee path = evalStateT (I.fileDriver iteratee path) md5InitialContext

iteratee :: I.IterateeG IW.WrappedByteString Word8 (StateT MD5Context IO) MD5Digest
iteratee = I.IterateeG chunk
  where
    chunk s@(I.EOF Nothing) =
      get >>= \ctx -> return $ I.Done (md5Finalize ctx) s
    chunk (I.Chunk c) = do
      modify $ \ctx -> md5Update ctx $ BS.fromChunks $ (:[]) $ IW.unWrap c
      return $ I.Cont (I.IterateeG chunk) Nothing

fileLine :: FilePath -> MD5Digest -> String
fileLine path c = show c ++ " " ++ path

main = mapM_ (\path -> putStrLn . fileLine path =<< evalIteratee path) 
   =<< getRecursiveContents "."

getRecursiveContents :: FilePath -> IO [FilePath]
getRecursiveContents topdir = do
  names <- getDirectoryContents topdir

  let properNames = filter (`notElem` [".", ".."]) names

  paths <- concatForM properNames $ \name -> do
    let path = topdir </> name

    isDirectory <- doesDirectoryExist path
    if isDirectory
      then getRecursiveContents path
      else do
        isFile <- doesFileExist path
        if isFile
          then return [path]
          else return []

  return (sort paths)

concatForM :: (Monad m) => [a1] -> (a1 -> m [a]) -> m [a]
concatForM xs f = liftM concat (forM xs f)

Note that you'll need the iteratee package and TomMD's pureMD5. (And my apologies if I've done something horrifying here; I'm a beginner with this stuff.)

Edit: my assumption was that the user was opening thousands of very small files; it turns out they are very large. Laziness will be essential.

Well, you'll need to use a different IO mechanism. Either:

  • Strict IO (process the files with Data.ByteString or System.IO.Strict); a sketch follows this list, or
  • Iteratee IO (for experts only at the moment).
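To make the first option concrete, here is a minimal sketch under my own assumptions (strict Data.ByteString for the read, the pureMD5 package for the digest). Note the trade-off: a strict read pulls the whole file into memory at once, which the asker's edit rules out for their largest files:

import Data.Digest.Pure.MD5 (md5)
import qualified Data.ByteString as SBS        -- strict
import qualified Data.ByteString.Lazy as BS    -- lazy

-- Strict readFile consumes the whole file and closes the handle
-- immediately, so handles never accumulate across the traversal.
getFileLine :: FilePath -> IO String
getFileLine path = do
  c <- SBS.readFile path
  return (show (md5 (BS.fromChunks [c])) ++ " " ++ path)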

I'd also strongly recommend not using 'unpack', as that destroys the benefit of using bytestrings.

For example, you can replace your lazy IO with System.IO.Strict, yielding:

import qualified System.IO.Strict as S

getList :: FilePath -> IO [String]
getList p = mapM getFileLine =<< getRecursiveContents p
    where
        getFileLine path = liftM (\c -> (hex (hash c)) ++ " " ++ path)
                                 (S.readFile path)

The problem is that mapM is not as lazy as you think - it results in a full list with one element per file path. And the file IO you are using is lazy, so you get a list with one open file per file path.

The simplest solution in this case is to force the evaluation of the hash for each file path. One way to do that is with Control.Exception.evaluate. Forcing to weak head normal form is enough here: producing even the first character of the result string requires computing the full hash, which consumes the whole file and lets it be closed:

getFileLine path = do
  theHash <- liftM (\c -> (hex $ hash $ BS.unpack c) ++ " " ++ path) (BS.readFile path)
  evaluate theHash

As others have pointed out, we're working on a replacement for the current approach to lazy IO that is more general yet still simple.

EDIT: sorry, I thought the problem was with the files, not directory reading/traversal. Ignore this.

No problem, just explicitly open the file (openFile), read the contents (Data.ByteString.Lazy.hGetContents), perform the md5 hash (let !h = md5 contents), and explicitly close the file (hClose).
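A minimal sketch of that recipe (assuming the pureMD5 package, with BangPatterns forcing the digest while the handle is still open; hashOne is a hypothetical helper name):

{-# LANGUAGE BangPatterns #-}
import System.IO (IOMode(ReadMode), openFile, hClose)
import qualified Data.ByteString.Lazy as BS
import Data.Digest.Pure.MD5 (md5)

hashOne :: FilePath -> IO String
hashOne path = do
  h <- openFile path ReadMode
  contents <- BS.hGetContents h
  let !digest = md5 contents  -- bang forces the full hash before hClose
  hClose h
  return (show digest)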

unsafeInterleaveIO?

Yet another solution that comes to mind is to use unsafeInterleaveIO from System.IO.Unsafe. See the reply of Tomasz Zielonka in this thread in Haskell Cafe.

It defers an input-output operation (opening a file) until it is actually required. Thus it is possible to avoid opening all files at once, and instead read and process them sequentially (open them lazily).
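A tiny self-contained illustration of that deferral (my own example, not from the linked thread):

import System.IO.Unsafe (unsafeInterleaveIO)

main :: IO ()
main = do
  -- the wrapped action does not run yet...
  v <- unsafeInterleaveIO (putStrLn "reading!" >> return (42 :: Int))
  putStrLn "nothing has been read so far"
  -- ...it runs here, when v is finally demanded
  print v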

Now, I believe, mapM getFileLine opens all files but does not start reading from them until putStr . unlines. Thus a lot of thunks with open file handles float around; this is the problem. (Please correct me if I am wrong.)

An example

A modified example with unsafeInterleaveIO has now been running against a 100 GB directory for several minutes, in constant space.

import Control.Monad (liftM)
import Data.Digest.Pure.MD5 (md5)
import System.IO.Unsafe (unsafeInterleaveIO)
import qualified Data.ByteString.Lazy as BS

getList :: FilePath -> IO [String]
getList p =
  let getFileLine path =
        liftM (\c -> (show . md5 $ c) ++ " " ++ path)
              (unsafeInterleaveIO $ BS.readFile path)
  in mapM getFileLine =<< getRecursiveContents p

(I switched to the pureMD5 implementation of the hash.)

PS: I am not sure if this is good style. I believe that solutions with iteratees and strict IO are better, but this one is quicker to write. I use it in small scripts, but I'd be afraid of relying on it in a bigger program.
