Haskell lazy I/O and closing files
I've written a small Haskell program to print the MD5 checksums of all files in the current directory (searched recursively). It is basically a Haskell version of md5deep. All is fine and dandy except when the current directory has a very large number of files, in which case I get an error like:
<program>: <currentFile>: openBinaryFile: resource exhausted (Too many open files)
It seems Haskell's laziness is causing it not to close files, even after the corresponding line of output has been completed.
The relevant code is below. The function of interest is getList.
import Control.Monad (liftM)
import qualified Data.ByteString.Lazy as BS
import Data.Digest.MD5 (hash) -- Crypto package
import Data.Word (Word8)
import Text.Printf (printf)
main :: IO ()
main = putStr . unlines =<< getList "."
getList :: FilePath -> IO [String]
getList p =
  let getFileLine path = liftM (\c -> (hex $ hash $ BS.unpack c) ++ " " ++ path) (BS.readFile path)
  in mapM getFileLine =<< getRecursiveContents p
hex :: [Word8] -> String
hex = concatMap (\x -> printf "%0.2x" (toInteger x))
getRecursiveContents :: FilePath -> IO [FilePath]
-- ^ Just gets the paths to all the files in the given directory.
Are there any ideas on how I could solve this problem?
The entire program is available here: http://haskell.pastebin.com/PAZm0Dcb
Edit: I have plenty of files that don't fit into RAM, so I am not looking for a solution that reads the entire file into memory at once.
You don't need to use any special way of doing IO, you just need to change the order in which you do things. So instead of opening all files and then processing the content, you open one file and print one line of output at a time.
import Data.Digest.Pure.MD5 (md5)
import qualified Data.ByteString.Lazy as BS
main :: IO ()
main = mapM_ (\path -> putStrLn . fileLine path =<< BS.readFile path)
         =<< getRecursiveContents "."
fileLine :: FilePath -> BS.ByteString -> String
fileLine path c = hash c ++ " " ++ path
hash :: BS.ByteString -> String
hash = show . md5
BTW, I happen to be using a different md5 hash lib; the difference is not significant.
The main thing that is going on here is the line:
mapM_ (\path -> putStrLn . fileLine path =<< BS.readFile path)
It opens a single file, consumes the whole content of that file, and prints one line of output. The file gets closed because its whole content has been consumed. Previously you were delaying the point at which each file was consumed, which in turn delayed the point at which it was closed.
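To make the ordering concrete, here is a minimal self-contained sketch of the same "one file, one line of output" loop. It is only an illustration: `checksum` is a hypothetical stand-in for a real MD5 function so that the example needs nothing beyond `base` and `bytestring`, and the files to process are taken from the command line rather than from getRecursiveContents.

```haskell
import qualified Data.ByteString.Lazy as BL
import Data.Word (Word32)
import System.Environment (getArgs)
import Text.Printf (printf)

-- Hypothetical stand-in for an MD5 digest (NOT a real hash function).
-- Like md5, it has to walk the entire lazy ByteString to produce a result.
checksum :: BL.ByteString -> Word32
checksum = BL.foldl' (\acc w -> acc * 31 + fromIntegral w) 7

-- Build one line of output. Printing it forces the checksum, which
-- reads the file to EOF; lazy readFile closes the handle at that point.
fileLine :: FilePath -> BL.ByteString -> String
fileLine path c = printf "%08x  %s" (checksum c) path

-- Open, consume and report each file before moving on to the next,
-- so at most one handle is open at any time.
printSums :: [FilePath] -> IO ()
printSums = mapM_ (\path -> putStrLn . fileLine path =<< BL.readFile path)

main :: IO ()
main = printSums =<< getArgs
```

Run it as `runghc Sums.hs file1 file2 ...` (the file name is arbitrary). Swapping `checksum` for `show . md5` from pureMD5 should give the behaviour of the answer's code.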
If you are not quite sure whether you are consuming all the input but want to make sure the file gets closed anyway, then you can use the withFile function from System.IO:
mapM_ (\path -> withFile path ReadMode $ \hnd -> do
                  c <- BS.hGetContents hnd
                  putStrLn (fileLine path c))
The withFile function opens the file and passes the file handle to the body function. It guarantees that the file gets closed when the body returns. This "withBlah" pattern is very common when dealing with expensive resources. This resource pattern is directly supported by Control.Exception.bracket.
Lazy IO is very bug-prone.
As dons suggested, you should use strict IO.
You can use a tool such as Iteratee to help you structure strict IO code. My favorite tool for this job is monadic lists.
import Control.Monad (when)
import Control.Monad.ListT (ListT) -- List
import Control.Monad.IO.Class (liftIO) -- transformers
import Data.Binary (encode) -- binary
import Data.Digest.Pure.MD5 -- pureMD5
import Data.List.Class (repeat, takeWhile, foldlL) -- List
import System.IO (IOMode(ReadMode), openFile, hClose)
import qualified Data.ByteString.Lazy as BS
import Prelude hiding (repeat, takeWhile)
hashFile :: FilePath -> IO BS.ByteString
hashFile =
  fmap (encode . md5Finalize) . foldlL md5Update md5InitialContext . strictReadFileChunks 1024
strictReadFileChunks :: Int -> FilePath -> ListT IO BS.ByteString
strictReadFileChunks chunkSize filename =
  takeWhile (not . BS.null) $ do
    handle <- liftIO $ openFile filename ReadMode
    repeat () -- this makes the lines below loop
    chunk <- liftIO $ BS.hGet handle chunkSize
    when (BS.null chunk) . liftIO $ hClose handle
    return chunk
I used the "pureMD5" package here because "Crypto" doesn't seem to offer a "streaming" md5 implementation.
Monadic lists/ListT come from the "List" package on hackage (transformers' and mtl's ListT are broken and also don't come with useful functions like takeWhile).
NOTE: I've edited my code slightly to reflect the advice in Duncan Coutts's answer. Even after this edit his answer is obviously much better than mine, and doesn't seem to run out of memory in the same way.
Here's my quick attempt at an Iteratee-based version. When I run it on a directory with about 2,000 small (30-80K) files it's about 30 times faster than your version here and seems to use a bit less memory.
For some reason it still seems to run out of memory on very large files. I don't really understand Iteratee well enough yet to be able to tell why easily.
module Main where
import Control.Monad.State
import Data.Digest.Pure.MD5
import Data.List (sort)
import Data.Word (Word8)
import System.Directory
import System.FilePath ((</>))
import qualified Data.ByteString.Lazy as BS
import qualified Data.Iteratee as I
import qualified Data.Iteratee.WrappedByteString as IW
evalIteratee path = evalStateT (I.fileDriver iteratee path) md5InitialContext
iteratee :: I.IterateeG IW.WrappedByteString Word8 (StateT MD5Context IO) MD5Digest
iteratee = I.IterateeG chunk
  where
    chunk s@(I.EOF Nothing) =
      get >>= \ctx -> return $ I.Done (md5Finalize ctx) s
    chunk (I.Chunk c) = do
      modify $ \ctx -> md5Update ctx $ BS.fromChunks $ (:[]) $ IW.unWrap c
      return $ I.Cont (I.IterateeG chunk) Nothing
fileLine :: FilePath -> MD5Digest -> String
fileLine path c = show c ++ " " ++ path
main = mapM_ (\path -> putStrLn . fileLine path =<< evalIteratee path)
         =<< getRecursiveContents "."
getRecursiveContents :: FilePath -> IO [FilePath]
getRecursiveContents topdir = do
  names <- getDirectoryContents topdir
  let properNames = filter (`notElem` [".", ".."]) names
  paths <- concatForM properNames $ \name -> do
    let path = topdir </> name
    isDirectory <- doesDirectoryExist path
    if isDirectory
      then getRecursiveContents path
      else do
        isFile <- doesFileExist path
        if isFile
          then return [path]
          else return []
  return (sort paths)
concatForM :: (Monad m) => [a1] -> (a1 -> m [a]) -> m [a]
concatForM xs f = liftM concat (forM xs f)
Note that you'll need the iteratee package and TomMD's pureMD5. (And my apologies if I've done something horrifying here; I'm a beginner with this stuff.)
Edit: my assumption was that the user was opening thousands of very small files; it turns out they are very large. Laziness will be essential.
Well, you'll need to use a different IO mechanism. Either:
I'd also strongly recommend not using 'unpack', as that destroys the benefit of using bytestrings.
For example, you can replace your lazy IO with System.IO.Strict, yielding:
import qualified System.IO.Strict as S
getList :: FilePath -> IO [String]
getList p = mapM getFileLine =<< getRecursiveContents p
  where
    getFileLine path = liftM (\c -> (hex (hash c)) ++ " " ++ path)
                             (S.readFile path)
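Because a strict readFile loads each file into memory in full, files larger than RAM call for a chunked variant of strict IO. The following is a minimal sketch under stated assumptions rather than part of the original answer: it reads fixed-size strict chunks with Data.ByteString.hGet and folds them into a running state. `updateSum`, `checksumHandle`, `checksumFile` and the 64 KiB chunk size are illustrative choices, with `updateSum` playing the role that md5Update plays in pureMD5.

```haskell
import qualified Data.ByteString as B
import Data.Word (Word32)
import System.Environment (getArgs)
import System.IO (Handle, IOMode (ReadMode), withBinaryFile)

-- Assumed chunk size; any positive value works.
chunkSize :: Int
chunkSize = 64 * 1024

-- Hypothetical stand-in for md5Update: fold one strict chunk into the state.
updateSum :: Word32 -> B.ByteString -> Word32
updateSum = B.foldl' (\acc w -> acc * 31 + fromIntegral w)

-- Read strict chunks until EOF, keeping only the running state in memory.
checksumHandle :: Handle -> IO Word32
checksumHandle h = go 7
  where
    go acc = do
      chunk <- B.hGet h chunkSize
      if B.null chunk
        then return acc
        else go $! updateSum acc chunk

-- withBinaryFile closes the handle as soon as the loop finishes.
checksumFile :: FilePath -> IO Word32
checksumFile path = withBinaryFile path ReadMode checksumHandle

main :: IO ()
main = getArgs >>= mapM_ (\p -> checksumFile p >>= \s -> putStrLn (show s ++ "  " ++ p))
```

Memory use stays bounded by the chunk size regardless of file size, and no handle outlives the call to checksumFile.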
The problem is that mapM is not as lazy as you think: it results in a full list with one element per file path. And the file IO you are using is lazy, so you get a list with one open file per file path.
The simplest solution in this case is to force the evaluation of the hash for each file path. One way to do that is with Control.Exception.evaluate:
getFileLine path = do
  theHash <- liftM (\c -> (hex $ hash $ BS.unpack c) ++ " " ++ path) (BS.readFile path)
  evaluate theHash
As others have pointed out, we're working on a replacement for the current approach to lazy IO that is more general yet still simple.
EDIT: sorry, thought the problem was with the files, not directory reading/traversal. Ignore this.
No problem, just explicitly open the file (openFile), read the contents (Data.ByteString.Lazy.hGetContents), perform the md5 hash (let !h = md5 contents), and explicitly close the file (hClose).
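Spelled out as code, those four steps might look like the sketch below. It assumes the BangPatterns extension for the strict let, and uses a hypothetical `checksum` function in place of md5 so that it runs with only `base` and `bytestring`. Unlike withFile, this version does not close the handle if an exception is thrown midway.

```haskell
{-# LANGUAGE BangPatterns #-}
import qualified Data.ByteString.Lazy as BL
import Data.Word (Word32)
import System.Environment (getArgs)
import System.IO (IOMode (ReadMode), hClose, openBinaryFile)

-- Hypothetical stand-in for md5 (NOT a real hash function).
checksum :: BL.ByteString -> Word32
checksum = BL.foldl' (\acc w -> acc * 31 + fromIntegral w) 7

checksumFile :: FilePath -> IO Word32
checksumFile path = do
  h <- openBinaryFile path ReadMode   -- explicitly open
  contents <- BL.hGetContents h       -- lazy read, nothing consumed yet
  let !digest = checksum contents     -- the bang forces a full read here
  hClose h                            -- explicitly close
  return digest

main :: IO ()
main = getArgs >>= mapM_ (\p -> checksumFile p >>= \s -> putStrLn (show s ++ "  " ++ p))
```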
Yet another solution that comes to mind is to use unsafeInterleaveIO from System.IO.Unsafe. See the reply of Tomasz Zielonka in this thread in Haskell Cafe.
It defers an input-output operation (opening a file) until it is actually required. Thus it is possible to avoid opening all files at once, and instead read and process them sequentially (open them lazily).
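The deferral can be observed in isolation with a small experiment. In this sketch, `fakeOpen` is a hypothetical stand-in for "open and read a file": it only bumps a counter, so we can see when each action really runs.

```haskell
import Control.Exception (evaluate)
import Data.IORef (IORef, modifyIORef', newIORef, readIORef)
import System.IO.Unsafe (unsafeInterleaveIO)

-- Hypothetical stand-in for opening a file: records that it ran.
fakeOpen :: IORef Int -> Int -> IO Int
fakeOpen counter n = do
  modifyIORef' counter (+ 1)
  return (n * 2)

-- Returns how many actions had run before and after demanding two results.
deferredDemo :: IO (Int, Int)
deferredDemo = do
  counter <- newIORef 0
  -- mapM walks the whole list, but every action body is deferred.
  results <- mapM (unsafeInterleaveIO . fakeOpen counter) [1 .. 5]
  before <- readIORef counter
  -- Demanding the first two results runs exactly those two actions.
  mapM_ evaluate (take 2 results)
  after <- readIORef counter
  return (before, after)

main :: IO ()
main = deferredDemo >>= print  -- prints (0,2)
```

With real files the same effect means each file is opened only when its output line is demanded, which is what keeps the number of open handles small.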
Now, I believe, mapM getFileLine opens all files but does not start reading from them until putStr . unlines runs. Thus a lot of thunks holding open file handles float around, and this is the problem. (Please correct me if I am wrong.)
A modified example with unsafeInterleaveIO has now been running against a 100 GB directory for several minutes, in constant space.
getList :: FilePath -> IO [String]
getList p =
  let getFileLine path =
        liftM (\c -> (show . md5 $ c) ++ " " ++ path)
              (unsafeInterleaveIO $ BS.readFile path)
  in mapM getFileLine =<< getRecursiveContents p
(I switched to the pureMD5 implementation of the hash.)
PS I am not sure if this is good style. I believe that solutions with iteratees and strict IO are better, but this one is quicker to write. I use it in small scripts, but I'd be afraid of relying on it in a bigger program.