Lazily Reading a File in D
I'm writing a directory tree scanning function in D that combines tools such as grep and file: it should grep a file's contents only if the file does not match a set of magic bytes indicating file types such as ELF, images, etc.
What is the best approach to making such exclusion logic run as fast as possible with regard to minimizing file I/O? I typically don't want to read in the whole file if I only need to read some magic bytes at the beginning. However, to make the code more future-proof (some magics may lie at the end, or somewhere other than the beginning), it would be nice if I could use an mmap-like interface that lazily fetches data from the disk only when it is read. The array interface also simplifies my algorithms.
Is D's std.mmfile the best option in this case?
Update: According to this post, I guess mmap is advised: http://forum.dlang.org/thread/dlrwzrydzjusjlowavuc@forum.dlang.org
If I only need read access as an array (opIndex), are there any cons to using std.mmfile over std.stdio.File or std.file?
If you want to lazily read a file with Phobos, you pretty much have three options:
1. Use std.stdio.File's byLine and read a line at a time.
2. Use std.stdio.File's byChunk and read a particular number of bytes at a time.
3. Use std.mmfile.MmFile and operate on the file as an array, taking advantage of mmap under the hood to avoid reading in the whole file.
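As a rough illustration of the byChunk option, here is a minimal sketch that reads only the first chunk of a file and stops. The helper name leadingBytes is hypothetical, and the sketch assumes n is greater than zero (byChunk rejects a zero chunk size). Note that the C stdio layer underneath may still fetch a full buffer from disk, so this limits what your code sees rather than guaranteeing an n-byte disk read.

```d
import std.stdio : File;

/// Hypothetical helper: returns up to `n` leading bytes of the file at `path`.
/// Assumes n > 0.
ubyte[] leadingBytes(string path, size_t n)
{
    auto file = File(path, "rb");
    // byChunk reuses its buffer between iterations, so copy the first
    // chunk before returning it. We return on the first iteration, so
    // the rest of the file is never requested.
    foreach (ubyte[] chunk; file.byChunk(n))
        return chunk.dup;
    return null; // empty file: the range yields nothing
}
```

The returned slice can then be compared against whatever magic-byte table the scanner uses.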
I fully expect that #3 is going to be the fastest (profiling could prove differently, but I'd be very surprised given how fantastic mmap is). It's also probably the easiest to use, because you get an array to operate on. The only problem with MmFile that I'm aware of is that it's a class when it should arguably be a ref-counted struct, so that it would clean itself up when you were done. Right now, if you don't want to wait for the GC to clean it up, you have to manually call unmap on it or use destroy to destroy it without freeing its memory (though destroy should be used with caution). There may be some sort of downside to using mmap (which would then naturally mean that there was a downside to using MmFile), but I'm not aware of any.
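To make the MmFile approach concrete, here is a minimal sketch that checks a file's leading bytes through the mapped array and releases the mapping deterministically with destroy. The helper name startsWithMagic is hypothetical. The size check before mapping is a defensive assumption on my part, since mapping a zero-length file may fail on some platforms.

```d
import std.file : getSize;
import std.mmfile : MmFile;

/// Hypothetical helper: true if the file at `path` begins with `magic`.
bool startsWithMagic(string path, const(ubyte)[] magic)
{
    // Assumption: skip empty or too-short files instead of mapping them.
    immutable size = getSize(path);
    if (size == 0 || size < magic.length)
        return false;

    auto mmf = new MmFile(path);  // read-only mapping of the file
    scope (exit) destroy(mmf);    // run the destructor now, not at GC time

    // Slicing yields void[]; view it as bytes. The comparison finishes
    // before scope(exit) runs, so the slice is still valid here.
    auto head = cast(const(ubyte)[]) mmf[0 .. magic.length];
    return head == magic;
}
```

The same slicing works for any offset, so a magic number stored elsewhere in the file only needs a different range.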
In the future, we're going to end up with some range-based streaming I/O stuff, which might be closer to what you need without actually using mmap, but that hasn't been completed yet, and mmap is so incredibly cool that there's a good chance that it would still be better to use MmFile.
You can combine seek and rawRead of std.stdio.File to do what you want.
You can then do a rawRead for only the first few bytes:
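import std.stdio;

File file = //...
ubyte[1024] buff;
// rawRead fills the slice it is given and returns the part that was actually read
ubyte[] magic = file.rawRead(buff[0 .. 4]); // only the first 4 bytes are read
// check magic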
Then, depending on the OS's caching/read-ahead strategy, this can be nearly as fast as MmFile; however, multiple seeks will ruin the read-ahead behavior.
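For magic bytes that sit at the end of a file, the same seek and rawRead combination applies. The following is only a sketch under stated assumptions: the helper name trailingBytes is hypothetical, and it assumes a regular, seekable file.

```d
import std.stdio : File;

/// Hypothetical helper: returns up to the last `n` bytes of the file at `path`.
ubyte[] trailingBytes(string path, size_t n)
{
    auto file = File(path, "rb");
    immutable size = file.size;        // file length in bytes (ulong)
    if (size < n)
        n = cast(size_t) size;         // file is shorter than requested
    if (n == 0)
        return null;                   // rawRead rejects an empty buffer

    file.seek(cast(long)(size - n));   // jump straight to the tail
    auto buf = new ubyte[n];
    return file.rawRead(buf);          // slice of buf that was filled
}
```

A single seek per file like this should be cheap; it is repeated seeking within one file that tends to defeat read-ahead.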