
Lazily Reading a File in D

I'm writing a directory tree scanning function in D that combines the behavior of tools such as grep and file: it should grep a file's contents only if the file does not match a set of magic bytes indicating file types such as ELF, images, etc.

What is the best approach to making this exclusion logic run as fast as possible in terms of minimizing file I/O? I typically don't want to read in the whole file if I only need some magic bytes at the beginning. However, to make the code more general for the future (some magics may lie at the end, or somewhere other than the beginning), it would be nice if I could use an mmap-like interface that lazily fetches data from disk only when it is read. The array interface also simplifies my algorithms.

Is D's std.mmfile the best option in this case?

Update: According to this post, I guess mmap is advised: http://forum.dlang.org/thread/dlrwzrydzjusjlowavuc@forum.dlang.org

If I only need read access as an array (opIndex), are there any cons to using std.mmfile over std.stdio.File or std.file?

If you want to lazily read a file with Phobos, you pretty much have three options:

  1. Use std.stdio.File's byLine and read a line at a time.

  2. Use std.stdio.File's byChunk and read a particular number of bytes at a time.

  3. Use std.mmfile.MmFile and operate on the file as an array, taking advantage of mmap under the hood to avoid reading in the whole file.
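As a rough illustration of option 2, the sketch below pulls only the first chunk of a file through byChunk. The helper name leadingBytes is made up for this example, and it assumes n is greater than zero, since byChunk rejects a zero chunk size.

```d
import std.stdio : File;

/// Hypothetical helper: returns up to `n` leading bytes of the file at `path`.
/// Assumes n > 0.
ubyte[] leadingBytes(string path, size_t n)
{
    auto f = File(path, "rb");
    foreach (ubyte[] chunk; f.byChunk(n))
        return chunk.dup; // byChunk reuses its buffer, so copy what we keep
    return null;          // empty file: there was no chunk to read
}
```

Only the first chunk is requested here; how much the OS actually reads from disk still depends on its own buffering and read-ahead.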

I fully expect that #3 is going to be the fastest (profiling could prove otherwise, but I'd be very surprised given how fantastic mmap is). It's also probably the easiest to use, because you get an array to operate on. The only problem with MmFile that I'm aware of is that it's a class when it should arguably be a ref-counted struct, so that it would clean itself up when you were done. Right now, if you don't want to wait for the GC to clean it up, you'd have to manually call unmap on it or use destroy to destroy it without freeing its memory (though destroy should be used with caution). There may be some sort of downside to using mmap (which would then naturally mean that there was a downside to using MmFile), but I'm not aware of any.
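A minimal sketch of the MmFile approach might look like the following. The function name looksLikeElf is hypothetical, and the code assumes the file exists and is non-empty, since creating the mapping can throw otherwise (for example, mapping a zero-length file may fail on POSIX systems).

```d
import std.mmfile : MmFile;

/// Hypothetical helper: true if the file at `path` starts with the ELF magic.
bool looksLikeElf(string path)
{
    // 0x7F followed by the ASCII codes for 'E', 'L', 'F'
    static immutable ubyte[4] elfMagic = [0x7F, 0x45, 0x4C, 0x46];

    auto mm = new MmFile(path); // read-only mapping of the file
    scope (exit) destroy(mm);   // unmap now rather than waiting for the GC

    if (mm.length < elfMagic.length)
        return false;

    // Slicing returns void[]; view it as bytes for the comparison.
    auto head = cast(const(ubyte)[]) mm[0 .. elfMagic.length];
    return head == elfMagic[];
}
```

The comparison finishes before the scope (exit) runs, so no slice of the mapping is used after it has been destroyed. Keeping a slice of mm around after that point would be a bug, which is one reason to treat destroy with care.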

In the future, we're going to end up with some range-based streaming I/O stuff, which might be closer to what you need without actually using mmap, but that hasn't been completed yet, and mmap is so incredibly cool that there's a good chance that it would still be better to use MmFile.

You can combine seek and rawRead of std.stdio.File to do what you want.

You can then do a rawRead for only the first few bytes:

import std.stdio;

File file = //...

ubyte[1024] buff;
ubyte[] magic = file.rawRead(buff[0 .. 4]); // reads at most the first 4 bytes
// check magic

Then, depending on the OS's caching/read-ahead strategy, this can be nearly as fast as MmFile; however, multiple seeks will ruin the read-ahead behavior.
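For magic bytes that sit at the end of a file rather than the beginning, the same seek plus rawRead combination could be sketched as below. The helper name readTail is invented for this example; it returns early for an empty buffer or empty file because rawRead expects a non-empty buffer.

```d
import core.stdc.stdio : SEEK_END;
import std.stdio : File;

/// Hypothetical helper: reads the last `buf.length` bytes of the file at
/// `path`, or the whole file if it is shorter. Returns the filled slice.
ubyte[] readTail(string path, ubyte[] buf)
{
    auto f = File(path, "rb");
    immutable size = f.size;             // file length in bytes
    if (size < buf.length)
        buf = buf[0 .. cast(size_t) size];
    if (buf.length == 0)
        return buf;                      // nothing to read
    f.seek(-cast(long)(buf.length), SEEK_END);
    return f.rawRead(buf);
}
```

A caller might pass a small stack buffer, for example ubyte[8] tmp; auto tail = readTail(path, tmp[]);, and compare tail against the expected trailer. This is one seek per file, so it should interfere with read-ahead less than scattering many seeks across the same file.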
