
Java: fastest way to do random reads on huge disk file(s)

I've got a moderately big set of data, about 800 MB or so, that is basically a big precomputed table that I need to speed up some computation by several orders of magnitude (creating that file took several multicore machines days to produce, using an optimized, multi-threaded algorithm... I really do need that file).

Now that it has been computed once, that 800 MB of data is read-only.

I cannot hold it in memory.

As of now it is one big huge 800 MB file, but splitting it into smaller files isn't a problem if that can help.

I need to read about 32 bits of data here and there in that file, many times over. I don't know beforehand where I'll need to read these data: the reads are uniformly distributed.

What would be the fastest way in Java to do my random reads in such a file or files? Ideally I should be doing these reads from several unrelated threads (but I could queue the reads in a single thread if needed).

Is Java NIO the way to go?

I'm not familiar with 'memory-mapped files': I think I don't want to map the 800 MB in memory.

All I want is the fastest random reads I can get to access these 800 MB of disk-based data.

By the way, in case people wonder: this is not at all the same as the question I asked not long ago:

Java: fast disk-based hash set

800 MB is not that much to load up and store in memory. If you can afford to have multicore machines ripping away at a data set for days on end, you can afford an extra GB or two of RAM, no?

That said, read up on Java's java.nio.MappedByteBuffer. It is clear from your comment "I think I don't want to map the 800 MB in memory" that the concept is not clear.

In a nutshell, a mapped byte buffer lets you programmatically access the data as if it were in memory, although it may be on disk or in memory; that is for the OS to decide, as Java's MBB is based on the OS's virtual memory subsystem. It is also nice and fast. You will also be able to access a single MBB from multiple threads safely.

Here are the steps I recommend you take:

  1. Instantiate a MappedByteBuffer that maps your data file to the MBB. The creation is kinda expensive, so keep it around.
  2. In your look-up method...
    1. instantiate a byte[4] array
    2. call .get(byte[] dst, int offset, int length)
    3. the byte array will now have your data, which you can turn into a value

And presto! You have your data!
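A minimal sketch of those steps (my illustration, not code from the answer; the class name and file path are made up): it assumes the table is a flat array of big-endian 32-bit values, and it uses the absolute getInt(int) overload rather than the relative byte[] get, since absolute reads don't move the buffer's position and are therefore safe to share across threads:

    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    public class TableLookup {
        private final MappedByteBuffer table;

        // Map the whole file read-only; the OS pages data in on demand.
        public TableLookup(String path) throws Exception {
            try (RandomAccessFile raf = new RandomAccessFile(path, "r");
                 FileChannel ch = raf.getChannel()) {
                table = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            } // the mapping stays valid after the channel is closed
        }

        // Absolute read of one 32-bit value at a byte offset.
        // Thread-safe: getInt(int) does not touch the buffer's position.
        public int lookup(int byteOffset) {
            return table.getInt(byteOffset);
        }
    }

Note that a single mapping is capped at Integer.MAX_VALUE bytes (2 GB), which comfortably covers the 800 MB table; a larger file would need several mapped regions.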

I'm a big fan of MBBs and have used them successfully for such tasks in the past.

RandomAccessFile (blocking) may help: http://java.sun.com/javase/6/docs/api/java/io/RandomAccessFile.html

You can also use FileChannel.map() to map a region of the file to memory, then read from the MappedByteBuffer.

See also: http://java.sun.com/docs/books/tutorial/essential/io/rafs.html
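For completeness, a rough RandomAccessFile sketch (again my illustration; the file name table.bin is a placeholder). Note that seek() plus readInt() mutates a shared file pointer, so each thread would need its own instance, or the reads would have to be synchronized:

    import java.io.IOException;
    import java.io.RandomAccessFile;

    public class RafLookup {
        // Seek to a byte offset and read one big-endian 32-bit value.
        // Not thread-safe on a shared instance: seek() moves the file pointer.
        static int readIntAt(RandomAccessFile raf, long offset) throws IOException {
            raf.seek(offset);
            return raf.readInt();
        }

        public static void main(String[] args) throws IOException {
            // "table.bin" stands in for the 800 MB precomputed table.
            try (RandomAccessFile raf = new RandomAccessFile("table.bin", "r")) {
                System.out.println(readIntAt(raf, 1024L)); // value at byte offset 1024
            }
        }
    }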

Actually 800 MB isn't very big. If you have 2 GB of memory or more, it can reside in disk cache if not in your application itself.

For the write case, on Java 7, AsynchronousFileChannel should be looked at.

When performing random record-oriented writes across large files (exceeding physical memory, so caching isn't helping everything) on NTFS, I find that AsynchronousFileChannel performs over twice as many operations, in single-threaded mode, versus a normal FileChannel (on a 10 GB file, 160-byte records, completely random writes, some random content, several hundred iterations of the benchmarking loop to achieve steady state, roughly 5,300 writes per second).

My best guess is that because the asynchronous I/O boils down to overlapped I/O in Windows 7, the NTFS file system driver is able to update its own internal structures faster when it doesn't have to create a sync point after every call.

I micro-benchmarked against RandomAccessFile to see how it would perform (results are very close to FileChannel, and still half of the performance of AsynchronousFileChannel).

Not sure what happens with multi-threaded writes. This is on Java 7, on an SSD (the SSD is an order of magnitude faster than magnetic storage, and another order of magnitude faster on smaller files that fit in memory).

It will be interesting to see if the same ratios hold on Linux.
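To make the benchmarked pattern concrete, here is a minimal single-threaded sketch of the kind of positioned write being measured (file name, offset, and record content are made up; this is not the original benchmark code):

    import java.nio.ByteBuffer;
    import java.nio.channels.AsynchronousFileChannel;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;
    import java.util.concurrent.Future;

    public class AsyncWriteSketch {
        public static void main(String[] args) throws Exception {
            try (AsynchronousFileChannel ch = AsynchronousFileChannel.open(
                    Paths.get("records.bin"),                 // hypothetical data file
                    StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
                ByteBuffer record = ByteBuffer.allocate(160); // one 160-byte record (zero-filled here)
                long offset = 160L * 12_345;                  // hypothetical random record slot
                // write() returns immediately; the Future completes when the OS finishes,
                // so many writes can be kept in flight without a sync point per call.
                Future<Integer> pending = ch.write(record, offset);
                pending.get();                                // wait for this one write
            }
        }
    }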
