
Java - Reading A Binary File In Parallel

Original post: 2012-06-19 22:18:15. Tags: java, multithreading, file-io, random-access

I have a binary file that contains blocks of information (I'll refer to them as packets henceforth). Each packet consists of a fixed-length header and a variable-length body. I have to determine the length of the body from the packet header itself. My task is to read these packets from the file and perform some operation on them. Currently I'm performing this task as follows:

- Opening the file as a random access file and seeking to a user-specified start position.
- Reading the first packet from this position.
- Performing the specific operation.
- Then, in a loop:
  - reading the next packet
  - performing my operation

This goes on till I hit the end of the file.
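For reference, the serial approach described above might look roughly like the sketch below. The question does not give the header layout, so this assumes a 4-byte header that holds the body length as a big-endian int; `SerialPacketReader`, `readPacket` and `readAll` are made-up names for illustration.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

class SerialPacketReader {
    // Assumption: the fixed-length header is 4 bytes and stores the body length.
    static final int HEADER_LENGTH = 4;

    /** Reads the next packet body, or returns null when the end of the file is reached. */
    static byte[] readPacket(RandomAccessFile file) throws IOException {
        if (file.length() - file.getFilePointer() < HEADER_LENGTH) {
            return null; // not enough bytes left for another header
        }
        int bodyLength = file.readInt(); // big-endian 4-byte int (assumed header format)
        byte[] body = new byte[bodyLength];
        file.readFully(body); // throws EOFException if the body is truncated
        return body;
    }

    /** Reads every packet from startPosition to the end of the file, one after another. */
    static List<byte[]> readAll(File path, long startPosition) throws IOException {
        List<byte[]> packets = new ArrayList<>();
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            file.seek(startPosition);
            byte[] packet;
            while ((packet = readPacket(file)) != null) {
                packets.add(packet); // real code would perform its operation here
            }
        }
        return packets;
    }
}
```

Because each body length is only known after its header has been read, the packet boundaries can only be found by walking the file in order, which is why the reading step itself is hard to split across threads.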

As you can guess, when the file is huge, reading each packet serially and processing it is a time-consuming affair. I want to somehow parallelize this operation, i.e. the packet generation, put the packets in some blocking queue, and then retrieve each packet from the queue in parallel and perform my operation.

Can someone suggest how I may generate these packets in parallel?

3 answers

You should only have one thread read the file sequentially, since I'm assuming the file lies on a single drive. Reading the file is limited by your IO speed, so there's no point in parallelizing that on the CPU. In fact, reading non-sequentially will significantly decrease your performance, since regular hard drives are designed for sequential IO. For each packet it reads in, the reader thread should put that object into a thread-safe queue.

Now you can start parallelizing the processing of the packets. Create multiple threads and have each of them take packets from the queue. Each thread should do its processing and put the result into some "finished" queue.

Once the IO thread has finished reading the file, a flag should be set so that the worker threads stop once the queue is empty.
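A minimal sketch of this single-reader, multiple-worker layout is below. It is only one way to do it: the `Iterable<byte[]>` stands in for whatever code reads packets from the file, `process` is a placeholder for the real operation, and the class name, queue capacity and poll timeout are all arbitrary choices for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

class ParallelPacketProcessor {
    /** Placeholder for the real per-packet operation: here it just sums the bytes. */
    static int process(byte[] packet) {
        int sum = 0;
        for (byte b : packet) {
            sum += b;
        }
        return sum;
    }

    static List<Integer> processAll(Iterable<byte[]> packetSource, int workerCount)
            throws InterruptedException {
        BlockingQueue<byte[]> pending = new ArrayBlockingQueue<>(1024); // thread-safe queue
        Queue<Integer> finished = new ConcurrentLinkedQueue<>();        // "finished" queue
        AtomicBoolean readerDone = new AtomicBoolean(false);            // the stop flag

        ExecutorService workers = Executors.newFixedThreadPool(workerCount);
        for (int i = 0; i < workerCount; i++) {
            workers.execute(() -> {
                try {
                    while (true) {
                        byte[] packet = pending.poll(50, TimeUnit.MILLISECONDS);
                        if (packet != null) {
                            finished.add(process(packet));
                        } else if (readerDone.get() && pending.isEmpty()) {
                            return; // reader has finished and nothing is left to take
                        }
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        // In this sketch the calling thread plays the role of the single IO thread.
        for (byte[] packet : packetSource) {
            pending.put(packet); // blocks while the queue is full
        }
        readerDone.set(true);

        workers.shutdown();
        workers.awaitTermination(1, TimeUnit.MINUTES);
        return new ArrayList<>(finished); // results are in completion order, not file order
    }
}
```

If the results have to come out in file order, each packet would need to carry its sequence number so the finished results can be sorted afterwards.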

If you are using a disk with platters (i.e. not an SSD), there is no point in having more than one thread read the file, since all you will do is thrash the disk, causing the disk arm to introduce millisecond delays. If you have an SSD it's a different story, and you could parallelise the reading.

Instead you should have one thread reading the data from the file and creating the packets, and doing the following for each packet:

- wait on a shared semaphore 'A' (which has been initialised to some number that will be your 'max buffered packets' count)
- lock a shared object
- append the packet to a LinkedList, then release the lock
- signal another shared semaphore 'B' (this one tracks the count of packets in the buffer)

Then you can have many other threads doing the following:

- wait on the 'B' semaphore (to ensure there is a packet to be processed)
- lock the shared object
- do removeFirst() on the LinkedList (getFirst() would return the packet without removing it), store the packet in a local variable, then release the lock
- signal semaphore 'A' to allow another packet into the buffered packet list

This will ensure you are reading packets as fast as possible (from a platter disk) by reading them in one continuous sequence, and it will ensure that you are processing multiple packets at once without any polling.
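The two-semaphore scheme above could be sketched as follows. The class and method names are made up for illustration, and the `LinkedList` itself is used as the shared lock object.

```java
import java.util.LinkedList;
import java.util.concurrent.Semaphore;

class SemaphorePacketBuffer {
    private final Semaphore freeSlots;                            // semaphore 'A'
    private final Semaphore availablePackets = new Semaphore(0);  // semaphore 'B'
    private final LinkedList<byte[]> packets = new LinkedList<>();

    SemaphorePacketBuffer(int maxBufferedPackets) {
        this.freeSlots = new Semaphore(maxBufferedPackets);
    }

    /** Called by the single reader thread; blocks while the buffer is full. */
    void put(byte[] packet) throws InterruptedException {
        freeSlots.acquire();            // wait on 'A'
        synchronized (packets) {        // lock the shared object
            packets.addLast(packet);
        }
        availablePackets.release();     // signal 'B'
    }

    /** Called by the worker threads; blocks while the buffer is empty. */
    byte[] take() throws InterruptedException {
        availablePackets.acquire();     // wait on 'B'
        byte[] packet;
        synchronized (packets) {        // lock the shared object
            packet = packets.removeFirst();
        }
        freeSlots.release();            // signal 'A'
        return packet;
    }
}
```

This is essentially a hand-rolled bounded blocking queue; `java.util.concurrent.ArrayBlockingQueue` offers the same blocking `put` and `take` behaviour out of the box.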

I think the known fast way is to use java.nio.MappedByteBuffer.
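As a rough illustration of that suggestion, the sketch below maps the file into memory and walks the packets through the buffer. It again assumes a 4-byte big-endian length header, which the question does not actually specify, and `MappedPacketReader` is a made-up name.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

class MappedPacketReader {
    // Assumption: the fixed-length header is 4 bytes and stores the body length.
    static final int HEADER_LENGTH = 4;

    static List<byte[]> readAll(Path path, long startPosition) throws IOException {
        List<byte[]> packets = new ArrayList<>();
        try (FileChannel channel = FileChannel.open(path, StandardOpenOption.READ)) {
            long size = channel.size() - startPosition;
            // Map everything from the start position to the end of the file, read-only.
            MappedByteBuffer buffer =
                    channel.map(FileChannel.MapMode.READ_ONLY, startPosition, size);
            while (buffer.remaining() >= HEADER_LENGTH) {
                int bodyLength = buffer.getInt(); // buffers are big-endian by default
                byte[] body = new byte[bodyLength];
                buffer.get(body);                 // copy the body out of the mapping
                packets.add(body);
            }
        }
        return packets;
    }
}
```

A single mapping cannot be larger than `Integer.MAX_VALUE` bytes, so a really huge file would have to be mapped in several windows. Whether this is actually faster than plain sequential reads depends on the workload, so it is worth measuring.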