
Concurrent Reads with a Java MappedByteBuffer

I'm trying to use a MappedByteBuffer to allow concurrent reads on a file by multiple threads with the following constraints:

  • File is too large to load into memory
  • Threads must be able to read asynchronously (it's a web app)
  • The file is never written to by any thread
  • Every thread will always know the exact offset and length of bytes it needs to read (i.e., no "seeking" by the app itself).

According to the docs ( https://docs.oracle.com/javase/8/docs/api/java/nio/Buffer.html ) Buffers are not thread-safe since they keep internal state (position, etc). Is there a way to have concurrent random access to the file without loading it all into memory?

Although FileChannel is technically thread-safe, from the docs:

Where the file channel is obtained from an existing stream or random access file then the state of the file channel is intimately connected to that of the object whose getChannel method returned the channel. Changing the channel's position, whether explicitly or by reading or writing bytes, will change the file position of the originating object, and vice versa

So it would seem that it's simply synchronized. If I were to call new RandomAccessFile().getChannel().map() in each thread [edit: on every read], then doesn't that incur the I/O overhead on each read that MappedByteBuffers are supposed to avoid?
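To make the concern concrete, here is a minimal sketch (class and method names are my own) of the alternative I'm weighing: map the file once at startup, then give each reading thread its own duplicate() view. duplicate() copies only the position/limit/mark bookkeeping and shares the underlying mapping, so no per-read mapping cost is incurred:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class SharedMapping {
    private final MappedByteBuffer master;

    public SharedMapping(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            // Map once; per the FileChannel docs, the mapping remains
            // valid even after the channel is closed.
            master = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
        }
    }

    /**
     * Safe to call from many threads: duplicate() creates a cheap view
     * with its own position/limit, while the master buffer is never
     * mutated (its position/limit stay at their initial values).
     */
    public byte[] read(int offset, int length) {
        ByteBuffer view = master.duplicate();
        view.position(offset).limit(offset + length);
        byte[] out = new byte[length];
        view.get(out);
        return out;
    }
}
```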

Rather than using multiple threads for concurrent reads, I'd go with this approach (based on an example with a huge CSV file whose lines have to be sent concurrently via HTTP):

Reading a single file at multiple positions concurrently wouldn't let you go any faster (but it could slow you down considerably).

Instead of reading the file from multiple threads, read the file from a single thread, and parallelize the processing of these lines. A single thread should read your CSV line-by-line, and put each line in a queue. Multiple working threads should then take the next line from the queue, parse it, convert to a request, and process the request concurrently as needed. The splitting of the work would then be done by a single thread, ensuring that there are no missing lines or overlaps.

If you can read the file line by line, LineIterator from Commons IO is a memory-efficient possibility. If you have to work with chunks, your MappedByteBuffer seems to be a reasonable approach. For the queue, I'd use a blocking queue with a fixed capacity (such as ArrayBlockingQueue) to better control the memory usage (lines/chunks in queue + lines/chunks among workers = lines/chunks in memory).
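A minimal sketch of that single-reader/multi-worker pattern, assuming a Consumer&lt;String&gt; callback does the per-line work (class, method, and capacity values are illustrative, not from the question):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

public class SingleReaderQueue {
    // Sentinel object; compared by reference, so it can never collide with file data.
    private static final String POISON = new String("<EOF>");

    /** Reads the file line-by-line on the calling thread; workers process lines concurrently. */
    public static void processLines(Path file, int workers, Consumer<String> handler)
            throws IOException, InterruptedException {
        // Bounded queue => backpressure: the reader blocks when workers fall behind,
        // so memory stays capped at (queue capacity + workers) lines.
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(1000);
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int i = 0; i < workers; i++) {
            pool.submit(() -> {
                try {
                    String line;
                    while ((line = queue.take()) != POISON) {
                        handler.accept(line); // parse, build request, send, etc.
                    }
                    queue.put(POISON); // forward the sentinel so the other workers stop too
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        // Single reader: no missing lines, no overlaps.
        try (BufferedReader reader = Files.newBufferedReader(file)) {
            String line;
            while ((line = reader.readLine()) != null) {
                queue.put(line); // blocks when the queue is full
            }
        }
        queue.put(POISON);
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
    }
}
```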

FileChannel supports a read operation without synchronization. It natively uses pread on Linux:

public abstract int read(ByteBuffer dst, long position) throws IOException

Here is what the FileChannel documentation says:

...Other operations, in particular those that take an explicit position, may proceed concurrently; whether they in fact do so is dependent upon the underlying implementation and is therefore unspecified.

It is fairly primitive in that it only returns the number of bytes actually read, which may be fewer than requested. But I think you can still make use of it, given your assumption that "Every thread will always know the exact offset and length of bytes it needs to read".
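A hedged sketch of using the positional read under that assumption (the class and helper name are mine): open a single FileChannel once and share it across all threads, looping to cope with short reads:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

public class PositionalReads {
    /**
     * Reads exactly {@code length} bytes starting at {@code offset}.
     * read(dst, position) never touches the channel's shared position,
     * so concurrent callers on the same channel do not interfere.
     */
    public static byte[] readAt(FileChannel channel, long offset, int length) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(length);
        // read(dst, position) may return fewer bytes than requested, hence the loop.
        while (buf.hasRemaining()) {
            int n = channel.read(buf, offset + buf.position());
            if (n < 0) {
                throw new IOException("EOF before reading " + length + " bytes at offset " + offset);
            }
        }
        return buf.array(); // heap buffer of exactly `length` bytes, now fully filled
    }
}
```

Each thread would then call, e.g., readAt(sharedChannel, offset, length) with the offset/length it already knows, with no synchronization of its own.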
