
HDFS buffered write/read operations

I am using the HDFS Java API with FSDataOutputStream and FSDataInputStream to write files to and read files from a 4-machine Hadoop 2.6.0 cluster.

The FS stream implementations take a bufferSize constructor parameter, which I assume controls the stream's internal cache. But it seems to have no effect at all on the write/read speed, regardless of its value (I tried values from 8 KB up to several megabytes).

Is there a way to achieve buffered writes/reads to the HDFS cluster other than wrapping FSDataOutputStream/FSDataInputStream in BufferedOutputStream/BufferedInputStream?
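For reference, here is a minimal sketch of the two approaches mentioned above; the file path, buffer sizes, and class name are placeholders, not values from the original setup:

    import java.io.BufferedOutputStream;
    import java.io.OutputStream;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path path = new Path("/tmp/example.dat"); // hypothetical path

            byte[] chunk = new byte[8 * 1024];

            // 1) Pass the bufferSize argument directly to create():
            //    FSDataOutputStream create(Path f, boolean overwrite, int bufferSize)
            try (FSDataOutputStream out = fs.create(path, true, 1024 * 1024)) {
                out.write(chunk);
            }

            // 2) Wrap the HDFS stream in a java.io.BufferedOutputStream instead:
            try (OutputStream out =
                     new BufferedOutputStream(fs.create(path, true), 1024 * 1024)) {
                out.write(chunk);
            }

            fs.close();
        }
    }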

I have found the answer.

The bufferSize parameter of FileSystem.create() actually corresponds to io.file.buffer.size, which the documentation describes as:

"The size of buffer for use in sequence files. The size of this buffer should probably be a multiple of hardware page size (4096 on Intel x86), and it determines how much data is buffered during read and write operations."

According to the book "Hadoop: The Definitive Guide", a good starting point is 128 KB.
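A minimal sketch of setting this programmatically, assuming the 128 KB starting point quoted above (the path and class name are placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BufferSizeSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // io.file.buffer.size is the default buffer size used by
            // create()/open() when no explicit bufferSize argument is given.
            conf.setInt("io.file.buffer.size", 128 * 1024);

            FileSystem fs = FileSystem.get(conf);
            Path path = new Path("/tmp/example.dat"); // hypothetical path

            // open(Path, int) overrides the configured value per call.
            try (FSDataInputStream in = fs.open(path, 128 * 1024)) {
                byte[] buf = new byte[4096];
                while (in.read(buf) != -1) {
                    // process the data ...
                }
            }
            fs.close();
        }
    }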

As for the client-side internal cache: Hadoop transmits data in packets (64 KB by default). The packet size can be tuned with the dfs.client-write-packet-size option in the hdfs-site.xml configuration. For my purposes I used 4 MB.
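The same key can also be set on the client's Configuration before creating the stream. A rough sketch, using the 4 MB value mentioned above (the path and class name are placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class PacketSizeSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Default packet size is 64 KB; raise it to 4 MB as described above.
            conf.setInt("dfs.client-write-packet-size", 4 * 1024 * 1024);

            FileSystem fs = FileSystem.get(conf);
            Path path = new Path("/tmp/example.dat"); // hypothetical path

            try (FSDataOutputStream out = fs.create(path, true)) {
                out.write(new byte[1024 * 1024]);
            }
            fs.close();
        }
    }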
