
How compression works in Hadoop

In my MR job, say I specify LZO compression for either the map or reduce output. How does it actually get compressed? Is the entire output of the map or reduce task first produced uncompressed and then compressed at the end, or is it compressed incrementally as it is written? If it is compressed incrementally, how is that done? Please help me understand this.

Thanks,

Venkat

It basically depends on the file type you use. If it is a plain text file, compression happens at the whole-file level. But if it is a SequenceFile, compression can be at the record level or the block level. Note that here "block" means the buffer used by the SequenceFile, not the HDFS block.

With block compression, multiple records are compressed into a block at once. Records are added to a block until it reaches a minimum size in bytes. The maximum amount of input data that can be compressed at a time is calculated by subtracting the compression algorithm's maximum overhead from the buffer size. The default buffer size is 512 bytes, and the compression overhead for the zlib algorithm is 18 bytes (1% of the buffer size plus 12 bytes). A BlockCompressorStream is then created with the given output stream and compressor, and the compressed data is written through it.
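To make this concrete, here is a minimal sketch (mine, not part of the original answer) of writing a block-compressed SequenceFile through the public API. The output path, key/value types, and the choice of DefaultCodec (zlib) are illustrative assumptions:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.compress.DefaultCodec;

    public class BlockCompressedWriter {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Path out = new Path("/tmp/block-compressed.seq"); // hypothetical path

            try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(out),
                    SequenceFile.Writer.keyClass(IntWritable.class),
                    SequenceFile.Writer.valueClass(Text.class),
                    // BLOCK: records are buffered and compressed together once
                    // the block reaches the configured minimum size
                    SequenceFile.Writer.compression(
                            SequenceFile.CompressionType.BLOCK, new DefaultCodec()))) {
                for (int i = 0; i < 1000; i++) {
                    writer.append(new IntWritable(i), new Text("record-" + i));
                }
            }
        }
    }

With CompressionType.RECORD instead of BLOCK, each value would be compressed individually rather than in buffered batches.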

Hope this answers the question to some extent.

I thought I would add a little more detail to Tariq's answer by explaining, at a higher level, where compression fits into the MapReduce pipeline. Hopefully it is helpful.

If you specify compression for the map stage (mapreduce.map.output.compress=true), the intermediate map output will be compressed using whatever codec you've specified (mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.*) and saved to disk when each map task completes (or earlier, if the map task exceeds its serialization buffer limit and begins to spill to disk). The compressed data is then read from disk and sent to the appropriate nodes during the Shuffle & Sort stage of your MapReduce job.
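For reference, here is a minimal sketch (not from the answer itself) of how those two properties might be set in a job driver. The job name and the choice of SnappyCodec are placeholder assumptions:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.SnappyCodec;
    import org.apache.hadoop.mapreduce.Job;

    public class MapOutputCompressionExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Compress the intermediate (map-side) output before it is
            // spilled to disk and shuffled across the network.
            conf.setBoolean("mapreduce.map.output.compress", true);
            conf.setClass("mapreduce.map.output.compress.codec",
                    SnappyCodec.class, CompressionCodec.class);

            Job job = Job.getInstance(conf, "map-output-compression-demo");
            // ... set mapper, reducer, and input/output paths as usual ...
        }
    }

The same properties can equally be set cluster-wide in mapred-site.xml or per job on the command line with -D.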

At this stage (map output) there is no benefit to the compressed data being splittable, so GZIP and Snappy are worth trying here as well as LZO and BZIP2. GZIP typically achieves better compression ratios on most data but is heavy on CPU, while Snappy is faster with a lower compression ratio (i.e. it has lower latency and doesn't consume the CPU as heavily as GZIP; I'm not positive on the exact reason). Using data generated from teragen, the compression ratios of GZIP and Snappy were 3.5x and 2.5x respectively. Obviously, your data and your hardware limitations will dictate which codec is most beneficial in your situation.

Compression before the Shuffle & Sort stage is helpful because it reduces disk I/O and network bandwidth, since you're sending the data across the wire compressed. I can't think of a good reason not to compress data at this stage, so long as the CPU resources to do so are not being contended for. On my small 10-node Hadoop cluster running on a 1 Gb network, turning on compression for just the map output (i.e. the intermediate map data before the Shuffle & Sort stage was compressed; the final output was not) improved the overall job time of a 100 GB terasort by 41% (GZIP) and 45% (Snappy) versus not using compression. The data in those experiments was generated using teragen. Your results will vary based on your data set, hardware, and network, of course.

The compressed data is then decompressed at the start of the reduce phase.

Compression comes into play once more at the end of the reduce phase, for the final output (mapreduce.output.fileoutputformat.compress=true). This is where a splittable codec such as LZO or BZIP2 might be useful if you are feeding the output into another MapReduce job. If you don't use a splittable compression codec on the output and then run a job on that data, only a single mapper can be used, which defeats one of the major benefits of Hadoop: parallelization. One way around this, while still using something like the GZIP codec, is to write the output as a sequence file. A sequence file is splittable because it is essentially a series of compressed blocks appended together, so it can be split at the boundaries between those blocks.
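As an illustration (my sketch, under the assumption that a follow-up job will read SequenceFiles), the final output could be written as a block-compressed SequenceFile so that a non-splittable codec like GZIP can still be consumed in parallel. The class and method names here are hypothetical:

    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

    public class CompressedOutputExample {
        static void configureOutput(Job job) {
            job.setOutputFormatClass(SequenceFileOutputFormat.class);
            // Equivalent to mapreduce.output.fileoutputformat.compress=true
            FileOutputFormat.setCompressOutput(job, true);
            FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
            // Compress whole blocks of records rather than each record; the
            // file stays splittable at the block boundaries.
            SequenceFileOutputFormat.setOutputCompressionType(job,
                    SequenceFile.CompressionType.BLOCK);
        }
    }

A downstream job can then read this output with SequenceFileInputFormat and get one split per block boundary rather than a single mapper for the whole file.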
