
How compression works in Hadoop

In my MR job, let us say I specify LZO compression for either the map or reduce output. How does it get compressed? Is the entire output of the map or reduce task first produced uncompressed and then compressed at the end, or is it compressed and written incrementally? If it is compressed and written incrementally, how is that done? Please help me understand this.

Thanks,

Venkat

It basically depends on the file type you use. If it is a text file then compression happens at the file level. But if it is a SequenceFile then compression can be at the record level or the block level. Note that block here means the buffer used by the sequence file, not the HDFS block.
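To make the record vs. block distinction concrete, here is a minimal sketch (not part of the original answer; the class name, path, and key/value types are illustrative) of writing a SequenceFile with block-level compression:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.compress.DefaultCodec;

    public class SeqFileCompressionSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // CompressionType.RECORD compresses each value on its own;
            // CompressionType.BLOCK buffers several records and compresses them together.
            SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(new Path("/tmp/example.seq")),   // illustrative path
                    SequenceFile.Writer.keyClass(IntWritable.class),
                    SequenceFile.Writer.valueClass(Text.class),
                    SequenceFile.Writer.compression(
                            SequenceFile.CompressionType.BLOCK, new DefaultCodec()));
            writer.append(new IntWritable(1), new Text("first record"));
            writer.close();
        }
    }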

If it is block compression then multiple records are compressed into a block at once. Records are added to a block until it reaches a minimum size in bytes. The maximum size of input data to be compressed at a time is calculated by subtracting the maximum overhead of the compression algorithm from the buffer size. The default buffer size is 512 bytes, and the compression overhead is 18 bytes (1% of bufferSize + 12 bytes) for the zlib algorithm. A BlockCompressorStream is then created with the given output stream and compressor, and the compressed data gets written.
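As a quick worked example of that arithmetic (the numbers follow the text above; the exact rounding can differ between Hadoop versions):

    public class CompressionBufferMath {
        public static void main(String[] args) {
            int bufferSize = 512;                                    // default buffer size in bytes
            int overhead = (int) Math.ceil(bufferSize * 0.01) + 12;  // zlib worst case: ~18 bytes
            int maxInputSize = bufferSize - overhead;                // uncompressed bytes buffered per compress call
            System.out.println("overhead=" + overhead + ", maxInputSize=" + maxInputSize);
        }
    }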

Hope this answers the question to some extent.

I thought I would add a little more detail to Tariq's answer, by explaining where compression fits into the mapreduce pipeline at a higher level. Hopefully it is helpful.

If you specify compression for the map stage (mapreduce.map.output.compress=true), the intermediate map output data will be compressed using whatever codec you've specified (mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.*) and saved to disk when each map task completes (or earlier, if the map task exceeds its serialization buffer limit and begins to spill to disk). The compressed data is then read from the disk and sent to the appropriate nodes during the Shuffle & Sort stage of your mapreduce job.
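For reference, here is a minimal driver-side sketch of setting those two properties (the class name is illustrative, and SnappyCodec is just one possible choice):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.SnappyCodec;
    import org.apache.hadoop.mapreduce.Job;

    public class MapOutputCompressionSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Compress the intermediate map output with Snappy.
            conf.setBoolean("mapreduce.map.output.compress", true);
            conf.setClass("mapreduce.map.output.compress.codec",
                    SnappyCodec.class, CompressionCodec.class);
            Job job = Job.getInstance(conf, "map-output-compression-example");
            // ... set mapper, reducer, input and output paths as usual ...
        }
    }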

At this stage (map output) the compressed data gains no benefit from being splittable, so the GZIP and Snappy codecs are worth trying here as well as LZO and BZIP2. GZIP typically has better compression ratios for most data but heavily consumes the CPU, whilst Snappy is faster with a lower compression ratio (i.e. it either has less latency or doesn't consume the CPU as heavily as GZIP... I'm not positive on the reason). Using data generated from teragen, the compression ratios of GZIP and Snappy are 3.5x and 2.5x respectively. Obviously, your data and your hardware limitations will dictate which codec is most beneficial in your situation.

Compression before the shuffle & sort stage is helpful in that it reduces disk IO and reduces network bandwidth, since you're sending the data compressed across the wire. I can't think of a good reason not to compress data at this stage so long as the CPU resources to do so are not being contended for. In my little 10-node Hadoop cluster running on a 1 Gb network, turning on compression for just the map output phase (i.e. the intermediate map data before the shuffle & sort stage was compressed; the final output was not compressed) improved the overall job time of a 100 GB terasort job by 41% (GZIP) and 45% (Snappy) versus not using compression. The data in those experiments was generated using teragen. Your results will vary based on your data set, hardware, and network, of course.

The compressed data is then decompressed at the start of the reduce phase.

Compression comes into play once more at the end of the reduce phase for the final output (mapreduce.output.fileoutputformat.compress=true). This is where splittable LZO or BZIP2 compression might be useful if you are feeding the output into another mapreduce job. If you don't use a splittable compression codec on the output and run a job on that data, only a single mapper can be used, which defeats one of the major benefits of Hadoop: parallelization. One way to get around this and still use something like the GZIP codec is to create a sequence file for the output. A sequence file is splittable because it is essentially a series of compressed files appended together; it can be split at the boundaries where each file is appended to the next.
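A hedged sketch of that approach, writing the final output as a block-compressed SequenceFile so that a non-splittable codec such as GZIP still yields splittable output (the helper and class names are illustrative):

    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

    public class FinalOutputCompressionSketch {
        static void configureOutputCompression(Job job) {
            // Equivalent to mapreduce.output.fileoutputformat.compress=true
            FileOutputFormat.setCompressOutput(job, true);
            FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
            // Write a SequenceFile and compress whole blocks of records so the file stays splittable.
            job.setOutputFormatClass(SequenceFileOutputFormat.class);
            SequenceFileOutputFormat.setOutputCompressionType(job,
                    SequenceFile.CompressionType.BLOCK);
        }
    }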
