
When to use compression

The question is in the title: when is it good to use compression? By "good" I mean faster processing.

My pipeline consists of multiple MR jobs, and intermediate results are stored in sequence files.

The data is numeric (time series). Also, the output of a job often has the same size as its input, so the amount of data transferred and stored can be large.

I would like to know whether I can expect a speedup from compression, or whether compressing and decompressing the data will cost more time than it saves.

It is almost always a good idea to enable compression of intermediate data with a fast codec (read: Snappy). You won't be penalized much even if your data is incompressible.
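For reference, here is a rough sketch of the settings this usually comes down to. The property names are the Hadoop 2.x+ ones (older releases used different `mapred.*` keys), so check them against your version. The helper `to_generic_options` is a hypothetical name, not part of Hadoop. The `-D` form only takes effect if the job driver goes through `ToolRunner` / `GenericOptionsParser`; otherwise set the same keys on the job `Configuration` or in `mapred-site.xml`.

```python
# Sketch only: Hadoop 2.x+ property names, assuming the Snappy native
# library is installed on the cluster nodes.
SNAPPY = "org.apache.hadoop.io.compress.SnappyCodec"

INTERMEDIATE_COMPRESSION = {
    # Compress map output before it is spilled to disk and shuffled.
    "mapreduce.map.output.compress": "true",
    "mapreduce.map.output.compress.codec": SNAPPY,
    # Compress the job output (the sequence files read by the next job).
    "mapreduce.output.fileoutputformat.compress": "true",
    "mapreduce.output.fileoutputformat.compress.codec": SNAPPY,
    # BLOCK compresses batches of records together instead of one by one.
    "mapreduce.output.fileoutputformat.compress.type": "BLOCK",
}


def to_generic_options(props):
    """Render properties as -D key=value arguments (hypothetical helper)."""
    args = []
    for key, value in props.items():
        args += ["-D", f"{key}={value}"]
    return args


if __name__ == "__main__":
    # Paste the printed flags after the main class in a `hadoop jar` call.
    print(" ".join(to_generic_options(INTERMEDIATE_COMPRESSION)))
```

Block-compressed sequence files stay splittable, which is why the `BLOCK` type is the usual choice for intermediate results between jobs.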

Compression won't hurt your job as long as you know what you are trying to achieve; just make sure your compressed data is splittable. I found the bzip2 format more convenient in terms of compression ratio and CPU usage, but it is better to do in-house testing with different formats on your own data set.
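A quick way to do that kind of in-house test, before touching the cluster, is to time a few codecs on a sample of your data. The sketch below uses only codecs from the Python standard library (zlib and bz2) as stand-ins, since Snappy and LZO need third-party packages. The synthetic series from `make_series` is an assumption made for illustration; replace it with bytes read from your real records.

```python
import bz2
import math
import struct
import time
import zlib


def make_series(n=100_000):
    """Synthetic time series (made up for this sketch): a slow sine wave
    rounded to 2 decimals and packed as little-endian float64."""
    values = [round(20.0 + 5.0 * math.sin(i / 500.0), 2) for i in range(n)]
    return struct.pack(f"<{n}d", *values)


# name -> (compress function, decompress function)
CODECS = {
    "zlib-1": (lambda b: zlib.compress(b, 1), zlib.decompress),
    "zlib-6": (lambda b: zlib.compress(b, 6), zlib.decompress),
    "bz2": (bz2.compress, bz2.decompress),
}


def benchmark(data, codecs=CODECS):
    """Return compressed/original size ratio and timings for each codec."""
    results = {}
    for name, (compress, decompress) in codecs.items():
        t0 = time.perf_counter()
        packed = compress(data)
        t1 = time.perf_counter()
        restored = decompress(packed)
        t2 = time.perf_counter()
        assert restored == data  # sanity check: the round trip is lossless
        results[name] = {
            "ratio": len(packed) / len(data),
            "compress_s": t1 - t0,
            "decompress_s": t2 - t1,
        }
    return results


if __name__ == "__main__":
    for name, r in benchmark(make_series()).items():
        print(f"{name:8s} ratio={r['ratio']:.3f} "
              f"compress={r['compress_s']:.3f}s "
              f"decompress={r['decompress_s']:.3f}s")
```

The absolute timings will not match what Hadoop's native codecs achieve, but the relative ordering of ratio versus CPU cost on your own data is usually informative.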

Compression gives two major benefits.

1) Less disk space is used while running a MapReduce job (intermediate output and final output are compressed).
2) Job performance increases, since compressed data is sent across the cluster nodes during the shuffle phase.
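Whether the second benefit turns into a real speedup depends on how the time saved on the wire compares with the CPU time spent compressing and decompressing. A back-of-the-envelope model, assuming the three steps run one after another with no overlap (real jobs overlap CPU and I/O, so this is pessimistic), might look like the sketch below. All function names and numbers are made up for illustration.

```python
def transfer_time(size_mb, bandwidth_mb_s):
    """Seconds to move uncompressed data."""
    return size_mb / bandwidth_mb_s


def compressed_transfer_time(size_mb, bandwidth_mb_s, ratio,
                             compress_mb_s, decompress_mb_s):
    """Seconds to compress, move and decompress the same data.

    ratio is compressed size / original size; codec speeds are measured
    in MB of uncompressed data per second.
    """
    return (size_mb / compress_mb_s
            + size_mb * ratio / bandwidth_mb_s
            + size_mb / decompress_mb_s)


def break_even_ratio(bandwidth_mb_s, compress_mb_s, decompress_mb_s):
    """Compression wins in this model only if the ratio is below this value."""
    return 1.0 - bandwidth_mb_s * (1.0 / compress_mb_s + 1.0 / decompress_mb_s)


if __name__ == "__main__":
    # Invented numbers: 1000 MB over a 100 MB/s link, codec running at
    # 250 MB/s compression and 500 MB/s decompression.
    print(transfer_time(1000, 100))                             # 10.0
    print(compressed_transfer_time(1000, 100, 0.5, 250, 500))   # 11.0
    print(compressed_transfer_time(1000, 100, 0.25, 250, 500))  # 8.5
    print(break_even_ratio(100, 250, 500))                      # about 0.4
```

With these invented numbers, compression pays off only if the data shrinks to under roughly 40% of its original size. A faster codec or a slower network moves that threshold up, which is the usual argument for a fast codec on intermediate data.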

Hope that helps.
