
What is the lz4 split limit?

This question says that the lz4 compression format is splittable and suitable for use in HDFS. OK, I have compressed 1.5 GB of data into a 300 MB lz4 file. If I try to read this file via Spark, what is the maximum number of workers it can create to read the file in parallel? Does the number of splittable pieces depend on the lz4 compression level?
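A quick way to answer this empirically is to check how many partitions Spark actually creates for the file, since one read task is scheduled per partition. A minimal sketch, assuming a hypothetical HDFS path:

```scala
// Paste into spark-shell, where sc (SparkContext) is predefined.
// The path is hypothetical; point it at your 300 MB lz4 file.
val rdd = sc.textFile("hdfs:///data/input.lz4")

// Spark schedules one read task per partition, so this number is the
// upper bound on workers reading the file in parallel.
println(s"partitions = ${rdd.getNumPartitions}")
```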

Compression will not impact the number of splittable pieces.

If the input file is compressed, then fewer bytes are read in from HDFS, which means less time spent reading data. This time saving benefits job execution performance.

A compression codec that is splittable definitely matters in Hadoop processing. I disagree with the previous answer. When you say splittable, it essentially means you can have a mapper program that reads a logical split and processes the data without worrying about the other parts of the file, compressed with some algorithm and stored elsewhere across the datanode cluster.
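As a concrete check: Hadoop marks a codec as splittable by having it implement the SplittableCompressionCodec interface (in stock Hadoop, BZip2Codec implements it, while the standard Lz4Codec does not; splittable lz4 typically comes from container formats or third-party codecs). A minimal sketch that resolves codecs by file extension, assuming the standard Hadoop codec classes are on the classpath:

```scala
// Paste into spark-shell (Hadoop classes are on its classpath).
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.compress.{CompressionCodecFactory, SplittableCompressionCodec}

val factory = new CompressionCodecFactory(new Configuration())

// Resolve each codec from the file extension, as the input formats do,
// and report whether Hadoop considers it splittable.
for (name <- Seq("data.gz", "data.bz2", "data.lz4")) {
  Option(factory.getCodec(new Path(name))) match {
    case Some(codec) =>
      val splittable = codec.isInstanceOf[SplittableCompressionCodec]
      println(s"$name -> ${codec.getClass.getSimpleName}, splittable = $splittable")
    case None =>
      println(s"$name -> no codec registered for this extension")
  }
}
```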

For example, think about a Windows zip file. If I had a 10 GB file and planned to zip it with a maximum split size of 100 MB each, I might create ten files of 100 MB each (1 GB compressed in total). Can you write a program to process one part of the file without unzipping the whole file back to its original state? That is the difference between splittable and unsplittable compression codecs in the Hadoop context. For example, .gz is not splittable whereas bzip2 is. Even if you have a .gz file in Hadoop, you will have to first uncompress the whole file on your datanode and then run the program against that single file. This is not efficient and doesn't use the power of Hadoop's parallelism.
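To see that difference from Spark, reading the same data stored as .gz versus .bz2 shows up directly in the partition count. A minimal sketch with hypothetical paths:

```scala
// Paste into spark-shell; both paths are hypothetical and assumed to
// hold the same data compressed two different ways.
val gz  = sc.textFile("hdfs:///data/logs.gz")
val bz2 = sc.textFile("hdfs:///data/logs.bz2")

// gzip cannot be split, so Spark is forced into a single partition;
// bzip2 can be split on block boundaries, giving several partitions.
println(s"gz  partitions = ${gz.getNumPartitions}")
println(s"bz2 partitions = ${bz2.getNumPartitions}")
```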

A lot of people confuse splitting a compressed file into multiple parts on Windows or Linux with splitting a file in Hadoop using compression codecs.

Let's come back to the discussion of why splittable compression matters. Hadoop essentially relies on mappers and reducers, and each mapper works on a logical split of the file (not the physical block). If I stored the file without splittability, the mapper would have to uncompress the whole file before performing any operation on a record.
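Since parallelism is bounded by the number of logical splits, you can influence it at read time; for example, sc.textFile takes a minPartitions hint that Hadoop can honor for a splittable input. A minimal sketch with a hypothetical path:

```scala
// Paste into spark-shell; the path is hypothetical.
// Ask for at least 12 input splits. For a splittable codec Hadoop can
// honor the hint; a non-splittable file still yields one partition.
val rdd = sc.textFile("hdfs:///data/input.bz2", minPartitions = 12)
println(s"partitions = ${rdd.getNumPartitions}")
```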

So be aware that the input split is directly correlated with parallel processing in Hadoop.
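One practical consequence: if you are stuck with a non-splittable file, the read itself is a single task, but you can repartition afterwards so downstream stages still run in parallel. A minimal sketch with a hypothetical path:

```scala
// Paste into spark-shell; the path is hypothetical.
// The .gz read happens in a single task...
val raw = sc.textFile("hdfs:///data/big.gz")

// ...but a shuffle redistributes the records, so later stages run
// across 16 tasks instead of one.
val parallel = raw.repartition(16)
println(s"before = ${raw.getNumPartitions}, after = ${parallel.getNumPartitions}")
```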
