
What is lz4 split limit?

This question implies that the lz4 compression format is splittable and therefore suitable for use in HDFS. OK, I have compressed 1.5 GB of data into a 300 MB lz4 file. If I read this file with Spark, what is the maximum number of workers it can create to read the file in parallel? Does the number of splittable pieces depend on the lz4 compression level?

The compression level will not impact the number of splittable pieces.

If the input file is compressed, then the bytes read from HDFS are reduced, which means less time spent reading data. This time saving benefits job execution performance.
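As a rough sketch of why the split count depends on file size and block size rather than compression level, the following mirrors the split-size formula used by Hadoop's FileInputFormat (`splitSize = max(minSize, min(maxSize, blockSize))`), simplified here with illustrative names:

```python
def split_count(file_size, block_size=128 * 1024 * 1024,
                min_size=1, max_size=None):
    """Estimate how many input splits a splittable file yields.

    Simplified version of Hadoop's FileInputFormat logic:
    splitSize = max(minSize, min(maxSize, blockSize)).
    """
    if max_size is None:
        max_size = file_size
    split_size = max(min_size, min(max_size, block_size))
    # ceiling division: every started split counts
    return -(-file_size // split_size)

# A 300 MB splittable file with a 128 MB block size -> 3 splits,
# regardless of the compression level that produced those 300 MB.
print(split_count(300 * 1024 * 1024))  # -> 3
```

So for the 300 MB file in the question, a default 128 MB block size would yield about 3 splits; compressing harder would shrink the file (and thus possibly the split count), but the level itself plays no role in the formula.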

Whether a compression codec is splittable definitely matters in Hadoop processing, and I disagree with the previous answer. "Splittable" essentially means you can have a mapper program that reads a logical split and processes its data without worrying about the other parts of the file, stored elsewhere in the datanode cluster under some compression algorithm.

For example, think about a Windows zip file. If I had a 10 GB file and planned to zip it with a maximum part size of 100 MB, I might create 10 files of 100 MB each (10 GB compressed to 1 GB in total). Can you write a program that processes one part of the file without unzipping the whole archive back to its original state? That is the difference between a splittable and a non-splittable compression codec in the Hadoop context. For example, .gz is not splittable, whereas bzip2 is. If you have a .gz file in Hadoop, you first have to decompress the whole file on a single datanode and then run the program against that single file. This is not efficient and does not use the power of Hadoop's parallelism.
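A small, self-contained illustration of the idea, using Python's gzip module as a stand-in (block-based container formats for lz4 work on the same principle): a reader that knows a block boundary can decompress one block independently, while jumping into the middle of a single compressed stream fails because there is no valid header there.

```python
import gzip

# Two records compressed as independent members, then concatenated --
# this independent-block structure is what makes a format splittable.
block1 = gzip.compress(b"records for split 1\n")
block2 = gzip.compress(b"records for split 2\n")
container = block1 + block2

# A reader that starts at a known block boundary can process its
# split without touching the other block.
print(gzip.decompress(container[len(block1):]))  # second split only

# Starting in the middle of a compressed stream does not work:
# there is no valid gzip header at an arbitrary offset.
try:
    gzip.decompress(container[10:len(block1)])
except gzip.BadGzipFile:
    print("cannot start mid-stream -> not splittable at arbitrary offsets")
```

The splittable codecs (and container formats such as sequence files) record or make findable these block boundaries, which is exactly what lets each mapper start cleanly at its own split.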

A lot of people confuse splitting a compressed file into multiple parts on Windows or Linux with splitting a file in Hadoop using compression codecs.

Let's come back to why compression with splitting matters. Hadoop essentially relies on mappers and reducers, and each mapper works on a logical split of the file (not the physical block). If I had stored the file with a non-splittable codec, a single mapper would first have to decompress the whole file before performing any operation on its records.
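To make "each mapper works on a logical split" concrete, here is a simplified sketch of the convention Hadoop's LineRecordReader follows for line-oriented data (names and the in-memory byte string are illustrative): a split discards its leading partial line (it belongs to the previous split) and reads one line past its end to finish its last record, so every record is processed exactly once even when a split boundary falls mid-record.

```python
import io

def read_split(data: bytes, start: int, end: int):
    """Read the lines belonging to one logical split of `data`.

    Simplified LineRecordReader convention: a non-first split skips
    its first (possibly partial) line, and every split keeps reading
    while its position is <= end, finishing the record it started.
    """
    f = io.BytesIO(data)
    f.seek(start)
    if start > 0:
        f.readline()  # owned by the previous split; discard it
    lines = []
    while f.tell() <= end:
        line = f.readline()
        if not line:
            break
        lines.append(line)
    return lines

data = b"rec1\nrec2\nrec3\nrec4\n"
# Two splits covering the whole file, with the boundary falling
# in the middle of "rec2": each record still appears exactly once.
print(read_split(data, 0, 7) + read_split(data, 7, len(data)))
# -> [b'rec1\n', b'rec2\n', b'rec3\n', b'rec4\n']
```

This only works because the mapper can seek to an arbitrary offset and read raw bytes, which is possible for plain or splittable-compressed files but not for a single .gz stream.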

So be aware that input split is directly correlated with parallel processing in Hadoop.
