
What is lz4 split limit?

This question implies that the lz4 compression format is splittable and therefore suitable for use in HDFS. OK, I have compressed 1.5 GB of data into a 300 MB lz4 file. If I read this file with Spark, what is the maximum number of workers it can create to read the file in parallel? Does the number of splittable pieces depend on the lz4 compression level?

The compression level will not impact the number of splittable pieces.

If the input file is compressed, then the bytes read from HDFS are reduced, which means less time spent reading data. This time saving benefits job execution performance.
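As a rough sketch of why the split count depends on file size and block size rather than compression level, the following mirrors the split-size formula used by Hadoop's FileInputFormat (`splitSize = max(minSize, min(maxSize, blockSize))`), simplified here with illustrative names:

```python
def split_count(file_size, block_size=128 * 1024 * 1024,
                min_size=1, max_size=None):
    """Estimate how many input splits a splittable file yields.

    Simplified version of Hadoop's FileInputFormat logic:
    splitSize = max(minSize, min(maxSize, blockSize)).
    """
    if max_size is None:
        max_size = file_size
    split_size = max(min_size, min(max_size, block_size))
    # ceiling division: every started split counts
    return -(-file_size // split_size)

# A 300 MB splittable file with a 128 MB block size -> 3 splits,
# regardless of the compression level that produced those 300 MB.
print(split_count(300 * 1024 * 1024))  # -> 3
```

So for the 300 MB file in the question, a default 128 MB block size would yield about 3 splits; compressing harder would shrink the file (and thus possibly the split count), but the level itself plays no role in the formula.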

Whether a compression codec is splittable definitely matters in Hadoop processing, and I disagree with the previous answer. "Splittable" essentially means you can have a mapper program that reads a logical split and processes its data without worrying about the other parts of the file, stored elsewhere in the datanode cluster under some compression algorithm.

For example, think about a Windows zip file. If I had a 10 GB file and planned to zip it with a maximum part size of 100 MB, I might create 10 files of 100 MB each (10 GB compressed to 1 GB in total). Can you write a program that processes one part of the file without unzipping the whole archive back to its original state? That is the difference between a splittable and a non-splittable compression codec in the Hadoop context. For example, .gz is not splittable, whereas bzip2 is. If you have a .gz file in Hadoop, you first have to decompress the whole file on a single datanode and then run the program against that single file. This is not efficient and does not use the power of Hadoop's parallelism.
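A small, self-contained illustration of the idea, using Python's gzip module as a stand-in (block-based container formats for lz4 work on the same principle): a reader that knows a block boundary can decompress one block independently, while jumping into the middle of a single compressed stream fails because there is no valid header there.

```python
import gzip

# Two records compressed as independent members, then concatenated --
# this independent-block structure is what makes a format splittable.
block1 = gzip.compress(b"records for split 1\n")
block2 = gzip.compress(b"records for split 2\n")
container = block1 + block2

# A reader that starts at a known block boundary can process its
# split without touching the other block.
print(gzip.decompress(container[len(block1):]))  # second split only

# Starting in the middle of a compressed stream does not work:
# there is no valid gzip header at an arbitrary offset.
try:
    gzip.decompress(container[10:len(block1)])
except gzip.BadGzipFile:
    print("cannot start mid-stream -> not splittable at arbitrary offsets")
```

The splittable codecs (and container formats such as sequence files) record or make findable these block boundaries, which is exactly what lets each mapper start cleanly at its own split.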

A lot of people confuse splitting a compressed file into multiple parts on Windows or Linux with splitting a file in Hadoop using compression codecs.

Let's come back to why compression with splitting matters. Hadoop essentially relies on mappers and reducers, and each mapper works on a logical split of the file (not the physical block). If I had stored the file with a non-splittable codec, a single mapper would first have to decompress the whole file before performing any operation on its records.
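To make "each mapper works on a logical split" concrete, here is a simplified sketch of the convention Hadoop's LineRecordReader follows for line-oriented data (names and the in-memory byte string are illustrative): a split discards its leading partial line (it belongs to the previous split) and reads one line past its end to finish its last record, so every record is processed exactly once even when a split boundary falls mid-record.

```python
import io

def read_split(data: bytes, start: int, end: int):
    """Read the lines belonging to one logical split of `data`.

    Simplified LineRecordReader convention: a non-first split skips
    its first (possibly partial) line, and every split keeps reading
    while its position is <= end, finishing the record it started.
    """
    f = io.BytesIO(data)
    f.seek(start)
    if start > 0:
        f.readline()  # owned by the previous split; discard it
    lines = []
    while f.tell() <= end:
        line = f.readline()
        if not line:
            break
        lines.append(line)
    return lines

data = b"rec1\nrec2\nrec3\nrec4\n"
# Two splits covering the whole file, with the boundary falling
# in the middle of "rec2": each record still appears exactly once.
print(read_split(data, 0, 7) + read_split(data, 7, len(data)))
# -> [b'rec1\n', b'rec2\n', b'rec3\n', b'rec4\n']
```

This only works because the mapper can seek to an arbitrary offset and read raw bytes, which is possible for plain or splittable-compressed files but not for a single .gz stream.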

So be aware that input split is directly correlated with parallel processing in Hadoop.
