
Efficient LZ4 compression of multiple files using Java

I took Adrien Grand's Java repository, which provides JNI bindings to the original LZ4 native code.

I want to compress multiple files under a given input directory, but LZ4 doesn't support multi-file compression the way the Java zip package does. So I tried another approach: tar all my input files and pipe the result into the LZ4 compressor. I used the JTar Java package to tar the input files. Is there a better way than this?

I came across many code samples that compress strings and show how to correctly implement the LZ4 compressor and decompressor. Now I want to know how to actually implement it for multiple files, and whether I'm going in the right direction.

After tarring all the files, the sample code's usage explanation says I have to convert the tar file to a byte array in order to pass it to the compressor module. I used the Apache Commons IO package (IOUtils) for this. Since I have many input files, which results in a huge tar file, always converting it to a byte array seems inefficient to me. Is this approach efficient or not? Or is there a better way of using the LZ4 package?

Another problem I came across was the end result. After compressing the tarred files I get a file like MyResult.lz4 as output, but I was not able to decompress it using the archive manager (I'm using Ubuntu) because it doesn't support this format. I'm also not clear about which archive and compression format I should use here, or what format the end result should be in.

Speaking from a user's point of view: suppose I'm generating a backup for the user. If I provide a traditional .zip, .gz or other well-known format, the user can decompress it on their own. Just because I know LZ4 doesn't mean I can expect the user to know the format too, right? They may even be baffled on seeing it. Converting from .lz4 to .zip also seems pointless.

I already see tarring all my input files as a time-consuming step, so I wanted to know how much it affects performance. With the Java zip package, compressing multiple input files didn't seem to be a problem at all. Besides LZ4 I also came across Apache Commons Compress and TrueZIP, as well as several Stack Overflow links about them that helped me learn a lot.

For now I really want to use LZ4 for compression, especially because of its performance, but I ran into these hurdles. Can anyone with good knowledge of the LZ4 package provide solutions to all my questions and problems, along with a simple implementation? Thanks.

Timings I measured for an input consisting of many files:
Time taken for tarring: 4704 ms
Time taken for converting the file to a byte array: 7 ms
Time taken for compression: 33 ms

Some facts:

  1. LZ4 is no different here than GZIP: it is a single-concern project, dealing with compression. It does not deal with archive structure. This is intentional.
  2. Adrien Grand's LZ4 lib produces output incompatible with the command-line LZ4 utility. This is also intentional.
  3. Your approach with tar seems OK because that's how it's done with GZIP.

Ideally you should make the tar code produce a stream which is immediately compressed instead of first being entirely stored in RAM. This is what is achieved at the command line using Unix pipes.
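As a rough illustration of the streaming idea above, the following sketch copies an input stream through a compressing stream using a small fixed buffer, so the full payload is never held in RAM. It rests on some assumptions: it uses the JDK's GZIPOutputStream as a stand-in (GZIP being the comparison made above, and needing no extra dependency), the class and method names are hypothetical, and the demo in main uses in-memory streams only to stay self-contained. With an LZ4 stream implementation the wrapping would look the same, and in real use the source and sink would be file or tar streams.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper class; the names are illustrative only.
class StreamingCompressSketch {

    // Reads `in` in 8 KiB chunks and writes each chunk straight into a
    // compressing stream layered over `sink`.
    // Returns the number of uncompressed bytes that were read.
    static long compress(InputStream in, OutputStream sink) throws IOException {
        byte[] buffer = new byte[8192];
        long total = 0;
        // GZIP is a stand-in here; an LZ4 stream would be wrapped the same way.
        try (OutputStream compressed = new GZIPOutputStream(sink)) {
            int n;
            while ((n = in.read(buffer)) != -1) {
                compressed.write(buffer, 0, n);
                total += n;
            }
        }
        return total;
    }

    // Reverse direction: decompresses `source` into `out`, again chunk by chunk.
    // Returns the number of decompressed bytes written.
    static long decompress(InputStream source, OutputStream out) throws IOException {
        byte[] buffer = new byte[8192];
        long total = 0;
        try (InputStream plain = new GZIPInputStream(source)) {
            int n;
            while ((n = plain.read(buffer)) != -1) {
                out.write(buffer, 0, n);
                total += n;
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Highly repetitive demo payload of 1 MiB.
        byte[] original = new byte[1 << 20];
        Arrays.fill(original, (byte) 'a');

        ByteArrayOutputStream packed = new ByteArrayOutputStream();
        compress(new ByteArrayInputStream(original), packed);

        ByteArrayOutputStream restored = new ByteArrayOutputStream();
        decompress(new ByteArrayInputStream(packed.toByteArray()), restored);

        System.out.println("compressed bytes: " + packed.size());
        System.out.println("round trip ok: "
                + Arrays.equals(original, restored.toByteArray()));
    }
}
```

Only the 8 KiB buffer lives in memory at any time, regardless of how large the input is. That is the same property a Unix pipe gives you on the command line.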

I had the same problem. The current release of LZ4 for Java is incompatible with the later-developed LZ4 standard for handling streams. However, the project's repo contains a patch that supports the standard for compressing and decompressing streams, and I can confirm it is compatible with the command-line tool. You can find it here: https://github.com/jpountz/lz4-java/pull/61

In Java you can use that together with TarArchiveInputStream from Apache Commons Compress.

If you want an example, the code I use is in the Maven artifact io.github.htools 0.27-SNAPSHOT (or on GitHub). The classes io.github.htools.io.compressed.TarLz4FileWriter and (the obsolete class) io.github.htools.io.compressed.TarLz4File show how it works. In HTools, tar and lz4 are used automatically through ArchiveFile.getReader(String filename) and ArchiveFileWriter(String filename, int compressionlevel), provided your filename ends with .tar.lz4.

You can chain I/O streams together, using something like the tar archive stream from Apache Commons Compress and the LZ4 frame stream from lz4-java:

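import java.io.FileOutputStream;
import net.jpountz.lz4.LZ4FrameOutputStream;
import org.apache.commons.compress.archivers.tar.TarArchiveOutputStream;

// Tar entries written to taos are LZ4-compressed on the fly and written to the file.
try (LZ4FrameOutputStream outputStream = new LZ4FrameOutputStream(new FileOutputStream("path/to/myfile.tar.lz4"));
     TarArchiveOutputStream taos = new TarArchiveOutputStream(outputStream)) {

   ...

}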

Consolidating the bytes into a byte array will cause a bottleneck, because you are now trying to hold the entire stream in memory, which can easily run into OutOfMemory problems with large streams. Instead, you'll want to pipeline the bytes through all the I/O streams as shown above.

I created a Java library that does this for you: https://github.com/spoorn/tar-lz4-java

If you want to implement it yourself, here's a technical doc that includes details on how to LZ4-compress a directory using TarArchive from Apache Commons and lz4-java: https://github.com/spoorn/tar-lz4-java/blob/main/SUMMARY.md#lz4
