
Efficient LZ4 multiple file compression using Java

I took Adrien Grand's Java repository, which provides JNI bindings to the original native LZ4 code.

I want to compress multiple files under a given input directory, but LZ4 does not support multi-file compression the way the Java zip package does. So I tried another approach: tar all my input files and pipe the tar as input to the LZ4 compressor. I used the JTar Java package to tar the input files. Is there a better way than this?

I came across many code samples that compress strings and show how to correctly implement the LZ4 compressor and decompressor. Now I want to know how to actually implement it for multiple files, and whether I'm going in the right direction.

After tarring all the files, according to the sample code's usage notes I have to convert the tar file to a byte array before passing it to the compressor. I used the Apache Commons IO utilities for this. Since I have many input files, the resulting tar is huge, and always converting it to a byte array seems inefficient to me. First, is this approach efficient or not? Or is there a better way of using the LZ4 package?

Another problem I ran into is the end result. After compressing the tar file I get an output file like MyResult.lz4, but I cannot decompress it with the archive manager (I'm using Ubuntu) because it doesn't support this format. I'm also not clear about which archive and compression format I should use here, or what format the end result should be in.

Speaking from a user's point of view: suppose I'm generating a backup for the user. If I give them a traditional .zip, .gz, or any other well-known format, they can decompress it themselves. Just because I know LZ4 doesn't mean I can expect the user to know the format too, right? They may even be baffled on seeing it. Converting from .lz4 to .zip afterwards also seems pointless.

I already see tarring all my input files as a time-consuming step, so I'd like to know how much it affects performance. With the Java zip package, compressing multiple input files didn't seem to be a problem at all. Besides LZ4, I also came across Apache Commons Compress and TrueZIP, and several Stack Overflow links about them that taught me a lot.

For now I really want to use LZ4 for compression, especially for its performance, but I've hit these hurdles. Can anyone with good knowledge of the LZ4 package answer these questions and provide a simple implementation? Thanks.

Times I measured for an input consisting of many files:
Time taken for tarring: 4704 ms
Time taken for converting file to byte array: 7 ms
Time taken for compression: 33 ms

Some facts:

  1. LZ4 is no different from GZIP here: it is a single-concern project that deals with compression. It does not deal with archive structure. This is intentional.
  2. Adrien Grand's LZ4 lib produces output incompatible with the command-line LZ4 utility. This is also intentional.
  3. Your approach with tar seems OK because that's how it's done with GZIP.
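
To illustrate the split between archive structure and compression described in these facts, here is a minimal sketch using only the JDK. It assumes a made-up toy container format (not tar) and uses GZIP as a stand-in compressor, since the JDK has neither tar nor LZ4 built in. The class name ToyArchive and its methods are hypothetical. The archive layer records names and lengths; the compression layer underneath only ever sees one byte stream.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Toy container format (NOT tar): an entry count, then for each entry
// a UTF-8 name, an int length, and the raw bytes.
final class ToyArchive {

    // Archive layer on top, compression layer underneath.
    // Closing the archive stream also closes rawOut.
    static void write(Map<String, byte[]> entries, OutputStream rawOut) throws IOException {
        try (DataOutputStream archive = new DataOutputStream(new GZIPOutputStream(rawOut))) {
            archive.writeInt(entries.size());
            for (Map.Entry<String, byte[]> e : entries.entrySet()) {
                archive.writeUTF(e.getKey());          // entry name
                archive.writeInt(e.getValue().length); // entry size
                archive.write(e.getValue());           // entry content
            }
        }
    }

    // Reads the entries back in the order they were written.
    static Map<String, byte[]> read(InputStream rawIn) throws IOException {
        Map<String, byte[]> entries = new LinkedHashMap<>();
        try (DataInputStream archive = new DataInputStream(new GZIPInputStream(rawIn))) {
            int count = archive.readInt();
            for (int i = 0; i < count; i++) {
                String name = archive.readUTF();
                byte[] content = new byte[archive.readInt()];
                archive.readFully(content);
                entries.put(name, content);
            }
        }
        return entries;
    }
}
```

The toy keeps each entry in memory for brevity. A real tar writer would stream each file's bytes, but the layering is the same: swapping the compressor does not change the archive code.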

Ideally you should make the tar code produce a stream that is compressed immediately, instead of first being stored entirely in RAM. This is what Unix pipes achieve at the command line.
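
As a rough sketch of that streaming idea, the following copies a file through a compressing stream in fixed-size chunks, so memory use stays bounded however large the input is. It uses java.util.zip.GZIPOutputStream purely as a JDK stand-in for an LZ4 stream. The class name StreamingCompress, its method names, and the 64 KiB buffer size are illustrative assumptions.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper: GZIP stands in for an LZ4 stream so the sketch needs only the JDK.
final class StreamingCompress {
    private static final int BUFFER_SIZE = 64 * 1024; // arbitrary chunk size

    // Copies everything from in to out through one reusable buffer;
    // returns the number of bytes copied.
    static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[BUFFER_SIZE];
        long total = 0;
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
            total += n;
        }
        return total;
    }

    // Compresses source into target without loading the whole file into memory.
    static long compress(Path source, Path target) throws IOException {
        try (InputStream in = Files.newInputStream(source);
             OutputStream out = new GZIPOutputStream(Files.newOutputStream(target))) {
            return copy(in, out);
        }
    }

    // Reverses compress(), again streaming chunk by chunk.
    static long decompress(Path source, Path target) throws IOException {
        try (InputStream in = new GZIPInputStream(Files.newInputStream(source));
             OutputStream out = Files.newOutputStream(target)) {
            return copy(in, out);
        }
    }
}
```

Only one buffer's worth of data is held at any moment, which is the same property a Unix pipe gives you between tar and a compressor.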

I had the same problem. The current release of LZ4 for Java is incompatible with the later-developed LZ4 standard for handling streams. However, the project's repo has a patch that supports the standard for compressing and decompressing streams, and I can confirm it is compatible with the command-line tool. You can find it here: https://github.com/jpountz/lz4-java/pull/61

In Java you can use that together with TarArchiveInputStream from Apache Commons Compress.

If you want an example, the code I use is in the Maven artifact io.github.htools 0.27-SNAPSHOT (or on GitHub). The classes io.github.htools.io.compressed.TarLz4FileWriter and (the obsolete class) io.github.htools.io.compressed.TarLz4File show how it works. In HTools, tar and LZ4 are used automatically through ArchiveFile.getReader(String filename) and ArchiveFileWriter(String filename, int compressionlevel), provided your filename ends with .tar.lz4.

You can chain I/O streams together. For example, using the tar archive classes from Apache Commons Compress and LZ4 from lz4-java:

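import java.io.FileOutputStream;
import net.jpountz.lz4.LZ4FrameOutputStream;
import org.apache.commons.compress.archivers.tar.TarArchiveOutputStream;

// Bytes flow from the tar stream through the LZ4 frame compressor into the file.
try (LZ4FrameOutputStream outputStream = new LZ4FrameOutputStream(new FileOutputStream("path/to/myfile.tar.lz4"));
     TarArchiveOutputStream taos = new TarArchiveOutputStream(outputStream)) {

   ...

}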

Consolidating the bytes into a byte array will cause a bottleneck, because you are then holding the entire stream in memory, which can easily lead to OutOfMemory errors with large streams. Instead, pipeline the bytes through all the I/O streams as shown above.

I created a Java library that does this for you: https://github.com/spoorn/tar-lz4-java

If you want to implement it yourself, here's a technical doc with details on how to LZ4-compress a directory using TarArchive from Apache Commons and lz4-java: https://github.com/spoorn/tar-lz4-java/blob/main/SUMMARY.md#lz4
