
Writing a random access file transparently to a zip file

Original post 2012-09-06 12:33:24 9 3 java/ zip/ random-access

I have a Java application that writes a RandomAccessFile to the file system. It has to be a RAF because some things are not known until the end, at which point I seek back and write some information at the start of the file.

I would like to somehow put the file into a zip archive. I guess I could just do this at the end, but that would involve copying all the data written so far. Since these files can potentially grow very large, I would prefer a way that does not involve copying the data.

Is there some way to get something like a "ZipRandomAccessFile", along the lines of the ZipOutputStream that is available in the JDK?

It doesn't have to be JDK-only; I don't mind pulling in third-party libraries to get the job done.

Any ideas or suggestions?

3 solutions

Maybe you need to change the file format so it can be written sequentially.

In fact, since it is a Zip and a Zip can contain multiple entries, you could write the sequential data to one ZipEntry and the data known 'only at completion' to a separate ZipEntry, which gives the best of both worlds.

It is easy to write, since you do not have to go back to the beginning of the large sequential chunk. It is easy to read: if the consumer needs to know the 'header' data before reading the larger resource, they can read the data in that zip entry before proceeding.
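
As a rough illustration of the two-entry idea, here is a minimal sketch using java.util.zip. The class name, the entry names ("data.bin", "header.txt") and the header contents (a record count and byte total) are all made up for the example; your real header would hold whatever is only known at the end.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

final class TwoEntryZipSketch {
    // Hypothetical entry names; nothing in the zip format requires them.
    static final String DATA_ENTRY = "data.bin";
    static final String HEADER_ENTRY = "header.txt";

    /** Streams the bulk data first, then writes the summary that is only known at the end. */
    static void write(OutputStream out, Iterable<byte[]> records) throws IOException {
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            long recordCount = 0;
            long totalBytes = 0;

            // Entry 1: the large sequential payload, written strictly front to back.
            zip.putNextEntry(new ZipEntry(DATA_ENTRY));
            for (byte[] record : records) {
                zip.write(record);
                recordCount++;
                totalBytes += record.length;
            }
            zip.closeEntry();

            // Entry 2: the "header" data, written last so no seek-back is needed.
            zip.putNextEntry(new ZipEntry(HEADER_ENTRY));
            String header = "records=" + recordCount + "\nbytes=" + totalBytes + "\n";
            zip.write(header.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
    }

    /** Reads every entry into memory; fine for a demo, not for very large archives. */
    static Map<String, byte[]> readAll(byte[] zipBytes) throws IOException {
        Map<String, byte[]> entries = new LinkedHashMap<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            byte[] buffer = new byte[8192];
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                ByteArrayOutputStream content = new ByteArrayOutputStream();
                int n;
                while ((n = zip.read(buffer, 0, buffer.length)) != -1) {
                    content.write(buffer, 0, n);
                }
                entries.put(entry.getName(), content.toByteArray());
            }
        }
        return entries;
    }
}
```

Note that the header entry physically sits after the data in the archive. A consumer that opens the archive with java.util.zip.ZipFile can still read the header entry first, because ZipFile looks entries up by name rather than scanning the stream in order.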

The way the DEFLATE format is specified, it only makes sense if you read it from the start. So each time you seek back and forth, the underlying zip implementation would have to start reading the file from the start. And if you modify something, the whole file would have to be decompressed first (not just up to the modification point), the change applied to the decompressed data, and then the whole thing compressed again.

To sum up, ZIP/DEFLATE isn't the right format for this. However, breaking your data up into smaller, fixed-size files that are compressed individually might be feasible.
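
To make the fixed-size idea concrete, here is a small in-memory sketch, not a real file format. It keeps each chunk as its own DEFLATE stream, so changing one byte only recompresses the chunk that contains it. The class and method names are invented for the example, and a real version would keep the chunks on disk (for instance as separate files or zip entries) rather than in a list.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

final class ChunkedCompressedStore {
    private final int chunkSize;
    // Each element is an independent DEFLATE stream holding exactly chunkSize raw bytes.
    private final List<byte[]> compressedChunks = new ArrayList<>();

    ChunkedCompressedStore(int chunkSize) {
        this.chunkSize = chunkSize;
    }

    /** Overwrites one byte; only the chunk that contains it is decompressed and recompressed. */
    void writeByte(long position, byte value) {
        int index = (int) (position / chunkSize);
        int offset = (int) (position % chunkSize);
        while (compressedChunks.size() <= index) {
            compressedChunks.add(deflate(new byte[chunkSize])); // pad with zero-filled chunks
        }
        byte[] raw = inflate(compressedChunks.get(index));
        raw[offset] = value;
        compressedChunks.set(index, deflate(raw));
    }

    byte readByte(long position) {
        int index = (int) (position / chunkSize);
        int offset = (int) (position % chunkSize);
        if (index >= compressedChunks.size()) {
            throw new IndexOutOfBoundsException("position " + position + " is past the end");
        }
        return inflate(compressedChunks.get(index))[offset];
    }

    int chunkCount() {
        return compressedChunks.size();
    }

    byte[] compressedChunk(int index) {
        return compressedChunks.get(index);
    }

    private static byte[] deflate(byte[] raw) {
        Deflater deflater = new Deflater();
        try {
            deflater.setInput(raw);
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[1024];
            while (!deflater.finished()) {
                int n = deflater.deflate(buffer);
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        } finally {
            deflater.end(); // release native resources
        }
    }

    private byte[] inflate(byte[] compressed) {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(compressed);
            byte[] raw = new byte[chunkSize];
            int filled = 0;
            while (filled < chunkSize && !inflater.finished()) {
                int n = inflater.inflate(raw, filled, chunkSize - filled);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalStateException("truncated or unsupported chunk");
                }
                filled += n;
            }
            return raw;
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt chunk", e);
        } finally {
            inflater.end();
        }
    }
}
```

The trade-off is the usual one: smaller chunks make random writes cheaper but compress worse, because each chunk starts with no knowledge of the data before it.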

The point of compression is to recognize redundancy in data (like some characters occurring more often, or repeated patterns) and make the data smaller by encoding it without that redundancy. This makes it infeasible to create a compression algorithm that would allow random access writing. In particular:

You never know in advance how well a piece of data can be compressed. So if you change some block of data, its compressed version will most likely be either longer or shorter.
As a compression algorithm processes the data stream, it uses the knowledge accumulated so far (like discovered repeated patterns) to compress the data at its current position. So if you change something, the algorithm needs to re-compress everything from this change to the end.
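
The first point is easy to see for yourself. The sketch below (class name and sizes are arbitrary choices for the demo) compresses a 64 KiB block of zeros with java.util.zip.Deflater, then overwrites 4 KiB in the middle with pseudo-random bytes and compresses it again. The uncompressed length is identical in both cases, but the compressed length is not, so the new data would no longer fit in the space the old compressed data occupied.

```java
import java.util.Random;
import java.util.zip.Deflater;

final class CompressedSizeDemo {
    /** Returns the number of bytes DEFLATE produces for the given input. */
    static int deflatedLength(byte[] raw) {
        Deflater deflater = new Deflater();
        try {
            deflater.setInput(raw);
            deflater.finish();
            byte[] buffer = new byte[4096];
            int total = 0;
            while (!deflater.finished()) {
                total += deflater.deflate(buffer); // only counting, output is discarded
            }
            return total;
        } finally {
            deflater.end();
        }
    }

    /** Builds the demo block: all zeros, optionally with 4 KiB of noise in the middle. */
    static byte[] block(boolean withNoise) {
        byte[] block = new byte[64 * 1024];
        if (withNoise) {
            byte[] noise = new byte[4096];
            new Random(42).nextBytes(noise); // fixed seed so runs are repeatable
            System.arraycopy(noise, 0, block, 30 * 1024, noise.length);
        }
        return block;
    }

    public static void main(String[] args) {
        int before = deflatedLength(block(false));
        int after = deflatedLength(block(true));
        System.out.println("compressed size before edit: " + before + " bytes");
        System.out.println("compressed size after edit:  " + after + " bytes");
    }
}
```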