
Java ZipInputStream skipping unused ZipEntry content, rather than draining it

Original post: 2021-04-28 16:16:31 · Tags: java, zip, inputstream

I'm trying to read ZipEntry content from a zip as efficiently as possible. To do that, I need the standard ZipInputStream to use InputStream.skip for entry content I don't need, rather than draining it.

As far as I understand from the ZIP (file format) wiki:

Because the files in a ZIP archive are compressed individually it is possible to extract them, or add new ones, without applying compression or decompression to the entire archive. This contrasts with the format of compressed tar files, for which such random-access processing is not easily possible.

From this I assume that it should be possible to determine how much unneeded content to skip without decompressing the entry's content.

However, I see that both ZipInputStream (Java standard library) and ZipArchiveInputStream (Apache) drain the stream until the next entry rather than skipping it, which makes my use of them very inefficient.

I'm not fully familiar with the ZIP specification, and seeing this behavior in two widely used ZIP APIs makes me think that skipping might be impossible.

Is my understanding incorrect and such optimal behavior not possible? If it is possible, which Java API do you suggest for reading ZIP entries efficiently?

1 answer

The problem here is that ZipInputStream is a stream. You start by reading the LOC (local file header) for the first entry, read the entry (decompress, checksum, etc.), and repeat until there are no more entries (or rather, no more LOCs).
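To make the sequential behavior concrete, here is a minimal sketch that looks for a single entry with ZipInputStream. The class name SequentialZipRead, the helper methods, and the sample entry names are made up for illustration; only the java.util.zip calls are standard. Even though only "c.txt" is wanted, every earlier entry still has to be read through before its header is reached.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class SequentialZipRead {

    // Builds a small in-memory archive so the example is self-contained.
    static byte[] buildZip() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(bytes)) {
            for (String name : new String[] {"a.txt", "b.txt", "c.txt"}) {
                out.putNextEntry(new ZipEntry(name));
                out.write(("content of " + name).getBytes(StandardCharsets.UTF_8));
                out.closeEntry();
            }
        }
        return bytes.toByteArray();
    }

    // Returns the content of the wanted entry, or null if it is not present.
    static String readEntry(byte[] zip, String wanted) throws IOException {
        try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(zip))) {
            ZipEntry entry;
            // getNextEntry() first closes the previous entry, which reads
            // through that entry's remaining data to reach the next header.
            while ((entry = in.getNextEntry()) != null) {
                if (entry.getName().equals(wanted)) {
                    return new String(in.readAllBytes(), StandardCharsets.UTF_8);
                }
            }
        }
        return null;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readEntry(buildZip(), "c.txt"));
    }
}
```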

The end of the file/stream contains the directory for the whole zip contents, which is what enables random access (or displaying the zip file structure). When streaming data, you can't access the end of the stream. So even if you could seek, you wouldn't know where to seek to. You have to decompress to know when the data for the entry ends, then you get the LOC for the next entry, and so on.
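If the archive is available as a file rather than only as a stream, the directory at the end can be used. The following is a sketch under that assumption, using java.util.zip.ZipFile, which looks entries up by name and opens only the requested one. The class name RandomAccessZipRead, its helper methods, and the temporary sample archive are invented for the example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

class RandomAccessZipRead {

    // Writes a small sample archive to a temporary file (illustrative data).
    static Path writeSampleZip() throws IOException {
        Path path = Files.createTempFile("sample", ".zip");
        try (ZipOutputStream out = new ZipOutputStream(Files.newOutputStream(path))) {
            for (String name : new String[] {"a.txt", "b.txt", "c.txt"}) {
                out.putNextEntry(new ZipEntry(name));
                out.write(("content of " + name).getBytes(StandardCharsets.UTF_8));
                out.closeEntry();
            }
        }
        return path;
    }

    // Looks the entry up by name and reads only that entry's data.
    static String readEntry(Path zipPath, String wanted) throws IOException {
        try (ZipFile zip = new ZipFile(zipPath.toFile())) {
            ZipEntry entry = zip.getEntry(wanted);
            if (entry == null) {
                return null;
            }
            try (InputStream in = zip.getInputStream(entry)) {
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path path = writeSampleZip();
        try {
            System.out.println(readEntry(path, "c.txt"));
        } finally {
            Files.deleteIfExists(path);
        }
    }
}
```

This only helps when the source is seekable; for a pure network or pipe stream the sequential approach remains the only option.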

In this duplicate question it's said that the only source of truth is the central directory, so we can't rely on an entry's compressed size for skipping anyway.
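A related illustration, offered as a sketch rather than a statement about every archive: when java.util.zip.ZipOutputStream writes a DEFLATED entry without sizes set up front, the sizes are stored after the entry data rather than in the local header. In that case, as far as I can tell, ZipInputStream reports the compressed size as -1 until the entry has been read to its end. The class name LocalHeaderSizes and the sample entry are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class LocalHeaderSizes {

    // Builds a one-entry archive without setting size, compressed size or CRC
    // on the entry, so the sizes are not known when the header is written.
    static byte[] buildZip() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(bytes)) {
            out.putNextEntry(new ZipEntry("a.txt"));
            out.write("some sample content".getBytes(StandardCharsets.UTF_8));
            out.closeEntry();
        }
        return bytes.toByteArray();
    }

    // Returns {compressed size seen right after the header,
    //          compressed size seen after the entry was read to its end}.
    static long[] compressedSizeBeforeAndAfter(byte[] zip) throws IOException {
        try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(zip))) {
            ZipEntry entry = in.getNextEntry();
            long before = entry.getCompressedSize(); // -1 means "not known"
            in.closeEntry();                         // reads the rest of the entry
            long after = entry.getCompressedSize();
            return new long[] {before, after};
        }
    }

    public static void main(String[] args) throws IOException {
        long[] sizes = compressedSizeBeforeAndAfter(buildZip());
        System.out.println("before reading: " + sizes[0]);
        System.out.println("after reading:  " + sizes[1]);
    }
}
```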