
Reading a huge Zip file in java - Out of Memory Error

I am reading a ZIP file using java as below:

Enumeration<? extends ZipEntry> zes = zip.entries();
while (zes.hasMoreElements()) {
    ZipEntry ze = zes.nextElement();
    // do stuff..
}

I am getting an out of memory error; the zip file size is about 160 MB. The stacktrace is as below:

Exception in thread "Timer-0" java.lang.OutOfMemoryError: Java heap space
at java.util.zip.InflaterInputStream.<init>(InflaterInputStream.java:88)
at java.util.zip.ZipFile$1.<init>(ZipFile.java:229)
at java.util.zip.ZipFile.getInputStream(ZipFile.java:229)
at java.util.zip.ZipFile.getInputStream(ZipFile.java:197)
at com.aesthete.csmart.batches.batchproc.DatToInsertDBBatch.zipFilePass2(DatToInsertDBBatch.java:250)
at com.aesthete.csmart.batches.batchproc.DatToInsertDBBatch.processCompany(DatToInsertDBBatch.java:206)
at com.aesthete.csmart.batches.batchproc.DatToInsertDBBatch.run(DatToInsertDBBatch.java:114)
at java.util.TimerThread.mainLoop(Timer.java:534)
at java.util.TimerThread.run(Timer.java:484)

How do I enumerate the contents of a big zip file without having to increase my heap size? Also, when I don't enumerate the contents and just access a single file like this:

ZipFile zip=new ZipFile(zipFile);
ZipEntry ze=zip.getEntry("docxml.xml");

Then I don't get an out of memory error. Why does this happen? How does ZipFile handle zip entries? The other option would be to use a ZipInputStream. Would that have a smaller memory footprint? I would eventually need to run this code on a micro EC2 instance on the Amazon cloud (613 MB RAM).
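For comparison, here is a minimal sketch of enumerating an archive with ZipInputStream instead of ZipFile. It reads the archive strictly sequentially, so only one entry's inflater state is in memory at a time; the class and method names below are illustrative, not from the original code:

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class ZipStreamList {
    // Returns the entry names by streaming through the archive;
    // only one entry's state is held in memory at any point.
    static List<String> listEntries(String path) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zis = new ZipInputStream(new FileInputStream(path))) {
            ZipEntry ze;
            while ((ze = zis.getNextEntry()) != null) {
                names.add(ze.getName());
                zis.closeEntry(); // release per-entry state before moving on
            }
        }
        return names;
    }
}
```

Note that ZipInputStream cannot jump to a named entry the way ZipFile.getEntry can; it only walks forward through the archive.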

EDIT: providing more information on how I process the zip entries after I get them

Enumeration<? extends ZipEntry> zes = zip.entries();
while (zes.hasMoreElements()) {
    ZipEntry ze = zes.nextElement();
    S3Object s3Object = new S3Object(bkp.getCompanyFolder() + map.get(ze.getName()).getRelativeLoc());
    s3Object.setDataInputStream(zip.getInputStream(ze));
    s3Object.setStorageClass(S3Object.STORAGE_CLASS_REDUCED_REDUNDANCY);
    s3Object.addMetadata("x-amz-server-side-encryption", "AES256");
    s3Object.setContentType(Mimetypes.getInstance().getMimetype(s3Object.getKey()));
    s3Object.setContentDisposition("attachment; filename=" + FilenameUtils.getName(s3Object.getKey()));
    s3objs.add(s3Object);
}

I get the input stream from the zip entry and store it in the S3Object. I collect all the S3Objects in a list and then finally upload them to Amazon S3. For those who don't know Amazon S3, it's a file storage service; you upload files via HTTP.

I am thinking maybe this is happening because I collect all the individual input streams? Would it help if I batched it up, say 100 input streams at a time? Or would it be better to unzip the archive first and upload the unzipped files rather than storing streams?

It is very unlikely that you get an out of memory exception because of processing a ZIP file. The Java classes ZipFile and ZipEntry don't contain anything that could possibly fill up 613 MB of memory.

What could exhaust your memory is keeping the decompressed files of the ZIP archive in memory, or, even worse, keeping them as an XML DOM, which is very memory intensive.

Switching to another ZIP library will hardly help. Instead, you should look into changing your code so that it processes the ZIP archive and the contained files as streams, keeping only a limited part of each file in memory at a time.
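Stream processing could look roughly like this: each entry is copied through a small fixed-size buffer to some sink (an upload stream, a file, and so on), so memory use stays bounded no matter how large the entry is. The copyAll method and the generic sink are illustrative, not part of the asker's code:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class StreamingZipCopy {
    // Copies every entry of the archive to the sink 8 KB at a time,
    // so only one small buffer is resident regardless of entry size.
    static void copyAll(ZipFile zip, OutputStream sink) throws IOException {
        byte[] buf = new byte[8192];
        Enumeration<? extends ZipEntry> zes = zip.entries();
        while (zes.hasMoreElements()) {
            ZipEntry ze = zes.nextElement();
            try (InputStream in = zip.getInputStream(ze)) {
                int n;
                while ((n = in.read(buf)) != -1) {
                    sink.write(buf, 0, n);
                }
            }
        }
    }
}
```

The key point is that the entry's InputStream is consumed and closed inside the loop instead of being stored in a list for later.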

BTW: it would be nice if you could provide more information about the huge ZIP files (do they contain many small files or a few large ones?) and about what you do with each ZIP entry.

Update:

Thanks for the additional information. It looks like you keep the contents of the ZIP file in memory (although it somewhat depends on the implementation of the S3Object class, which I don't know).

It's probably best to implement some sort of batching, as you propose yourself. You could, for example, add up the decompressed size of each ZIP entry and upload the files every time the total size exceeds 100 MB.
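The batching idea could be sketched as follows. This hypothetical helper only plans the batches by accumulated decompressed size (as reported by ZipEntry.getSize()); the actual S3 upload of each batch is left to the supplied callback:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class BatchPlanner {
    // Groups entry sizes into batches whose decompressed total stays
    // near the limit, flushing each full batch to the given consumer.
    // In the real code the consumer would stream the batch's entries to S3.
    static void batch(List<Long> entrySizes, long limit, Consumer<List<Long>> flush) {
        List<Long> batch = new ArrayList<>();
        long total = 0;
        for (long size : entrySizes) {
            batch.add(size);
            total += Math.max(size, 0); // getSize() may return -1 when unknown
            if (total >= limit) {
                flush.accept(new ArrayList<>(batch));
                batch.clear();
                total = 0;
            }
        }
        if (!batch.isEmpty()) {
            flush.accept(batch); // upload whatever is left over
        }
    }
}
```

With a 100 MB limit, each flush would trigger one upload pass, after which the accumulated streams can be released before the next batch starts.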

You're using the ZipFile class now, as I see. Using ZipInputStream would probably be a better option because it has a closeEntry() method which (as I hope) deallocates the memory resources used by an entry. But I haven't used it before; it's just a guess.

The default heap size of a JVM is 64 MB. You need to specify a larger size on the command line using the -Xmx switch, e.g. -Xmx256m.

Indeed, java.util.zip.ZipFile has a size() method, but it doesn't provide a way to access entries by index. Perhaps you need to use a different ZIP library; as I remember, I used TrueZIP with rather large archives.
