
Reading zip file efficiently in Java

I am working on a project which works on a very large amount of data. I have a lot (thousands) of zip files, each containing ONE simple txt file with thousands of lines (about 80k lines). What I am currently doing is the following:

for (File zipFile : dir.listFiles()) {
    ZipFile zf = new ZipFile(zipFile);
    ZipEntry ze = zf.entries().nextElement(); // the single txt entry
    BufferedReader in = new BufferedReader(new InputStreamReader(zf.getInputStream(ze)));
    ...

In this way I can read the file line by line, but it is definitely too slow. Given the large number of files and lines that need to be read, I need to read them in a more efficient way.

I have looked for a different approach, but I haven't been able to find anything. What I think I should use are the java.nio APIs, which are intended precisely for intensive I/O operations, but I don't know how to use them with zip files.

Any help would really be appreciated.

Thanks,

Marco

I have a lot (thousands) of zip files. The zipped files are about 30MB each, while the txt inside the zip file is about 60/70 MB. Reading and processing the files with this code takes many hours, around 15, but it depends.

Let's do some back-of-the-envelope calculations.

Let's say you have 5000 files. If it takes 15 hours to process them, this equates to ~10 seconds per file. The files are about 30MB each, so the throughput is ~3MB/s.
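
As a quick check of those figures (the file count, duration, and size are the assumptions just stated, taken from the question):

public class Envelope {
    public static void main(String[] args) {
        // Assumed figures: 5000 files, 15 hours total, 30 MB per file
        double files = 5000, hours = 15, mbPerFile = 30;
        double secondsPerFile = hours * 3600 / files;    // ~10.8 s per file
        double mbPerSecond = mbPerFile / secondsPerFile; // ~2.8 MB/s
        System.out.printf("%.1f s/file, %.1f MB/s%n", secondsPerFile, mbPerSecond);
    }
}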

This is between one and two orders of magnitude slower than the rate at which ZipFile can decompress stuff.

Either there's a problem with the disks (are they local, or a network share?), or it is the actual processing that is taking most of the time.

The best way to find out for sure is by using a profiler.
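
Short of a full profiler run, one rough way to separate the two is to time a pure decompression pass that just drains the stream and does no processing; if that is far faster than the full pipeline, the bottleneck is the per-line processing. A minimal sketch (the class name is illustrative):

import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class DrainTimer {
    public static void main(String[] args) throws Exception {
        long start = System.nanoTime();
        try (ZipFile zf = new ZipFile(args[0])) {
            ZipEntry entry = zf.entries().nextElement(); // the single txt entry
            try (InputStream in = zf.getInputStream(entry)) {
                byte[] buf = new byte[8192];
                while (in.read(buf) >= 0) { /* drain only, no processing */ }
            }
        }
        System.out.printf("drained in %.2f s%n", (System.nanoTime() - start) / 1e9);
    }
}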

The right way to iterate a zip file:

final ZipFile file = new ZipFile( FILE_NAME );
try
{
    final Enumeration<? extends ZipEntry> entries = file.entries();
    while ( entries.hasMoreElements() )
    {
        final ZipEntry entry = entries.nextElement();
        System.out.println( entry.getName() );
        //use entry input stream:
        readInputStream( file.getInputStream( entry ) );
    }
}
finally
{
    file.close();
}

private static int readInputStream( final InputStream is ) throws IOException {
    // Drain the stream in 8KB chunks, counting the bytes read.
    final byte[] buf = new byte[ 8192 ];
    int read = 0;
    int cntRead;
    while ( ( cntRead = is.read( buf, 0, buf.length ) ) >= 0 )
    {
        read += cntRead;
    }
    return read;
}

A zip file consists of several entries, each of which has a field containing the number of bytes in the current entry. So, it is easy to iterate over all zip file entries without actually decompressing the data. java.util.zip.ZipFile accepts a file/file name and uses random access to jump between file positions. java.util.zip.ZipInputStream, on the other hand, works with streams, so it is unable to jump freely. That's why it has to read and decompress all the zip data in order to reach EOF for each entry and read the next entry header.

What does it mean? If you already have a zip file in your file system, use ZipFile to process it regardless of your task. As a bonus, you can access zip entries either sequentially or randomly (with a rather small performance penalty). On the other hand, if you are processing a stream, you'll need to process all entries sequentially using ZipInputStream.
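
For the stream case, a minimal ZipInputStream sketch looks like this (assuming the zip arrives as some InputStream `in`):

import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Sequential iteration: each getNextEntry() call has to read through
// the rest of the previous entry's compressed data before it can
// position itself on the next entry header.
static void listEntries(InputStream in) throws IOException {
    try (ZipInputStream zis = new ZipInputStream(in)) {
        ZipEntry entry;
        while ((entry = zis.getNextEntry()) != null) {
            System.out.println(entry.getName());
        }
    }
}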

Here is an example. A zip archive (total file size = 1.6Gb) containing three 0.6Gb entries was iterated in 0.05 sec using ZipFile and in 18 sec using ZipInputStream.

You can use the new file API like this:

import java.nio.channels.ReadableByteChannel;
import java.nio.file.*;
import java.util.EnumSet;

Path jarPath = Paths.get(...);
// the cast disambiguates the overload on newer JDKs (Java 13 added newFileSystem(Path, Map))
try (FileSystem jarFS = FileSystems.newFileSystem(jarPath, (ClassLoader) null)) {
    Path someFileInJarPath = jarFS.getPath("/...");
    try (ReadableByteChannel rbc = Files.newByteChannel(someFileInJarPath, EnumSet.of(StandardOpenOption.READ))) {
        // read file
    }
}

The code is for jar files, but I think it should work for zips as well.
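
For the one-txt-per-zip layout from the question, a hedged sketch along the same lines might be (the path and entry names are placeholders):

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.*;

static void readZipEntryLines(Path zipPath, String entryName) throws IOException {
    // Mount the zip as a FileSystem, then stream the text entry line by line.
    try (FileSystem zipFS = FileSystems.newFileSystem(zipPath, (ClassLoader) null);
         BufferedReader reader = Files.newBufferedReader(zipFS.getPath(entryName))) {
        String line;
        while ((line = reader.readLine()) != null) {
            // process the line
        }
    }
}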

You can try this code:

try
{
    final ZipFile zf = new ZipFile("C:/Documents and Settings/satheesh/Desktop/POTL.Zip");
    final Enumeration<? extends ZipEntry> entries = zf.entries();

    while (entries.hasMoreElements())
    {
        final ZipEntry zipEntry = entries.nextElement();
        final InputStream inputs = zf.getInputStream(zipEntry);

        // copy the entry line by line into the destination file f2 (declared elsewhere)
        final BufferedReader br = new BufferedReader(new InputStreamReader(inputs, "UTF-8"));
        final BufferedWriter wr = new BufferedWriter(new FileWriter(f2));

        String line;
        while ((line = br.readLine()) != null)
        {
            wr.write(line);
            System.out.println(line);
            wr.newLine();
            wr.flush();
        }
        br.close();
        wr.close();
    }
    zf.close();
}
catch (Exception e)
{
    System.out.print(e);
}
finally
{
    System.out.println("\n\n\nThe file has been extracted successfully");
}

This code works well.

Intel has made an improved version of zlib, which Java uses internally to perform zip/unzip. It requires you to patch the zlib sources with Intel's IPP patches. I made a benchmark showing 1.4x to 3x gains in throughput.

Asynchronous unpacking and synchronous processing

Using the advice from Java Performance, which is much like the answer from Wasim Wani and that from Satheesh Kumar (iterating over the ZIP entries to get the InputStream of each of them and doing something with them), I built my own solution.

In my case, the processing is the bottleneck, so I massively launch parallel extraction at the beginning, iterating on entries.hasMoreElements(), and place each of the results in a ConcurrentLinkedQueue that I consume from the processing thread. My ZIP contains a collection of XML files representing serialized Java objects, so my "extracting" includes deserializing the objects, and those deserialized objects are the ones placed in the queue.
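
A minimal sketch of that shape, with illustrative names (this is not the answerer's actual code):

import java.io.File;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Extractor threads unzip + deserialize and publish results; a single
// processing thread consumes from the queue while extraction continues.
static final ConcurrentLinkedQueue<Object> queue = new ConcurrentLinkedQueue<>();

static void run(File[] zips) {
    ExecutorService extractors =
            Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
    for (File zip : zips) {
        extractors.submit(() -> queue.addAll(extractAndDeserialize(zip)));
    }
    extractors.shutdown();

    // Synchronous processing on this thread (busy-wait kept for brevity).
    while (!extractors.isTerminated() || !queue.isEmpty()) {
        Object item = queue.poll();
        if (item != null) {
            process(item);
        }
    }
}

static List<Object> extractAndDeserialize(File zip) { /* unzip + deserialize */ return Collections.emptyList(); }
static void process(Object item) { /* the actual (bottleneck) work */ }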

For me, this has a few advantages compared to my previous approach of sequentially getting each file from the ZIP and processing it:

  1. the more compelling one: a 10% reduction in total time
  2. the release of the file occurs earlier
  3. the whole amount of RAM is allocated more quickly, so if there is not enough RAM it will fail faster (in a matter of tens of minutes instead of over one hour); please note that the amount of memory I keep allocated after processing is quite similar to that occupied by the unzipped files; otherwise, it would be better to unzip and discard sequentially to keep the memory footprint lower
  4. unzipping and deserializing seems to have high CPU usage, so the faster it is finished, the sooner you get your CPU for the processing, which is what really matters

There is one disadvantage: the flow control is a little bit more complex when including parallelism.
