
Extracting PDF inside a Zip inside a Zip

I have searched online and on Stack Overflow and could not find a match specific to this issue. I am trying to extract a PDF file that is located in a zip file that is itself inside another zip file (nested zips). Calling my extraction method again does not work, nor does changing the whole program to accept InputStreams instead of the approach shown below. I am getting java.io.IOException: Stream Closed. My code is below, and the failing line is marked with the error message.

public static void main(String[] args)
    {
        try
        {
            //Paths
            String basePath = "C:\\Users\\user\\Desktop\\Scan\\";
            File lookupDir = new File(basePath + "Data\\");
            String doneFolder = basePath + "DoneUnzipping\\";       
            
            File[] directoryListing = lookupDir.listFiles();
                
            for (int i = 0; i < directoryListing.length; i++) 
            {
                if (directoryListing[i].isFile()) //there's definitely a file
                {
                    //Save the current file's path
                    String pathOrigFile = directoryListing[i].getAbsolutePath();
                    Path origFileDone = Paths.get(pathOrigFile);
                    Path newFileDone = Paths.get(doneFolder + directoryListing[i].getName());
                            
                    //unzip it
                    if(directoryListing[i].getName().toUpperCase().endsWith(ZIP_EXTENSION)) //ZIP files
                    {
                        unzip(directoryListing[i].getAbsolutePath(), DESTINATION_DIRECTORY + directoryListing[i].getName());
                            
                        //move to the 'DoneUnzipping' folder
                        Files.move(origFileDone, newFileDone);                            
                    }
                }
            }
        } catch (Exception e)
        {
            e.printStackTrace(System.out);
        }
    }
            
    private static void unzip(String zipFilePath, String destDir) 
    {        
        //buffer for read and write data to file
        byte[] buffer = new byte[BUFFER_SIZE];
        
        try {
                FileInputStream fis = new FileInputStream(zipFilePath);
                ZipInputStream zis = new ZipInputStream(fis);
                ZipEntry ze = zis.getNextEntry();
                
                while(ze != null)
                {
                    String fileName = ze.getName();
                    int index = fileName.lastIndexOf("/");
                    String newFileName = fileName.substring(index + 1);
                    File newFile = new File(destDir + File.separator + newFileName);
                    
                    //Zips inside zips  
                    if(fileName.toUpperCase().endsWith(ZIP_EXTENSION))
                    {                      
                        try(ZipInputStream innerZip = new ZipInputStream(fis)) 
                            {
                                ZipEntry innerEntry = null;
                                while((innerEntry = innerZip.getNextEntry()) != null) 
                                {
                                    System.out.println("The file: " + fileName);
                                    if(fileName.toUpperCase().endsWith("PDF")) 
                                    {
                                        FileOutputStream fos = new FileOutputStream(newFile);
                                        int len;
                                        while ((len = zis.read(buffer)) > 0) 
                                        {
                                            fos.write(buffer, 0, len);
                                        }
                                        fos.close();
                                    }
                                }
                            }

                    }
                    
                //close this ZipEntry
                zis.closeEntry(); // java.io.IOException: Stream Closed
                ze = zis.getNextEntry();                       
                
                }  
            
            //close last ZipEntry
            zis.close();
            fis.close();
        } catch (IOException e) 
        {
            e.printStackTrace();
        }
        
    }

The solution to this is not as obvious as it seems. Although I wrote a few zip utilities myself some time ago, reading zip entries from inside another zip file only seems obvious in retrospect (and I also got java.io.IOException: Stream Closed on my first attempt).

The Java classes ZipFile and ZipInputStream really steer your thinking towards using the file system, but it is not required.

The functions below scan a parent-level zip file and keep scanning until they find an entry with a specified name. (Nearly) everything is done in memory.

Naturally, this can be modified to use different search criteria, find multiple file types, and so on, and to take different actions, but it at least demonstrates the basic technique in question: zip files inside of zip files. No guarantees on other aspects of the code; someone more savvy could most likely improve the style.

final static String ZIP_EXTENSION = ".zip";

public static byte[] getOnePDF() throws IOException
{
    final File source = new File("/path/to/MegaData.zip");
    final String nameToFind = "FindThisFile.pdf";

    final ByteArrayOutputStream mem = new ByteArrayOutputStream();

    try (final ZipInputStream in = new ZipInputStream(new BufferedInputStream(new FileInputStream(source))))
    {
        digIntoContents(in, nameToFind, mem);
    }

    // Save to disk, if you want
    // copy(new ByteArrayInputStream(mem.toByteArray()), new FileOutputStream(new File("/path/to/output.pdf")));

    // Otherwise, just return the binary data
    return mem.toByteArray();
}

// Returns true once the entry has been found, so that every enclosing scan stops too
private static boolean digIntoContents(final ZipInputStream in, final String nameToFind, final ByteArrayOutputStream mem) throws IOException
{
    ZipEntry entry;
    while (null != (entry = in.getNextEntry()))
    {
        final String name = entry.getName();

        // Found the file we are looking for
        if (name.equals(nameToFind))
        {
            copy(in, mem);
            return true;
        }

        // Found another zip file: buffer it in memory and search it recursively
        if (name.toUpperCase().endsWith(ZIP_EXTENSION.toUpperCase()))
        {
            if (digIntoContents(new ZipInputStream(new ByteArrayInputStream(getZipEntryFromMemory(in))), nameToFind, mem))
            {
                return true;
            }
        }
    }
    return false;
}

private static byte[] getZipEntryFromMemory(final ZipInputStream in) throws IOException
{
    final ByteArrayOutputStream mem = new ByteArrayOutputStream();
    copy(in, mem);
    return mem.toByteArray();
}

// General purpose, reusable, utility function
// OK for binary data (bad for non-ASCII text, use Reader/Writer instead)
public static void copy(final InputStream from, final OutputStream to) throws IOException
{
    final int bufferSize = 4096;

    final byte[] buf = new byte[bufferSize];
    int len;
    while (0 < (len = from.read(buf)))
    {
        to.write(buf, 0, len);
    }
    to.flush();
}
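
As an illustration of the "find multiple file types" modification mentioned above, here is a minimal sketch that collects every PDF at any nesting depth instead of stopping at one name. The class name NestedPdfCollector, the helper names, and the "!/" separator used to label nested paths are all assumptions made for this example; zipOf exists only to build test data in memory.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class NestedPdfCollector {

    // Hypothetical helper: returns every *.pdf entry, keyed by its nested path.
    static Map<String, byte[]> collectPdfs(InputStream zipData) throws IOException {
        Map<String, byte[]> found = new LinkedHashMap<>();
        try (ZipInputStream in = new ZipInputStream(zipData)) {
            collect(in, "", found);
        }
        return found;
    }

    private static void collect(ZipInputStream in, String prefix, Map<String, byte[]> found)
            throws IOException {
        ZipEntry entry;
        while ((entry = in.getNextEntry()) != null) {
            if (entry.isDirectory()) {
                continue;
            }
            String name = entry.getName();
            String lower = name.toLowerCase(Locale.ROOT);
            if (lower.endsWith(".pdf")) {
                found.put(prefix + name, readEntry(in));
            } else if (lower.endsWith(".zip")) {
                // Buffer the inner zip in memory, then scan it recursively.
                ByteArrayInputStream inner = new ByteArrayInputStream(readEntry(in));
                collect(new ZipInputStream(inner), prefix + name + "!/", found);
            }
        }
    }

    // Reads the current entry only; read() returns -1 at the end of the entry.
    private static byte[] readEntry(ZipInputStream in) throws IOException {
        ByteArrayOutputStream mem = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int len;
        while ((len = in.read(buf)) != -1) {
            mem.write(buf, 0, len);
        }
        return mem.toByteArray();
    }

    // Test-data helper: builds a zip with a single entry, entirely in memory.
    static byte[] zipOf(String entryName, byte[] content) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bytes)) {
            zos.putNextEntry(new ZipEntry(entryName));
            zos.write(content);
            zos.closeEntry();
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] inner = zipOf("report.pdf", "fake pdf bytes".getBytes());
        byte[] outer = zipOf("inner.zip", inner);
        collectPdfs(new ByteArrayInputStream(outer))
                .forEach((path, data) -> System.out.println(path + " (" + data.length + " bytes)"));
    }
}
```

Buffering each inner zip keeps the recursion simple, at the cost of holding the whole inner archive in memory; for very large archives that trade-off may not be acceptable.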

The line that causes your problem appears to be the auto-closing try-with-resources block you created when reading the inner zip:

try(ZipInputStream innerZip = new ZipInputStream(fis)) {
   ...
}

There are several likely issues. Firstly, it is reading the wrong stream: fis, not the existing zis.

Secondly, you shouldn't use try-with-resources for auto-close on innerZip, as this implicitly calls innerZip.close() when exiting the block. If you view the source code of ZipInputStream via a good IDE you should see (eventually) that ZipInputStream extends InflaterInputStream, which itself extends FilterInputStream. A call to innerZip.close() will close the underlying outer stream zis (fis in your case), hence the stream is already closed when you resume with the next entry of the outer zip.
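
The close propagation described above can be observed with a small in-memory experiment. This is only a sketch: the class and method names (CloseBehaviourDemo, outerClosedByInner, zipOf) are made up for the illustration, and the nested zip is built in memory rather than read from disk.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class CloseBehaviourDemo {

    // Test-data helper: builds a zip with a single entry, entirely in memory.
    static byte[] zipOf(String entryName, byte[] content) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bytes)) {
            zos.putNextEntry(new ZipEntry(entryName));
            zos.write(content);
            zos.closeEntry();
        }
        return bytes.toByteArray();
    }

    // Returns true if the outer stream can no longer be used after handling the inner one.
    static boolean outerClosedByInner(byte[] nestedZip, boolean closeInner) throws IOException {
        ZipInputStream outer = new ZipInputStream(new ByteArrayInputStream(nestedZip));
        try {
            outer.getNextEntry();                    // now positioned on the inner zip entry
            ZipInputStream inner = new ZipInputStream(outer);
            inner.getNextEntry();
            if (closeInner) {
                inner.close();                       // what try-with-resources does implicitly
            }
            try {
                outer.getNextEntry();                // resume the outer zip
                return false;
            } catch (IOException e) {
                return true;                         // the outer stream was closed underneath us
            }
        } finally {
            outer.close();
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] nested = zipOf("inner.zip", zipOf("a.pdf", "fake pdf bytes".getBytes()));
        System.out.println("inner closed    -> outer unusable: " + outerClosedByInner(nested, true));
        System.out.println("inner left open -> outer unusable: " + outerClosedByInner(nested, false));
    }
}
```

Leaving the inner wrapper unclosed is safe here because closing the outermost stream releases the underlying resource.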

Therefore remove the try() block and wrap zis instead:

ZipInputStream innerZip = new ZipInputStream(zis);

Use a try-with-resources block only for the outermost file handling:

try (ZipInputStream zis = new ZipInputStream(new FileInputStream(zipFilePath))) {
    ZipEntry ze = zis.getNextEntry();
    ...
}

Thirdly, you appear to be copying the wrong stream when extracting a PDF: use innerZip, not the outer zis. You should be able to switch to a one-line Files.copy, simply as:

if(fileName.toUpperCase().endsWith("PDF")) {
    Files.copy(innerZip, newFile.toPath());
}
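
Putting the three fixes together, the unzip method might look roughly like the sketch below. It is not a definitive rewrite of the question's code. The class name NestedUnzip is made up, the method throws IOException instead of catching it, and two details go beyond what the answer states and are my assumptions: the PDF check uses the inner entry's name (the question's code tests the outer entry's name), and REPLACE_EXISTING is passed so that a rerun does not fail on an existing target file.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;
import java.util.Locale;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

class NestedUnzip {

    // Extracts every PDF found in zips nested one level inside zipFilePath into destDir.
    static void unzip(String zipFilePath, String destDir) throws IOException {
        Files.createDirectories(new File(destDir).toPath());

        // Only the outermost stream is managed by try-with-resources.
        try (ZipInputStream zis = new ZipInputStream(new FileInputStream(zipFilePath))) {
            ZipEntry ze;
            while ((ze = zis.getNextEntry()) != null) {
                if (!ze.getName().toLowerCase(Locale.ROOT).endsWith(".zip")) {
                    continue; // not a nested zip
                }

                // Wrap zis (not the raw file stream) and deliberately do NOT close
                // innerZip: closing it would close zis as well.
                ZipInputStream innerZip = new ZipInputStream(zis);
                ZipEntry innerEntry;
                while ((innerEntry = innerZip.getNextEntry()) != null) {
                    String innerName = innerEntry.getName();
                    if (innerEntry.isDirectory()
                            || !innerName.toLowerCase(Locale.ROOT).endsWith(".pdf")) {
                        continue;
                    }
                    // Keep only the last path segment, as the question's code does.
                    String simpleName = innerName.substring(innerName.lastIndexOf('/') + 1);
                    File target = new File(destDir, simpleName);
                    // Files.copy reads to the end of the current entry and leaves the stream open.
                    Files.copy(innerZip, target.toPath(), StandardCopyOption.REPLACE_EXISTING);
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: java NestedUnzip <outer.zip> <destination folder>
        unzip(args[0], args[1]);
    }
}
```

Because only the last path segment is kept, two PDFs with the same file name in different inner folders would overwrite each other; adjust the naming if that matters for your data.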

Your question asks how to use Java (by implication on Windows) to extract a PDF from a zip inside another outer zip.

On many systems, including Windows, this is a single-line command that depends on the location of the source and target folders. Using the shortest example, the current Downloads folder, it is as simple in a shell as:

tar -xf "german (2).zip" && tar -xf "german.zip" && german.pdf

To run the command from Java on Windows, see How do I execute Windows commands in Java?
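
One way to run such a command from Java is ProcessBuilder. The following is a minimal sketch, assuming a tar executable that can read zip archives is on the PATH (as the commands in this answer presume); the class and method names (TarRunner, tarExtractCommand, run) are hypothetical, and the archive names are the ones from the example above.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TarRunner {

    // Builds the argument list for: tar -xf <archive> [member]
    static List<String> tarExtractCommand(String archive, String member) {
        List<String> command = new ArrayList<>(Arrays.asList("tar", "-xf", archive));
        if (member != null) {
            command.add(member); // extract only this member of the archive
        }
        return command;
    }

    // Runs the command in workingDir and returns its exit code.
    static int run(List<String> command, File workingDir) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command)
                .directory(workingDir)
                .inheritIO() // show tar's own messages on this console
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        File dir = new File(args.length > 0 ? args[0] : ".");
        int exitCode = run(tarExtractCommand("german (2).zip", null), dir);
        if (exitCode == 0) {
            exitCode = run(tarExtractCommand("german.zip", null), dir);
        }
        System.out.println("tar exit code: " + exitCode);
    }
}
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces or parentheses.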

The default PDF viewer can then open the result, so Edge on Windows or, in my case, SumatraPDF.


There is generally no point in putting a PDF inside a zip, because it cannot be opened in place there. If a zip is needed for download or transport, a single level of nesting is advisable.

There is no need to add a password to the zip, because a PDF can use its own password for opening. It is therefore unwise to add two levels of complexity. Keep it simple.

If you have multiple zips nested inside multiple zips, with multiple PDFs in each, then you have to be more specific by filtering names. However, avoid that extra onion skin where possible.

\Downloads>tar -xf "german (2).zip" "both.zip" && tar -xf "both.zip" "English language.pdf"

You could complicate that by working in memory or in a temp folder, but using the native file system is reliable and simple, so consider that it runs fastest without Java:

CD /D "C:/Users/user/Desktop/Scan/DoneUnzipping" && for  %f in (..\Data\*.zip) do  tar -xf "%f" "*.zip" && for  %f in (*.zip) do  tar -xf "%f" "*.pdf" && del "*.zip"

This extracts all inner zips into the working folder, then extracts all PDFs and removes the temporary inner zips. The source double zips are not deleted, only read.
