How to read from a zip file within a zip file in Python?

I have a file that I want to read that is itself zipped within a zip archive. For example, parent.zip contains child.zip, which contains child.txt. I am having trouble reading child.zip. Can anyone correct my code?

I assume that I need to create child.zip as a file-like object and then open it with a second instance of zipfile, but being new to Python, my zipfile.ZipFile(zfile.open(name)) attempt does not work. It raises zipfile.BadZipfile: "File is not a zip file" on the (independently validated) child.zip.

import logging
import re
import zipfile

with zipfile.ZipFile("parent.zip", "r") as zfile:
    for name in zfile.namelist():
        if re.search(r'\.zip$', name) is not None:
            # We have a zip within a zip
            # This is the call that raises BadZipfile for me
            with zipfile.ZipFile(zfile.open(name)) as zfile2:
                for name2 in zfile2.namelist():
                    # Now we can extract
                    logging.info("Found internal internal file: " + name2)
                    print("Processing code goes here")

When you use the .open() call on a ZipFile instance you do indeed get an open file handle. However, to read a zip file, the ZipFile class needs a little more: it needs to be able to seek on that file, and the object returned by .open() is not seekable in your case. Only Python 3 (3.7 and up) produces a ZipExtFile object that supports seeking (provided the underlying file handle for the outer zip file is seekable, and nothing is trying to write to the ZipFile object).

The workaround is to read the whole zip entry into memory using .read(), store it in a BytesIO object (an in-memory file that is seekable) and feed that to ZipFile:

from io import BytesIO

# ...
        zfiledata = BytesIO(zfile.read(name))
        with zipfile.ZipFile(zfiledata) as zfile2:

or, in the context of your example:

import logging
import re
import zipfile
from io import BytesIO

with zipfile.ZipFile("parent.zip", "r") as zfile:
    for name in zfile.namelist():
        if re.search(r'\.zip$', name) is not None:
            # We have a zip within a zip
            # Buffer the inner archive in memory so ZipFile can seek on it
            zfiledata = BytesIO(zfile.read(name))
            with zipfile.ZipFile(zfiledata) as zfile2:
                for name2 in zfile2.namelist():
                    # Now we can extract
                    logging.info("Found internal internal file: " + name2)
                    print("Processing code goes here")
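
As a rough sketch of the seekability point above: one could try the handle from .open() directly when it reports itself seekable, and fall back to the BytesIO buffer otherwise. The helper names build_nested_zip and open_inner_zip are made up for this illustration, and the archive is built in memory so the snippet does not depend on a parent.zip on disk.

```python
import io
import zipfile


def build_nested_zip():
    """Return the bytes of a parent zip holding child.zip, which holds child.txt."""
    child_buf = io.BytesIO()
    with zipfile.ZipFile(child_buf, "w") as child:
        child.writestr("child.txt", "hello from child")
    parent_buf = io.BytesIO()
    with zipfile.ZipFile(parent_buf, "w", zipfile.ZIP_DEFLATED) as parent:
        parent.writestr("child.zip", child_buf.getvalue())
    return parent_buf.getvalue()


def open_inner_zip(outer, name):
    """Open the entry `name` of the ZipFile `outer` as a ZipFile of its own.

    The handle from outer.open() is used directly when it reports itself
    seekable; otherwise the entry is buffered in memory first.
    """
    handle = outer.open(name)
    if handle.seekable():
        return zipfile.ZipFile(handle)
    data = handle.read()
    handle.close()
    return zipfile.ZipFile(io.BytesIO(data))


if __name__ == "__main__":
    outer = zipfile.ZipFile(io.BytesIO(build_nested_zip()))
    inner = open_inner_zip(outer, "child.zip")
    print(inner.namelist())
```

Buffering with BytesIO remains the simpler choice when the inner archive is small, since seeking inside a compressed entry may force the data to be decompressed again from the start.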

To get this to work with Python 3.3 (under Windows, but that might be irrelevant) I had to do:

import io
import re
import zipfile

# `file` is the path to the outer zip archive
with zipfile.ZipFile(file, 'r') as zfile:
    for name in zfile.namelist():
        if re.search(r'\.zip$', name) is not None:
            zfiledata = io.BytesIO(zfile.read(name))
            with zipfile.ZipFile(zfiledata) as zfile2:
                for name2 in zfile2.namelist():
                    print(name2)

cStringIO does not exist in Python 3, so I used io.BytesIO.
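
For archives nested more than one level deep, the same BytesIO approach can be applied recursively. The following is only a sketch: iter_nested_members is a hypothetical helper name, it treats any member whose name ends in .zip as a nested archive, and it reads each member fully into memory, which may not suit very large files.

```python
import io
import zipfile


def iter_nested_members(source, prefix=""):
    """Yield (path, data) for every non-zip member, descending into nested zips.

    `source` is a filename or a seekable binary file object.
    """
    with zipfile.ZipFile(source) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            data = zf.read(info)
            path = prefix + info.filename
            if info.filename.lower().endswith(".zip"):
                # Buffer the inner archive in memory so ZipFile can seek on it.
                yield from iter_nested_members(io.BytesIO(data), path + "/")
            else:
                yield path, data
```

Calling it with "parent.zip" would yield pairs such as ("child.zip/child.txt", ...) when the archive is laid out as in the question.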

Here's a function I came up with. (Copied from here.)

import io
import os
import zipfile


def extract_nested_zipfile(path, parent_zip=None):
    """Returns a ZipFile specified by path, even if the path contains
    intermediary ZipFiles.  For example, /root/gparent.zip/parent.zip/child.zip
    will return a ZipFile that represents child.zip
    """

    def extract_inner_zipfile(parent_zip, child_zip_path):
        """Returns a ZipFile specified by child_zip_path that exists inside
        parent_zip.
        """
        # Buffer the inner archive in memory so that ZipFile can seek on it
        memory_zip = io.BytesIO(parent_zip.read(child_zip_path))
        return zipfile.ZipFile(memory_zip)

    if ('.zip' + os.sep) in path:
        (parent_zip_path, child_zip_path) = os.path.relpath(path).split(
            '.zip' + os.sep, 1)
        parent_zip_path += '.zip'

        if not parent_zip:
            # This is the top-level, so read from disk
            parent_zip = zipfile.ZipFile(parent_zip_path)
        else:
            # We're already in a zip, so pull it out and recurse
            parent_zip = extract_inner_zipfile(parent_zip, parent_zip_path)

        return extract_nested_zipfile(child_zip_path, parent_zip)
    else:
        if parent_zip:
            return extract_inner_zipfile(parent_zip, path)
        else:
            # If there is no nesting, it's easy!
            return zipfile.ZipFile(path)

Here's how I tested it:

echo hello world > hi.txt
zip wrap1.zip hi.txt
zip wrap2.zip wrap1.zip
zip wrap3.zip wrap2.zip

print(extract_nested_zipfile('/Users/mattfaus/dev/dev-git/wrap1.zip').open('hi.txt').read())
print(extract_nested_zipfile('/Users/mattfaus/dev/dev-git/wrap2.zip/wrap1.zip').open('hi.txt').read())
print(extract_nested_zipfile('/Users/mattfaus/dev/dev-git/wrap3.zip/wrap2.zip/wrap1.zip').open('hi.txt').read())
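
The shell setup above could also be reproduced from Python, which may help on systems without a zip command. This is a sketch under assumptions: build_wrapped_zips and read_through are illustrative names that are not part of the function above, and the archives are written to a temporary directory rather than the path used in the test.

```python
import io
import os
import tempfile
import zipfile


def build_wrapped_zips(directory, depth=3):
    """Create wrap1.zip .. wrap<depth>.zip in `directory`; return the outermost path."""
    inner_name = "hi.txt"
    inner_path = os.path.join(directory, inner_name)
    with open(inner_path, "wb") as f:
        f.write(b"hello world\n")
    for level in range(1, depth + 1):
        zip_name = "wrap%d.zip" % level
        zip_path = os.path.join(directory, zip_name)
        # Each archive wraps the file produced by the previous step.
        with zipfile.ZipFile(zip_path, "w") as zf:
            zf.write(inner_path, arcname=inner_name)
        inner_name, inner_path = zip_name, zip_path
    return inner_path


def read_through(outer_path, member_chain):
    """Follow `member_chain` (inner zip names, then the final file) and return its bytes."""
    source = outer_path
    for member in member_chain[:-1]:
        with zipfile.ZipFile(source) as zf:
            # Buffer each inner archive so the next ZipFile can seek on it.
            source = io.BytesIO(zf.read(member))
    with zipfile.ZipFile(source) as zf:
        return zf.read(member_chain[-1])


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        outer = build_wrapped_zips(tmp)
        print(read_through(outer, ["wrap2.zip", "wrap1.zip", "hi.txt"]))
```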
