
Efficiently read one file from a zip containing a lot of files in Python

I am storing an index in a compressed zip on disk and want to extract a single file from this zip. Doing this in Python seems to be very slow; is it possible to fix this?

import codecs
import zipfile

with zipfile.ZipFile("testoutput/index_doc.zip", mode='r') as myzip:
    with myzip.open("c0ibtxf_i.txt") as mytxt:
        txt = mytxt.read()
        txt = codecs.decode(txt, "utf-8")
        print(txt)

is the Python code I use. Running this script in Python takes a noticeably long time:

python3 testunzip.py  1.22s user 0.06s system 98% cpu 1.303 total

This is annoying, especially since I know it can go much faster:

unzip -p testoutput/index_doc.zip c0ibtxf_i.txt  0.01s user 0.00s system 69% cpu 0.023 total

As per request, here is the profiling output:

Ordered by: standard name

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1    0.051    0.051    1.492    1.492 <string>:1(<module>)
127740    0.043    0.000    0.092    0.000 cp437.py:14(decode)
     1    0.000    0.000    1.441    1.441 testunzip.py:69(toprofile)
     1    0.000    0.000    0.000    0.000 threading.py:72(RLock)
     1    0.000    0.000    0.000    0.000 utf_8.py:15(decode)
     1    0.000    0.000    0.000    0.000 zipfile.py:1065(__enter__)
     1    0.000    0.000    0.000    0.000 zipfile.py:1068(__exit__)
     1    0.692    0.692    1.441    1.441 zipfile.py:1085(_RealGetContents)
     1    0.000    0.000    0.000    0.000 zipfile.py:1194(getinfo)
     1    0.000    0.000    0.000    0.000 zipfile.py:1235(open)
     1    0.000    0.000    0.000    0.000 zipfile.py:1591(__del__)
     2    0.000    0.000    0.000    0.000 zipfile.py:1595(close)
     2    0.000    0.000    0.000    0.000 zipfile.py:1713(_fpclose)
     1    0.000    0.000    0.000    0.000 zipfile.py:191(_EndRecData64)
     1    0.000    0.000    0.000    0.000 zipfile.py:234(_EndRecData)
127739    0.180    0.000    0.220    0.000 zipfile.py:320(__init__)
127739    0.046    0.000    0.056    0.000 zipfile.py:436(_decodeExtra)
     1    0.000    0.000    0.000    0.000 zipfile.py:605(_check_compression)
     1    0.000    0.000    0.000    0.000 zipfile.py:636(_get_decompressor)
     1    0.000    0.000    0.000    0.000 zipfile.py:654(__init__)
     3    0.000    0.000    0.000    0.000 zipfile.py:660(read)
     1    0.000    0.000    0.000    0.000 zipfile.py:667(close)
     1    0.000    0.000    0.000    0.000 zipfile.py:708(__init__)
     1    0.000    0.000    0.000    0.000 zipfile.py:821(read)
     1    0.000    0.000    0.000    0.000 zipfile.py:854(_update_crc)
     1    0.000    0.000    0.000    0.000 zipfile.py:901(_read1)
     1    0.000    0.000    0.000    0.000 zipfile.py:937(_read2)
     1    0.000    0.000    0.000    0.000 zipfile.py:953(close)
     1    0.000    0.000    1.441    1.441 zipfile.py:981(__init__)
127740    0.049    0.000    0.049    0.000 {built-in method _codecs.charmap_decode}
     1    0.000    0.000    0.000    0.000 {built-in method _codecs.decode}
     1    0.000    0.000    0.000    0.000 {built-in method _codecs.utf_8_decode}
127743    0.058    0.000    0.058    0.000 {built-in method _struct.unpack}
127739    0.016    0.000    0.016    0.000 {built-in method builtins.chr}
     1    0.000    0.000    1.492    1.492 {built-in method builtins.exec}
     1    0.000    0.000    0.000    0.000 {built-in method builtins.hasattr}
     2    0.000    0.000    0.000    0.000 {built-in method builtins.isinstance}
255484    0.020    0.000    0.020    0.000 {built-in method builtins.len}
     1    0.000    0.000    0.000    0.000 {built-in method builtins.max}
     1    0.000    0.000    0.000    0.000 {built-in method builtins.min}
     1    0.000    0.000    0.000    0.000 {built-in method builtins.print}
     1    0.000    0.000    0.000    0.000 {built-in method io.open}
     2    0.000    0.000    0.000    0.000 {built-in method zlib.crc32}
     1    0.000    0.000    0.000    0.000 {function ZipExtFile.close at 0x101975620}
127741    0.011    0.000    0.011    0.000 {method 'append' of 'list' objects}
     1    0.000    0.000    0.000    0.000 {method 'close' of '_io.BufferedReader' objects}
127740    0.224    0.000    0.317    0.000 {method 'decode' of 'bytes' objects}
     1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
127739    0.024    0.000    0.024    0.000 {method 'find' of 'str' objects}
     1    0.000    0.000    0.000    0.000 {method 'get' of 'dict' objects}
     7    0.006    0.001    0.006    0.001 {method 'read' of '_io.BufferedReader' objects}
510956    0.071    0.000    0.071    0.000 {method 'read' of '_io.BytesIO' objects}
     8    0.000    0.000    0.000    0.000 {method 'seek' of '_io.BufferedReader' objects}
     4    0.000    0.000    0.000    0.000 {method 'tell' of '_io.BufferedReader' objects}

It seems to be something that happens in the constructor. Can I avoid this overhead somehow?

I figured out what the problem was:

  • Python's zipfile library builds a list of info objects, one for each file in the zip.
  • This makes zipfile quite fast once the archive has been loaded.
  • But when there are a lot of files in the zip and you only need a small portion of them each time you load the archive, the overhead of building that info list costs a lot of time, as the measurement sketch below illustrates.
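
Here is a minimal measurement sketch (using the same index_doc.zip and member name as in the question) that times the ZipFile constructor separately from reading one member; it should show that almost all of the time is spent in the constructor, matching the _RealGetContents line in the profile above.

import time
import zipfile

t0 = time.time()
# the constructor parses the whole central directory and builds an info object per entry
myzip = zipfile.ZipFile("testoutput/index_doc.zip", mode='r')
t1 = time.time()
# reading a single member afterwards is cheap
with myzip.open("c0ibtxf_i.txt") as mytxt:
    txt = mytxt.read()
t2 = time.time()
myzip.close()

print("constructor took %.3f s" % (t1 - t0))
print("reading one member took %.3f s" % (t2 - t1))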

To solve this, I adapted the source of Python's zipfile. It has all the default functionality you need, but when you give the constructor a list of the filenames to extract, it will not build the entire information list.

In the particular use case where you only need a few files from a zip, this makes a big difference in performance and memory usage.
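
As a side note, if the same process needs to look up many members over time, the constructor cost can also be amortised without touching zipfile, simply by constructing the ZipFile once and reusing it. A minimal sketch (independent of the adapted library, and only useful when the archive can stay open):

import zipfile

# pay the central-directory parsing cost once, then serve many lookups cheaply
archive = zipfile.ZipFile("testoutput/index_doc.zip", mode='r')

def read_member(name):
    with archive.open(name) as member:
        return member.read().decode("utf-8")

print(read_member("c0ibtxf_i.txt"))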

For the particular case in the example above (extracting only one file from a zip containing 128K files), the speed of the new implementation now approaches the speed of the unzip method.
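
If modifying zipfile is not an option, another possibility is to shell out to the same unzip -p command benchmarked above and decode its output. A sketch, assuming the unzip binary is available on PATH:

import subprocess

# stream a single member to stdout with the external unzip tool,
# bypassing Python's central-directory parsing entirely
raw = subprocess.check_output(
    ["unzip", "-p", "testoutput/index_doc.zip", "c0ibtxf_i.txt"])
txt = raw.decode("utf-8")
print(txt)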

A test case:

def original_zipfile(): 
    import zipfile  
    with zipfile.ZipFile("testoutput/index_doc.zip", mode='r') as myzip:
        with myzip.open("c6kn5pu_i.txt") as mytxt:
            txt = mytxt.read()

def my_zipfile():
    # zipfile2 is the adapted zipfile module described above; to_extract limits
    # which entries get an info object built for them
    import zipfile2
    with zipfile2.ZipFile("testoutput/index_doc.zip", to_extract=["c6kn5pu_i.txt"], mode='r') as myzip:
        with myzip.open("c6kn5pu_i.txt") as mytxt:
            txt = mytxt.read()


if __name__ == "__main__":
    import time

    time1 = time.time() 
    original_zipfile()
    print("running time of original_zipfile = "+str(time.time()-time1))
    time1 = time.time() 
    my_zipfile()
    print("running time of my_new_zipfile   = "+str(time.time()-time1))

    print(myStopwatch.getPretty())  # myStopwatch is my own timing helper (not shown here)

This resulted in the following time readings:

running time of original_zipfile = 1.0871901512145996
running time of my_new_zipfile   = 0.07036209106445312

I will include the source code, but note that there are two small flaws in my implementation once you give it an extract list (when you don't, the behaviour is the same as the standard zipfile):

  1. it assumes all filenames in the archive to be encoded with the same encoding (an optimisation I included for my own purposes)
  2. other functionality might be altered (for example, extractall might fail or only extract the files you gave to the constructor)

GitHub link
