What Is The Best Python Zip Module To Handle Large Files?

EDIT: Specifically compression and extraction speeds.

Any suggestions?

Thanks

So I made a random-ish large zipfile:

$ ls -l *zip
-rw-r--r--  1 aleax  5000  115749854 Nov 18 19:16 large.zip
$ unzip -l large.zip | wc
   23396   93633 2254735

i.e., 116 MB with about 23.4K files in it, and timed things:

$ time unzip -d /tmp large.zip >/dev/null

real    0m14.702s
user    0m2.586s
sys     0m5.408s

This is the system-supplied command-line unzip binary, no doubt as finely tuned and optimized as a pure C executable can be. Then (after cleaning up /tmp ;-)...:

$ time py26 -c'from zipfile import ZipFile; z=ZipFile("large.zip"); z.extractall("/tmp")'

real    0m13.274s
user    0m5.059s
sys     0m5.166s

...and this is Python with its standard library: a bit more demanding of CPU time (5.1 s user versus 2.6 s), but roughly 10% faster in real, that is, elapsed time (13.3 s versus 14.7 s).

You're welcome to repeat such measurements, of course, on your specific platform (if it's CPU-poor, e.g. a slow ARM chip, then Python's extra demands of CPU time may end up making it slower) and on your specific zipfiles of interest, since each large zipfile will have a very different mix of contents and quite possibly different performance. But what this suggests to me is that there isn't much room to build a Python extension much faster than good old zipfile, since Python using it beats the pure-C, system-included unzip!-)
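If you'd rather repeat the measurement from inside Python than with the shell's time, something like the following could work. It is only a sketch written for current Python 3 (not the py26 used above); the function name time_extractall is made up for illustration, and it measures wall-clock time only, not the user/sys split.

```python
import tempfile
import time
import zipfile


def time_extractall(zip_path):
    """Return the wall-clock seconds taken to extract zip_path.

    time_extractall is a hypothetical helper name. The archive is
    extracted into a throwaway temporary directory that is removed
    again when the function returns.
    """
    with tempfile.TemporaryDirectory() as dest:
        start = time.perf_counter()
        with zipfile.ZipFile(zip_path) as zf:
            zf.extractall(dest)
        return time.perf_counter() - start


# Example use (large.zip is the archive from the transcript above):
# print("%.3f s" % time_extractall("large.zip"))
```

Run it a few times and compare against the command-line unzip on the same archive; disk caching can change the numbers noticeably between the first and later runs.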

For handling large files without loading them into memory, use the new stream-based methods in Python 2.6's version of zipfile, such as ZipFile.open. Don't use extract or extractall unless you have strongly sanitised the filenames in the ZIP.
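One way to read "strongly sanitised" is to check every member name before writing anything, and to copy the data out through ZipFile.open yourself. The sketch below is written for current Python 3 and is an assumption about what a reasonable check looks like, not an official recipe; safe_extract is a hypothetical name.

```python
import os
import shutil
import zipfile


def safe_extract(zip_path, dest_dir):
    """Extract all members, refusing any name that escapes dest_dir.

    safe_extract is a hypothetical helper. All names are checked
    first, so nothing is written if any member is rejected.
    """
    dest_root = os.path.realpath(dest_dir)
    with zipfile.ZipFile(zip_path) as zf:
        planned = []
        for info in zf.infolist():
            # Absolute names and ".." components resolve outside dest_root.
            target = os.path.realpath(os.path.join(dest_root, info.filename))
            if os.path.commonpath([dest_root, target]) != dest_root:
                raise ValueError("unsafe member name: %r" % info.filename)
            planned.append((info, target))

        for info, target in planned:
            if info.is_dir():
                os.makedirs(target, exist_ok=True)
                continue
            os.makedirs(os.path.dirname(target), exist_ok=True)
            # Stream the member to disk in chunks instead of reading it whole.
            with zf.open(info) as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
```

This does not restore permissions or timestamps, and it does not guard against things like enormous decompressed sizes, so treat it as a starting point for untrusted archives rather than a complete defence.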

(You used to have to read all the bytes into memory, or hack around it with something like zipstream; this is now obsolete.)
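To illustrate the stream-based approach, here is a minimal sketch (current Python 3) that hashes one member of an archive by reading it through ZipFile.open in fixed-size chunks, so the whole member never sits in memory at once. The name sha256_of_member and the 1 MiB default chunk size are arbitrary choices for the example.

```python
import hashlib
import zipfile


def sha256_of_member(zip_path, member_name, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of one archive member.

    sha256_of_member is a hypothetical helper; the member is
    decompressed and consumed chunk_size bytes at a time.
    """
    digest = hashlib.sha256()
    with zipfile.ZipFile(zip_path) as zf:
        with zf.open(member_name) as stream:
            # iter() with a sentinel keeps calling read() until it returns b"".
            for chunk in iter(lambda: stream.read(chunk_size), b""):
                digest.update(chunk)
    return digest.hexdigest()
```

The same loop works for any consumer that accepts data incrementally, such as writing to a socket or feeding a parser.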
