
How does one achieve parallel gzip compression with Python?

Big file compression with python gives a very nice example of how to use e.g. bz2 to compress a very large set of files (or a big file) purely in Python.

pigz says you can do better by exploiting parallel compression. To my knowledge (and after a Google search), I cannot find a Python equivalent in pure Python code.

Is there a parallel Python implementation of pigz, or an equivalent?

I don't know of a pigz interface for Python off-hand, but it might not be that hard to write if you really need it. Python's zlib module allows compressing arbitrary chunks of bytes, and the pigz man page already describes the system for parallelizing the compression, as well as the output format.

If you really need parallel compression, it should be possible to implement a pigz equivalent using zlib to compress chunks, with the work farmed out via multiprocessing.dummy.Pool.imap (multiprocessing.dummy is the thread-backed version of the multiprocessing API, so you wouldn't incur massive IPC costs sending chunks to and from the workers). Since zlib is one of the few built-in modules that releases the GIL during CPU-bound work, you might actually gain a benefit from thread-based parallelism.
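A minimal sketch of that approach might look like the following. The names parallel_gzip and _compress_chunk are my own, and unlike real pigz this sketch does not prime each chunk's compressor with a dictionary from the previous chunk, so the compression ratio will be slightly worse:

```python
import struct
import zlib
from multiprocessing.dummy import Pool  # thread-backed Pool, no IPC overhead


def _compress_chunk(args):
    chunk, is_last = args
    # Negative wbits -> raw deflate stream (no zlib header/trailer).
    comp = zlib.compressobj(6, zlib.DEFLATED, -zlib.MAX_WBITS)
    data = comp.compress(chunk)
    # Z_SYNC_FLUSH ends the block on a byte boundary without marking it
    # final, so the per-chunk streams can simply be concatenated; the last
    # chunk is flushed with Z_FINISH to terminate the deflate stream.
    data += comp.flush(zlib.Z_FINISH if is_last else zlib.Z_SYNC_FLUSH)
    return data


def parallel_gzip(payload: bytes, chunk_size=128 * 1024, threads=4) -> bytes:
    chunks = [payload[i:i + chunk_size]
              for i in range(0, len(payload), chunk_size)] or [b""]
    jobs = [(c, i == len(chunks) - 1) for i, c in enumerate(chunks)]
    with Pool(threads) as pool:
        compressed = pool.map(_compress_chunk, jobs)
    # gzip wrapper: 10-byte header (magic, deflate method, no flags,
    # zero mtime, unknown OS), then the deflate body, then the trailer
    # (CRC-32 and uncompressed length modulo 2**32, little-endian).
    header = b"\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\xff"
    trailer = struct.pack("<II",
                          zlib.crc32(payload) & 0xFFFFFFFF,
                          len(payload) & 0xFFFFFFFF)
    return header + b"".join(compressed) + trailer
```

Note the CRC-32 here is still computed serially over the whole payload; parallelizing it needs crc32_combine(), which the second answer below discusses.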

Note that in practice, when the compression level isn't turned up that high, I/O is often of similar cost (within an order of magnitude or so) to the actual zlib compression; if your data source can't actually feed the threads faster than they compress, you won't gain much from parallelizing.

mgzip is able to achieve this:

Using a block-indexed GZIP file format to enable compression and decompression in parallel. This implementation uses 'FEXTRA' to record the index of each compressed member, which is defined in the official GZIP file format specification version 4.3, so it is fully compatible with normal GZIP implementations.

import mgzip

num_cpus = 0 # will use all available CPUs

with open('original_file.txt', 'rb') as original, mgzip.open(
    'gzipped_file.txt.gz', 'wb', thread=num_cpus, blocksize=2 * 10 ** 8
) as fw:
    fw.write(original.read())

I was able to speed up compression from 45 min to 5 min on a 72-CPU server.

You can use the flush() operation with Z_SYNC_FLUSH to complete the last deflate block and end it on a byte boundary. You can concatenate those to make a valid deflate stream, so long as the last piece you concatenate is flushed with Z_FINISH (which is the default for flush()).
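A quick demonstration of that concatenation property, using raw deflate streams (negative wbits) compressed by two independent compressor objects:

```python
import zlib

# First piece: ended with Z_SYNC_FLUSH, so it stops on a byte boundary
# but does not set the final-block bit.
c1 = zlib.compressobj(wbits=-15)
part1 = c1.compress(b"hello ") + c1.flush(zlib.Z_SYNC_FLUSH)

# Last piece: flush() defaults to Z_FINISH, which terminates the stream.
c2 = zlib.compressobj(wbits=-15)
part2 = c2.compress(b"world") + c2.flush()

# The concatenation is itself a valid deflate stream.
d = zlib.decompressobj(wbits=-15)
assert d.decompress(part1 + part2) == b"hello world"
```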

You would also want to compute the CRC-32 in parallel (whether for zip or gzip -- I think you really mean parallel gzip compression). Python does not provide an interface to zlib's crc32_combine() function. However, you can copy the code from zlib and convert it to Python. It will be fast enough that way, since it doesn't need to be run often. You can also pre-build the tables you need to make it faster, or even pre-build a matrix for a fixed block length.
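A straightforward pure-Python port of zlib's crc32_combine() (following the GF(2) matrix-squaring approach of the C implementation) could look like this; crc32_combine(crc1, crc2, len2) returns the CRC-32 of the concatenation of two blocks, given each block's own CRC and the length of the second:

```python
def _gf2_matrix_times(mat, vec):
    """Multiply a 32x32 GF(2) matrix (list of 32 column ints) by a vector."""
    total = 0
    i = 0
    while vec:
        if vec & 1:
            total ^= mat[i]
        vec >>= 1
        i += 1
    return total


def _gf2_matrix_square(square, mat):
    """square = mat * mat over GF(2)."""
    for n in range(32):
        square[n] = _gf2_matrix_times(mat, mat[n])


def crc32_combine(crc1, crc2, len2):
    """Combine CRC-32s of two sequential blocks, where len2 is the
    byte length of the second block."""
    if len2 <= 0:
        return crc1
    even = [0] * 32  # operator for 2**n zero bits, even n
    odd = [0] * 32   # operator for 2**n zero bits, odd n
    # Operator for one zero bit: the reflected CRC-32 polynomial
    # plus shifted identity rows.
    odd[0] = 0xEDB88320
    row = 1
    for n in range(1, 32):
        odd[n] = row
        row <<= 1
    _gf2_matrix_square(even, odd)  # operator for two zero bits
    _gf2_matrix_square(odd, even)  # operator for four zero bits
    # Apply len2 zero bytes to crc1, squaring the operator each iteration.
    while True:
        _gf2_matrix_square(even, odd)
        if len2 & 1:
            crc1 = _gf2_matrix_times(even, crc1)
        len2 >>= 1
        if len2 == 0:
            break
        _gf2_matrix_square(odd, even)
        if len2 & 1:
            crc1 = _gf2_matrix_times(odd, crc1)
        len2 >>= 1
        if len2 == 0:
            break
    return crc1 ^ crc2
```

With this, each worker can CRC its own chunk and the results can be folded together afterwards, instead of hashing the whole payload serially.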

