
Python - Unzip .gz files in parallel

I have multiple .gz files that add up to 1TB in total. How can I utilize Python 2.7 to unzip these files in parallel? Looping over the files one by one takes too much time.

I tried this code as well:

import glob, gzip, multiprocessing, shutil

filenames = [gz for gz in glob.glob(filesFolder + '*.gz')]

def uncompress(path):
    with gzip.open(path, 'rb') as src, open(path.rstrip('.gz'), 'wb') as dest:
        shutil.copyfileobj(src, dest)

with multiprocessing.Pool() as pool:
    for _ in pool.imap_unordered(uncompress, filenames, chunksize=1):
        pass

However I get the following error:

  with multiprocessing.Pool() as pool:

AttributeError: __exit__

Thanks!

To use the with construct, the object used inside must have __enter__ and __exit__ methods. The error says that the Pool class (or instance) doesn't have these (multiprocessing.Pool only became a context manager in Python 3.3), so you can't use it in a with statement on Python 2.7. Try this (just removed the with statement):

import glob, gzip, multiprocessing, shutil

filenames = [gz for gz in glob.glob('*.gz')]  # all .gz files in the current directory

def uncompress(path):
    # path[:-3] drops the '.gz' suffix; rstrip('.gz') would strip any trailing
    # '.', 'g' or 'z' characters and could mangle the output filename
    with gzip.open(path, 'rb') as src, open(path[:-3], 'wb') as dest:
        shutil.copyfileobj(src, dest)


for _ in multiprocessing.Pool().imap_unordered(uncompress, filenames, chunksize=1):
    pass
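
If you would rather keep a with block on Python 2.7, contextlib.closing can wrap the pool: it supplies the missing __exit__ and calls pool.close() when the block ends, after which you join the pool to wait for the workers. A minimal sketch under that assumption, reusing the uncompress function and filenames list from above:

import contextlib, multiprocessing

# closing() gives the pool the __exit__ it lacks on Python 2.7 and calls
# pool.close() when the block ends; join() then waits for the worker processes.
with contextlib.closing(multiprocessing.Pool()) as pool:
    for _ in pool.imap_unordered(uncompress, filenames, chunksize=1):
        pass
pool.join()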

EDIT

I agree with @dhke: unless all (or most) of the .gz files are physically located next to each other on disk, the frequent reads from different locations (which happen more often when using multiprocessing) will be slower than processing the files one by one (serially).
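
If disk seeks do dominate, one thing worth experimenting with is the pool size: a small number of workers still overlaps decompression with I/O without issuing as many concurrent reads as one process per core would. A rough sketch (the worker count of 2 is only an example to tune against your own storage), again reusing uncompress and filenames from above:

import multiprocessing

# Use fewer workers than CPU cores: with ~1TB of input the disk is likely the
# bottleneck, so limit how many processes read from it at the same time.
pool = multiprocessing.Pool(processes=2)
try:
    for _ in pool.imap_unordered(uncompress, filenames, chunksize=1):
        pass
finally:
    pool.close()
    pool.join()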
