
Read, compress, write with multiprocessing

I'm compressing files. A single process is fine for a few of them, but I'm compressing thousands, and this can (and has) taken several days, so I'd like to speed it up with multiprocessing. I've read that I should avoid having multiple processes reading files at the same time, and I'm guessing I shouldn't have multiple processes writing at once either. This is my current method, which runs as a single process:

import tarfile, bz2, os
def compress(folder):
    "compresses a folder into a file"

    bz_file = bz2.BZ2File(folder+'.tbz', 'w')

    with tarfile.open(mode='w', fileobj = bz_file) as tar:

        for fn in os.listdir(folder):

            # read each file in the folder and do some preprocessing
            # that will make the compressed file much smaller than without it,
            # then add the processed file to the archive:
            # tar.addfile( processed file )

    bz_file.close()
    return
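For reference, a minimal sketch of what naively handing this function to a pool might look like (main_folder here is an assumed path, not something defined above); this is the "tossing it into a pool" approach discussed next:

import multiprocessing as mp
import os

main_folder = 'path/to/folders'  # assumed location of the subfolders

if __name__ == '__main__':
    folders = [os.path.join(main_folder, d) for d in os.listdir(main_folder)]
    with mp.Pool(mp.cpu_count()) as pool:
        # naive approach: every worker reads and writes at the same time,
        # which is exactly what I want to avoid
        pool.map(compress, folders)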

This takes a folder and compresses all of its contents into a single file, which makes them easier to handle and more organized. If I just tossed this into a pool, I'd have several processes reading and writing all at once, so I want to avoid that. I can rework it so that only one process reads the files, but I would still have multiple processes writing:

import multiprocessing as mp
import tarfile, bz2, os

def compress(file_list):
    folder = file_list[0]
    bz_file = bz2.BZ2File(folder+'.tbz', 'w')

    with tarfile.open(mode='w', fileobj = bz_file) as tar:

        for i in file_list[1:]:
            # preprocess the file data, then add it to the archive:
            # tar.addfile(processed data)

    bz_file.close()
    return

cpu_count = mp.cpu_count()
p = mp.Pool(cpu_count)

for subfolder in os.listdir(main_folder):

    # read all files in subfolder into memory and place them into file_list,
    # collect file_lists into fld_list until it holds cpu_count of them,
    # then pass the batch to p.map(compress, fld_list)

This still has a number of processes writing compressed files at once. Just the act of telling tarfile what kind of compression to use starts writing to the hard drive. I can't read all of the files I need to compress into memory, since I don't have that much RAM, so this approach also has the problem that I'm restarting Pool.map many times.

How can I read and write files in a single process, yet do all of the compression in several processes, while avoiding restarting multiprocessing.Pool multiple times?

Instead of using multiprocessing.Pool, use multiprocessing.Queue and create an inbox and an outbox.
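The whole pipeline below is built on one idiom: a bounded queue plus a None sentinel that means "no more work". As a minimal, standalone sketch of that pattern (the worker and the toy items are made up for illustration):

import multiprocessing as mp

def worker(q):
    # iter(q.get, None) keeps pulling items until a None sentinel arrives
    for item in iter(q.get, None):
        print('processing', item)

if __name__ == '__main__':
    q = mp.Queue(8)                      # maxsize bounds how much sits in RAM
    p = mp.Process(target=worker, args=(q,))
    p.start()
    for item in range(20):
        q.put(item)                      # blocks while the queue is full
    q.put(None)                          # tell the worker to stop
    p.join()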

Start a single process that reads in the files and places the data into the inbox queue, and put a limit on the size of the queue so you don't end up filling your RAM. The example here compresses single files, but it can be adjusted to handle whole folders at once.

def reader(inbox, input_path, num_procs):
    "process that reads in files to be compressed and puts to inbox"

    for fn in os.listdir(input_path):
        path = os.path.join(input_path, fn)

        # read in each file, put data into inbox
        fname = os.path.basename(fn)
        with open(path, 'r') as src: lines = src.readlines()

        data = [fname, lines]
        inbox.put(data)

    # read in everything, add finished notice for all running processes
    for i in range(num_procs):
        inbox.put(None)  # when a compressor sees a None, it will stop
    inbox.close()
    return

But that's only half of the problem; the other half is compressing the file without having to write it to disk. We give the compression function an in-memory buffer instead of an open file; since tarfile writes bytes, a BytesIO object is used, and it is passed to tarfile. Once compressed, we put the result into the outbox queue.

Except we can't do that directly, because BytesIO objects can't be pickled, and only picklable objects can go into a queue. However, the buffer's getvalue method returns its contents as a plain bytes object, which is picklable, so grab the contents with getvalue, close the BytesIO object, and then put the contents into the outbox.
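A quick illustration of why the getvalue step is needed (a toy check, not part of the original code):

import pickle
from io import BytesIO

buf = BytesIO(b'compressed bytes would go here')
try:
    pickle.dumps(buf)            # file-like objects generally can't be pickled
except TypeError as e:
    print('cannot pickle the buffer:', e)

payload = buf.getvalue()         # plain bytes pickle (and queue) just fine
print(len(pickle.dumps(payload)), 'bytes after pickling')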

from io import BytesIO
import tarfile

def compressHandler(inbox, outbox):
    "process that pulls from inbox, compresses and puts to outbox"
    supplier = iter(inbox.get, None)  # stops when gets a None
    while True:
        try:
            data = next(supplier)  # grab data from inbox
            pressed = compress(data)  # compress it
            outbox.put(pressed)  # put into outbox
        except StopIteration:
            outbox.put(None)  # finished compressing, inform the writer
            return  # and quit

def compress(data):
    "compress file"
    bz_file = BytesIO()

    fname, lines = data  # see reader def for package order
    payload = ''.join(lines).encode('utf-8')  # tarfile works on bytes

    with tarfile.open(mode='w:bz2', fileobj=bz_file) as tar:

        info = tarfile.TarInfo(fname)  # store file name
        info.size = len(payload)  # tarfile needs the member size up front
        tar.addfile(info, BytesIO(payload))  # compress

    compressed = bz_file.getvalue()
    bz_file.close()
    return fname, compressed  # the writer needs the name as well as the data

The writer process then extracts the contents from the outbox queue and writes them to disk. This function needs to know how many compression processes were started, so that it only stops once it has heard that every one of them has finished.

def writer(outbox, output_path, num_procs):
    "single process that writes compressed files to disk"
    num_fin = 0

    while True:
        # all compression processes have finished
        if num_fin >= num_procs: break

        tardata = outbox.get()

        # a compression process has finished
        if tardata is None:
            num_fin += 1
            continue

        fn, data = tardata
        name = os.path.join(output_path, fn) + '.tbz'

        with open(name, 'wb') as dst: dst.write(data)
    return

Finally, there is the setup that puts it all together:

import multiprocessing as mp
import os

def setup():
    fld = 'file/path'

    # multiprocess setup
    num_procs = mp.cpu_count()

    # inbox and outbox queues
    inbox = mp.Queue(4*num_procs)  # limit size 
    outbox = mp.Queue()

    # one process to read
    reader_proc = mp.Process(target=reader, args=(inbox, fld, num_procs))
    reader_proc.start()

    # n processes to compress
    compressors = [mp.Process(target=compressHandler, args=(inbox, outbox))
                   for i in range(num_procs)]
    for c in compressors: c.start()

    # one process to write
    writer_proc = mp.Process(target=writer, args=(outbox, fld, num_procs))
    writer_proc.start()
    writer_proc.join()  # the writer only finishes after every compressor has
    print('done!')
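One caveat when running this: on platforms using the spawn start method (Windows, and macOS on recent Python versions), each child process re-imports the module, so the entry point should be guarded. A minimal way to invoke setup() would be:

if __name__ == '__main__':
    setup()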
