How can I use a Pool or a Queue to optimize a large batch process?

Question

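I'm trying to run a function on every row of a CSV file as fast as possible. My code works, but I know it could be faster if I made better use of the multiprocessing library.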

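import csv
from multiprocessing import Process

processes = []

def execute_task(task_details):
    # work is done here, may take 1 second, may take 10
    # send output to another function
    pass

# csv.reader needs a text-mode file opened with newline='' on Python 3
with open('twentyThousandLines.csv', newline='') as file:
    r = csv.reader(file)
    for row in r:
        # one new process is started per row
        p = Process(target=execute_task, args=(row,))
        processes.append(p)
        p.start()

for p in processes:
    p.join()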

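I was thinking I should put the tasks into a Queue and process them with a Pool, but all the examples make it look like Queue doesn't work the way I assume, and that I can't map a Pool onto an ever-expanding Queue.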

Answer 1

I've done something similar using a Pool of workers.

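    import csv
    from multiprocessing import Pool, cpu_count

    def initializer(arg1, arg2):
        # Do something to initialize each worker (if necessary)
        pass

    def process_csv_data(data):
        # Do something with the data
        pass

    if __name__ == "__main__":
        arg1, arg2 = None, None  # placeholders for whatever the initializer needs
        pool = Pool(cpu_count(), initializer=initializer, initargs=(arg1, arg2))

        with open("csv_data_file.csv", newline="") as f:
            csv_obj = csv.reader(f)
            for row in csv_obj:
                pool.apply_async(process_csv_data, (row,))

        pool.close()  # no more tasks will be submitted
        pool.join()   # wait for the queued tasks to finish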

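However, as pvg commented under your question, you may want to think about how to batch your data. Going row by row may not be the right level of granularity.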
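As a rough illustration of the batching idea above, the sketch below groups rows into lists and submits one task per list instead of one task per row. The names `batched` and `process_batch`, the batch size of 2, and the in-memory sample data are all assumptions made for the example; the real work function and a sensible batch size depend on your data.

```python
import csv
import io
from itertools import islice
from multiprocessing import Pool


def batched(rows, size):
    """Yield successive lists of at most `size` rows from any iterable."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def process_batch(batch):
    """Hypothetical stand-in for the real work: handles a whole batch per task."""
    return [len(row) for row in batch]


if __name__ == "__main__":
    # In-memory stand-in for the CSV file, so the sketch runs on its own.
    sample = io.StringIO("a,b\nc,d\ne,f\ng,h\ni,j\n")
    with Pool(2) as pool:
        # Each task now carries a list of rows, which cuts per-task overhead.
        for result in pool.imap(process_batch, batched(csv.reader(sample), 2)):
            print(result)
```

Larger batches reduce the cost of passing data between processes, but very large ones can leave some workers idle near the end, so the size is worth measuring rather than guessing.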

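You may also want to profile/test to find out where the bottleneck is. For example, if disk access is what limits you, parallelizing may not gain you anything.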
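One simple way to look for the bottleneck is to time the reading and the per-row work separately before adding any parallelism. The sketch below is only an assumption about how that could look: `time_read`, `time_work`, and `fake_work` are hypothetical helpers, and the in-memory sample stands in for the real file.

```python
import csv
import io
import time


def time_read(file_obj):
    """Parse every row without doing any work on it; return rows and seconds."""
    start = time.perf_counter()
    rows = list(csv.reader(file_obj))
    return rows, time.perf_counter() - start


def time_work(rows, work):
    """Run `work` on every row sequentially; return elapsed seconds."""
    start = time.perf_counter()
    for row in rows:
        work(row)
    return time.perf_counter() - start


def fake_work(row):
    """Hypothetical placeholder for the real per-row task."""
    time.sleep(0.001)


if __name__ == "__main__":
    sample = io.StringIO("a,b\n" * 100)  # stand-in for the real CSV file
    rows, read_seconds = time_read(sample)
    work_seconds = time_work(rows, fake_work)
    print(f"reading: {read_seconds:.4f}s, working: {work_seconds:.4f}s")
```

If reading dominates, more worker processes are unlikely to help much. If the work dominates, it is the part worth spreading across a pool.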

multiprocessing.Queue is a means of exchanging objects between processes, so it is not something you would put tasks into.
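To make the role of the queue concrete, here is a minimal sketch in which worker processes use a `multiprocessing.Queue` to send result objects back to the parent. The `worker` function, the `None` sentinel convention, and the sample chunks are assumptions made for the illustration, not part of the original code.

```python
from multiprocessing import Process, Queue


def worker(rows, results):
    """Process some rows and send each result back through the queue."""
    for row in rows:
        results.put((row, len(row)))  # len(row) is a placeholder result
    results.put(None)  # sentinel: this worker has finished


if __name__ == "__main__":
    results = Queue()
    chunks = [[["a", "b"], ["c"]], [["d", "e", "f"]]]  # sample data
    workers = [Process(target=worker, args=(chunk, results)) for chunk in chunks]
    for w in workers:
        w.start()

    # Drain the queue before joining, counting one sentinel per worker.
    finished = 0
    while finished < len(workers):
        item = results.get()
        if item is None:
            finished += 1
        else:
            print(item)

    for w in workers:
        w.join()
```

Here the queue carries data between processes, while deciding which worker handles which rows is done separately. A Pool takes care of that distribution for you.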

Answer 2

It seems to me that what you are actually trying to speed up is

import csv

def check(row):
    # do the checking
    result_of_check = None  # placeholder for the real result
    return (row, result_of_check)

with open('twentyThousandLines.csv', newline='') as file:
    r = csv.reader(file)
    for row, result in map(check, r):
        print(row, result)

which can be done with

import csv
#from multiprocessing import Pool # if CPU-bound (but even then not always)
from multiprocessing.dummy import Pool # if IO-bound


def check(row):
    # do the checking
    result_of_check = None  # placeholder for the real result
    return (row, result_of_check)

if __name__ == "__main__": # in case you are using processes on Windows
    with open('twentyThousandLines.csv', newline='') as file:
        r = csv.reader(file)
        with Pool() as p: # before Python 3.3 you should call close() and join() explicitly
            # chunksize=10 is just a guess - you have to experiment a bit to find the best value
            for row, result in p.imap_unordered(check, r, chunksize=10):
                print(row, result)

Creating processes takes some time (especially on Windows), so in most cases using threads via multiprocessing.dummy is faster (and multiprocessing is not entirely trivial either; see the guidelines).
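Whether threads or processes win for a given task is easiest to settle by measuring. The sketch below times the same `imap_unordered` call with both pool types; the `timed_run` helper, the sleep that simulates I/O-bound work, the worker count, and the sample rows are assumptions made for the example.

```python
import time
from multiprocessing import Pool as ProcessPool
from multiprocessing.dummy import Pool as ThreadPool


def check(row):
    """Hypothetical check that mostly waits, as an I/O-bound task would."""
    time.sleep(0.01)
    return row, len(row)


def timed_run(pool_factory, rows, workers=4):
    """Run check() over rows with the given pool type; return results and seconds."""
    start = time.perf_counter()
    with pool_factory(workers) as pool:
        results = list(pool.imap_unordered(check, rows, chunksize=10))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    rows = [["x", str(i)] for i in range(200)]
    for name, factory in (("threads", ThreadPool), ("processes", ProcessPool)):
        results, elapsed = timed_run(factory, rows)
        print(f"{name}: {len(results)} rows in {elapsed:.2f}s")
```

The timings depend heavily on the platform and on what `check` really does, so treat the numbers as a guide for your own workload rather than a general result.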

Answer 1
0 2016-12-16 00:31:34

Answer 2
0 2016-12-16 10:38:47
