
Python multiprocessing Pool.map not faster than calling the function once

I have a very large list of strings (originally from a text file) that I need to process using Python. Eventually I am trying to go for a map-reduce style of parallel processing.

I have written a "mapper" function and fed it to multiprocessing.Pool.map(), but it takes the same amount of time as simply calling the mapper function with the full set of data. I must be doing something wrong.

I have tried multiple approaches, all with similar results.

from multiprocessing import Pool

def initial_map(lines):
    results = []
    for line in lines:
        processed = line  # placeholder: the real per-line processing (an O(1) operation) is omitted here
        results.append(processed)
    return results

def chunks(l, n):
    # yield successive sublists of l, each of length n
    for i in xrange(0, len(l), n):
        yield l[i:i+n]

if __name__ == "__main__":
    lines = list(open("../../log.txt", 'r'))
    pool = Pool(processes=8)
    partitions = chunks(lines, len(lines)/8)
    results = pool.map(initial_map, partitions, 1)

So the chunks function makes a list of sublists of the original set of lines to give to pool.map(); it should then hand these 8 sublists to 8 different processes and run them through the mapper function. When I run this I can see all 8 of my cores peak at 100%. Yet it takes 22-24 seconds.
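For clarity, here is a toy run of the chunks helper (hypothetical input, not the actual log lines):

>>> list(chunks(['a', 'b', 'c', 'd', 'e'], 2))
[['a', 'b'], ['c', 'd'], ['e']]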

When I simply run this (single process/thread):

lines = list(open("../../log.txt", 'r'))
results = initial_map(lines)

It takes about the same amount of time, ~24 seconds. I only see one process getting to 100% CPU.

I have also tried letting the pool split up the lines itself and having the mapper function handle only one line at a time, with similar results.

from multiprocessing import Pool

def initial_map(line):
    processed = line  # placeholder: the real O(1) per-line processing is omitted here
    return processed

if __name__ == "__main__":
    lines = list(open("../../log.txt", 'r'))
    pool = Pool(processes=8)
    pool.map(initial_map, lines)

~22 seconds.

Why is this happening? Parallelizing this should make it run faster, shouldn't it?

If the amount of work done in one iteration is very small, you're spending a big proportion of the time just communicating with your subprocesses, which is expensive. Instead, try to pass bigger slices of your data to the processing function. Something like the following:

slices = (data[i:i+100] for i in range(0, len(data), 100))

def process_slice(data):
    # apply the question's per-line function to a whole slice at once
    return [initial_map(x) for x in data]

pool.map(process_slice, slices)

# and then itertools.chain the output to flatten it

(I don't have my computer with me, so I can't give you a full working solution nor verify what I said.)

Edit: or see the 3rd comment on your question by @ubomb.
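For reference, a minimal self-contained sketch of the sliced approach (the per-line work here is a stand-in, since the real processing is elided in the question) might look like this:

import itertools
from multiprocessing import Pool

def initial_map(line):
    return line.strip()  # stand-in for the real O(1) per-line processing

def process_slice(slice_of_lines):
    # run the per-line function over a whole slice in one task,
    # so each inter-process message carries 100 lines instead of 1
    return [initial_map(line) for line in slice_of_lines]

if __name__ == "__main__":
    with open("../../log.txt", 'r') as f:
        lines = list(f)
    slices = [lines[i:i+100] for i in range(0, len(lines), 100)]
    pool = Pool(processes=8)
    per_slice_results = pool.map(process_slice, slices)
    # flatten the list of per-slice lists back into one flat result list
    results = list(itertools.chain.from_iterable(per_slice_results))

Batching 100 lines per task amortizes the pickling and inter-process communication cost over 100 items; Pool.map's optional chunksize argument achieves a similar effect without manual slicing.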
