
How to change number of parallel processes?

I have a python script which runs a method in parallel.

from multiprocessing import Pool

parsers = {
    'parser1': parser1.process,
    'parser2': parser2.process
}

def process_items((key, value)):
    parsers[key](value)

pool = Pool(4)
pool.map(process_items, items)

process_items is my method and items is a list of tuples, with two elements in each tuple. The items list has around 100k items.

process_items will then call a method depending on what parameters are given. My problem is that maybe 70% of the list can run with high parallelism, but the other 30% can only run with one or two threads, otherwise it will cause a failure outside of my control.

So in my code I have around 10 different parser processes. For, say, parsers 1-8 I want to run with Pool(4), but for parsers 9-10, Pool(2).

What is the best way to optimise this?

I think your best option is to use two pools here:

from multiprocessing import Pool
# import parsers here

parsers = {
    'parser1': parser1.process,
    'parser2': parser2.process,
    'parser3': parser3.process,
    'parser4': parser4.process,
    'parser5': parser5.process,
    'parser6': parser6.process,
    'parser7': parser7.process,
}

# Sets that define which items can use high parallelism,
# and which must use low
high_par = {"parser1", "parser3", "parser4", "parser6", "parser7"}
low_par = {"parser2", "parser5"}

def process_items((key, value)):
    # pool.map passes each (key, value) tuple as a single argument
    parsers[key](value)

def run_pool(func, items, num_procs, check_set):
    # Run only the items whose parser name is in check_set,
    # using a pool of num_procs worker processes
    pool = Pool(num_procs)
    out = pool.map(func, (item for item in items if item[0] in check_set))
    pool.close()
    pool.join()
    return out

if __name__ == "__main__":
    items = [('parser2', x), ...] # Your list of tuples
    # Process with high parallelism
    high_results = run_pool(process_items, items, 4, high_par)
    # Process with low parallelism
    low_results = run_pool(process_items, items, 2, low_par)

Trying to do this in one Pool is possible, through clever use of synchronization primitives, but I don't think it will end up looking much cleaner than this. It could also end up running less efficiently, since sometimes your pool will need to wait around for work to finish so it can process a low-parallelism item, even when high-parallelism items are available behind it in the queue.
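For concreteness, here is a minimal sketch of what that single-Pool variant might look like, assuming the parsers and low_par definitions above; init_worker and _low_sem are hypothetical names, and a shared BoundedSemaphore caps the low-parallelism parsers at two concurrent workers:

from multiprocessing import Pool, BoundedSemaphore

# parsers and low_par as defined above

_low_sem = None

def init_worker(sem):
    # Runs once in each worker process to store the shared semaphore
    global _low_sem
    _low_sem = sem

def process_items((key, value)):
    if key in low_par:
        # At most 2 workers may run a low-parallelism parser at once.
        # A worker blocked here sits idle even if high-parallelism
        # items are waiting in the queue; that is the inefficiency
        # described above.
        _low_sem.acquire()
        try:
            parsers[key](value)
        finally:
            _low_sem.release()
    else:
        parsers[key](value)

if __name__ == "__main__":
    items = [('parser2', x), ...] # Your list of tuples
    sem = BoundedSemaphore(2)
    pool = Pool(4, initializer=init_worker, initargs=(sem,))
    pool.map(process_items, items)
    pool.close()
    pool.join()

Note that the semaphore has to be passed through the Pool initializer rather than through pool.map, since multiprocessing synchronization primitives can only be shared with workers at process creation time.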

The two-pool approach would get a bit complicated if you needed to get the results from each process_items call in the same order as they fell in the original iterable, meaning the results from each Pool would need to be merged, but based on your example I don't think that's a requirement. Let me know if it is, and I'll try to adjust my answer accordingly; one way to do the merge is sketched below.
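If ordering did matter, one option is to tag each item with its original index before mapping, then sort the combined results back into input order. This is a rough sketch under the definitions above, not part of the answer itself; process_with_index and run_pool_indexed are hypothetical helpers, and it assumes the parsers return a value:

def process_with_index((idx, (key, value))):
    # Carry the item's original position through the pool
    return idx, parsers[key](value)

def run_pool_indexed(items, num_procs, check_set):
    pool = Pool(num_procs)
    out = pool.map(process_with_index,
                   ((i, item) for i, item in enumerate(items)
                    if item[0] in check_set))
    pool.close()
    pool.join()
    return out

if __name__ == "__main__":
    high = run_pool_indexed(items, 4, high_par)
    low = run_pool_indexed(items, 2, low_par)
    # Sort the combined (index, result) pairs back into input order
    merged = [result for _, result in sorted(high + low)]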

You can specify the number of parallel processes in the constructor for multiprocessing.Pool:

from multiprocessing import Pool

def f(x):
    return x*x

if __name__ == '__main__':
    pool = Pool(5)   # 5 is the number of parallel worker processes
    print pool.map(f, [1, 2, 3])
