
Python multiprocessing pool: dynamically set number of processes during execution of tasks

We submit large CPU-intensive jobs in Python 2.7 (consisting of many independent parallel processes) on our development machine, and they last for days at a time. The responsiveness of the machine slows down a lot when these jobs are running with a large number of processes. Ideally, I would like to limit the number of CPUs used during the day while we're developing code, and overnight run as many processes as efficiently as possible.

The Python multiprocessing library allows you to specify the number of processes when you initiate a Pool. Is there a way to dynamically change this number each time a new task is initiated?

For instance, allow 20 processes to run during the hours 19-07 and 10 processes during the hours 07-19.
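A small time-of-day policy like the one described could be sketched as follows (the 07:00/19:00 boundaries and the counts 10/20 come from the example above; the helper name `allowed_processes` is hypothetical):

```python
from datetime import datetime

def allowed_processes(now=None):
    """Return the desired process limit for the given (or current) time:
    10 workers during working hours (07:00-19:00), 20 overnight."""
    now = now or datetime.now()
    return 10 if 7 <= now.hour < 19 else 20
```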

One way would be to check the number of active processes using significant CPU. This is how I would like it to work:

from multiprocessing import Pool
import time 

pool = Pool(processes=20)

def big_task(x):
    while check_n_process(processes=10) is False:
        time.sleep(60*60)
    x += 1
    return x 


x = 1
multiple_results = [pool.apply_async(big_task, (x,)) for i in range(1000)]
print([res.get() for res in multiple_results])

But I would need to write the 'check_n_process' function.
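One plausible shape for such a check (an assumption, not part of the question: it uses the system load average as a rough proxy for the number of busy CPU-bound processes, so it only works on Unix-like systems):

```python
import os

def check_n_process(processes=10):
    """Hypothetical sketch: return True when the 1-minute load average
    suggests fewer than `processes` CPU-bound workers are running."""
    load_1min = os.getloadavg()[0]
    return load_1min < processes
```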

Any other ideas on how this problem could be solved?

(The code needs to run in Python 2.7 - a bash implementation is not feasible.)

Python multiprocessing.Pool does not provide a way to change the number of workers of a running Pool. A simple solution would be relying on third-party tools.

The Pool provided by billiard used to provide such a feature.

Task queue frameworks like Celery or Luigi surely allow a flexible workload but are way more complex.

If the use of external dependencies is not feasible, you can try the following approach. Elaborating on this answer, you could set up a throttling mechanism based on a Semaphore.

from threading import Semaphore, Lock
from multiprocessing import Pool

class TaskManager(object):
    def __init__(self, pool_size):
        self.pool = Pool(processes=pool_size)
        self.workers = Semaphore(pool_size)
        # ensures the semaphore is not replaced while in use
        self.workers_mutex = Lock()

    def change_pool_size(self, new_size):
        """Set the concurrency limit to a new size."""
        with self.workers_mutex:
            self.workers = Semaphore(new_size)

    def new_task(self, task):
        """Start a new task; blocks if all worker slots are taken."""
        with self.workers_mutex:
            workers = self.workers
        # acquire outside the lock, so task_done can still release a slot
        workers.acquire()
        self.pool.apply_async(big_task, args=[task], callback=self.task_done)

    def task_done(self, result):
        """Callback for apply_async; frees a worker slot once a task is done."""
        with self.workers_mutex:
            self.workers.release()

The pool would block further attempts to schedule your big_tasks if more than X workers are busy. By controlling this mechanism you can throttle the number of processes running concurrently. Of course, this means that you give up the Pool's internal queueing mechanism.

task_manager = TaskManager(20)

while True:
    if seven_in_the_morning():
        task_manager.change_pool_size(10)
    if seven_in_the_evening():
        task_manager.change_pool_size(20)

    task = get_new_task()
    task_manager.new_task(task)  # blocks here if all workers are busy

This is woefully incomplete (and an old question), but you can manage the load by keeping track of the running processes and only calling apply_async() when it's favorable; if each job runs for less than forever, you can drop the load by dispatching fewer jobs during working hours, or when os.getloadavg() is too high. I do this to manage network load when running multiple "scp"s to evade traffic shaping on our internal network (don't tell anyone!).
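The dispatch-side throttling this answer describes could be sketched as follows (a minimal sketch, not the answerer's actual code: `slow_square`, `current_limit`, and the limits 2/4 are made-up stand-ins, and the policy could equally test os.getloadavg() instead of the clock):

```python
import time
from datetime import datetime
from multiprocessing import Pool

def slow_square(x):
    """Stand-in for a long-running job."""
    time.sleep(0.1)
    return x * x

def current_limit():
    # Hypothetical policy: fewer in-flight jobs during working hours (07-19).
    return 2 if 7 <= datetime.now().hour < 19 else 4

def run_throttled(jobs):
    """Dispatch jobs through a fixed-size Pool, but keep the number of
    in-flight tasks at or below a limit re-read before each dispatch."""
    pool = Pool(processes=4)  # hard upper bound on workers
    in_flight = []
    results = []
    for job in jobs:
        # wait until a slot is free under the *current* limit
        while len(in_flight) >= current_limit():
            in_flight = [r for r in in_flight if not r.ready()]
            time.sleep(0.01)
        result = pool.apply_async(slow_square, (job,))
        in_flight.append(result)
        results.append(result)
    pool.close()
    pool.join()
    return [r.get() for r in results]
```

Because the limit is re-evaluated before every dispatch, lowering it takes effect gradually as running jobs finish, without touching the Pool itself.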
