
Celery - bulk queue tasks

I have some code that queues a large number (thousands) of Celery tasks; for example's sake, let's say it's this:

for x in xrange(2000):
    example_task.delay(x)

Is there a better/more efficient way of queuing a large number of tasks at once? They all have different arguments.

We ran into this problem too when we wanted to use Celery to process several million PDFs. Our solution was to write something we call the CeleryThrottle. Basically, you configure the throttle with a desired Celery queue and the number of tasks you want in it, and then you create your tasks in a loop. As you create tasks, the throttle monitors the length of the actual queue. If it's being depleted too quickly, it speeds up your loop for a while so more tasks are added to the queue. If the queue is growing too large, it slows down your loop and lets some of the tasks complete.

Here's the code:

import time
from collections import deque
from datetime import datetime

# The original post assumes two helpers that aren't shown: now(), which
# returns the current datetime (datetime.now works here; in a Django
# project it would likely be django.utils.timezone.now), and
# get_queue_length(), sketched after the class.
now = datetime.now


class CeleryThrottle(object):
    """A class for throttling celery."""

    def __init__(self, min_items=100, queue_name='celery'):
        """Create a throttle to prevent celery run aways.

        :param min_items: The minimum number of items that should be enqueued. 
        A maximum of 2× this number may be created. This minimum value is not 
        guaranteed and so a number slightly higher than your max concurrency 
        should be used. Note that this number includes all tasks unless you use
        a specific queue for your processing.
        """
        self.min = min_items
        self.max = self.min * 2

        # Variables used to track the queue and wait-rate
        self.last_processed_count = 0
        self.count_to_do = self.max
        self.last_measurement = None
        self.first_run = True

        # Use a fixed-length queue to hold last N rates
        self.rates = deque(maxlen=15)
        self.avg_rate = self._calculate_avg()

        # For inspections
        self.queue_name = queue_name

    def _calculate_avg(self):
        return float(sum(self.rates)) / (len(self.rates) or 1)

    def _add_latest_rate(self):
        """Calculate the rate that the queue is processing items."""
        right_now = now()
        elapsed_seconds = (right_now - self.last_measurement).total_seconds()
        self.rates.append(self.last_processed_count / elapsed_seconds)
        self.last_measurement = right_now
        self.last_processed_count = 0
        self.avg_rate = self._calculate_avg()

    def maybe_wait(self):
        """Stall the calling function or let it proceed, depending on the queue.

        The idea here is to check the length of the queue as infrequently as
        possible while keeping the number of items in the queue between
        self.min and self.max as much as possible.

        We do this by immediately enqueueing self.max items. After that, we 
        monitor the queue to determine how quickly it is processing items. Using 
        that rate we wait an appropriate amount of time or immediately press on.
        """
        self.last_processed_count += 1
        if self.count_to_do > 0:
            # Do not wait. Allow process to continue.
            if self.first_run:
                self.first_run = False
                self.last_measurement = now()
            self.count_to_do -= 1
            return

        self._add_latest_rate()
        task_count = get_queue_length(self.queue_name)
        if task_count > self.min:
            # Estimate how long the surplus will take to complete and wait that
            # long + 5% to ensure we're below self.min on next iteration.
            surplus_task_count = task_count - self.min
            wait_time = (surplus_task_count / self.avg_rate) * 1.05
            time.sleep(wait_time)

            # After waiting we should be back near self.min; allow tasks to
            # be created until self.max is reached again. (task_count is the
            # pre-sleep measurement.)
            if task_count < self.max:
                self.count_to_do = self.max - self.min
            return

        else:
            # The queue is at or below self.min: top it back up to self.max.
            self.count_to_do = self.max - task_count
            return
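
The class above also calls a get_queue_length() helper that the original post doesn't show. Here's a minimal sketch, assuming a Redis broker, where each Celery queue is a plain Redis list; the helper name matches the calls above, but the connection details are assumptions:

import redis

redis_conn = redis.StrictRedis(host='localhost', port=6379, db=0)

def get_queue_length(queue_name='celery'):
    """Return the number of tasks waiting in the named broker queue.

    With the Redis broker, each Celery queue is a Redis list, so its
    length is a single LLEN call. Other brokers need a different
    inspection mechanism.
    """
    return redis_conn.llen(queue_name)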

And we use it like so:

throttle = CeleryThrottle(min_items=30, queue_name=queue)
for item in items:
    throttle.maybe_wait()
    do_something.delay(item)

So it's pretty simple to use, and it does a pretty good job of keeping the queue in a happy place: not too long, not too short. It keeps a rolling average of the rate at which the queue is depleting and adjusts its own timers accordingly. For example, if the rolling average says the queue drains at 10 tasks per second and 50 surplus tasks remain, maybe_wait() sleeps for (50 / 10) × 1.05 = 5.25 seconds.

Invoking a large number of tasks at once may not be healthy for your Celery workers. Also, if you are considering collecting the results of the invoked tasks, your code will not be optimal.

You can chunk your tasks into batches of a certain size. Consider the example mentioned in the link below, and the short sketch after it.

http://docs.celeryproject.org/en/latest/userguide/canvas.html#chunks
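
A minimal sketch of that approach, assuming the example_task from the question takes a single integer argument (the task body here is illustrative): chunks(it, n) groups the argument tuples so that each message on the broker covers n calls.

from celery import shared_task

@shared_task
def example_task(x):
    return x * 2

# 2000 argument tuples split into chunks of 10: the broker sees 200
# chunk tasks instead of 2000 individual ones.
result = example_task.chunks(zip(range(2000)), 10).apply_async()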
