
Starting celery worker from multiprocessing

I'm new to Celery. All of the examples I've seen start a Celery worker from the command line, e.g.:

$ celery -A proj worker -l info

I'm starting a project on elastic beanstalk and thought it would be nice to have the worker be a subprocess of my web app. I tried using multiprocessing and it seems to work. I'm wondering if this is a good idea, or if there might be some disadvantages.

import celery
import multiprocessing


class WorkerProcess(multiprocessing.Process):
    def __init__(self):
        super().__init__(name='celery_worker_process')

    def run(self):
        argv = [
            'worker',
            '--loglevel=WARNING',
            '--hostname=local',
        ]
        app.worker_main(argv)


def start_celery():
    global worker_process
    worker_process = WorkerProcess()
    worker_process.start()


def stop_celery():
    global worker_process
    if worker_process:
        worker_process.terminate()
        worker_process = None


worker_name = 'celery@local'
worker_process = None

app = celery.Celery()
app.config_from_object('celery_app.celeryconfig')

Seems like a good option, definitely not the only option but a good one :)

One thing you might want to look into (you might already be doing this) is linking the autoscaling to the size of your Celery queue, so you only scale up when the queue is growing.
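
If you go that route, one way to read the queue depth is to ask the broker directly. A minimal sketch, assuming RabbitMQ as the broker and the default 'celery' queue name (both assumptions on my part), using the app object from the question:

def get_queue_depth(celery_app, queue_name='celery'):
    # passive=True only inspects the queue; it won't create it if it's missing
    with celery_app.connection_or_acquire() as conn:
        return conn.default_channel.queue_declare(
            queue=queue_name, passive=True).message_count

An autoscaling trigger (a CloudWatch custom metric, for example) could then publish that number periodically.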

Effectively, Celery does something similar internally of course (it manages its own pool of worker processes), so there's not a lot of difference between your approach and the usual command-line startup. The only snag I can think of is the handling of external resources (database connections, for example); that might be a problem, but it's completely dependent on what you are doing with Celery.
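
To illustrate the external-resources point: with a fork start method, anything the web process has already opened (a database connection pool, for instance) is inherited by the worker child. A minimal sketch of one way to deal with that, using Celery's worker_process_init signal; the myapp.db engine import is a hypothetical stand-in for whatever resource you hold:

from celery.signals import worker_process_init


@worker_process_init.connect
def reset_db_connections(**kwargs):
    # Dispose of pooled connections inherited from the parent process so
    # each worker process opens fresh ones on first use.
    from myapp.db import engine  # hypothetical SQLAlchemy engine
    engine.dispose()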

If anyone is interested, I did get this working on Elastic Beanstalk with a pre-configured AMI server running Python 3.4. I had a lot of problems with the Docker-based server running Debian Jessie, possibly something to do with port remapping. Docker is kind of a black box, and I've found it very hard to work with and debug. Fortunately, the good folks at AWS added a non-Docker Python 3.4 option on April 8, 2015.

I did a lot of searching to get this deployed and working, and saw lots of questions without answers. So here's my very simple deployed Python 3.4 / Flask / Celery setup.

Celery itself you can just pip install. RabbitMQ needs to be installed from an .ebextensions configuration file, using either a command or a container_command. I'm using a script included in my uploaded project zip, so a container_command is necessary to run it (the regular commands section runs before the project is installed).

[yourapproot]/.ebextensions/05_install_rabbitmq.config:

container_commands:
  01RunScript:
    command: bash ./init_scripts/app_setup.sh

[yourapproot]/init_scripts/app_setup.sh:

#!/usr/bin/env bash

# Install Erlang (a RabbitMQ dependency); -y keeps yum non-interactive
yum -y install erlang

# Download the latest RabbitMQ package using wget:
wget http://www.rabbitmq.com/releases/rabbitmq-server/v3.5.1/rabbitmq-server-3.5.1-1.noarch.rpm

# Install rabbit
rpm --import http://www.rabbitmq.com/rabbitmq-signing-key-public.asc
yum -y install rabbitmq-server-3.5.1-1.noarch.rpm

# Start server
/sbin/service rabbitmq-server start
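
As an optional sanity check (just a sketch, not part of the script above), you can verify the broker is reachable from Python before starting any workers, using kombu, which ships with Celery. The URL and credentials assume a default local RabbitMQ install:

from kombu import Connection

with Connection('amqp://guest:guest@localhost//') as conn:
    # ensure_connection raises if the broker can't be reached within the retries
    conn.ensure_connection(max_retries=3)
    print('RabbitMQ broker is reachable')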

I'm doing a Flask app, so I start up the workers before the first request:

@app.before_first_request
def before_first_request():
    task_mgr.start_celery()

The task_mgr creates the Celery app object (which I call celery, since the Flask app object is app). The -Ofair flag is pretty key here for a simple task manager: with the default prefetch behavior, long-running tasks can hold back other tasks that were already reserved for the same pool process, which leads to all kinds of strange behavior. Maybe this should be the default?

task_mgr/task_mgr.py:

import celery as celery_module
import multiprocessing


class WorkerProcess(multiprocessing.Process):
    def __init__(self):
        super().__init__(name='celery_worker_process')

    def run(self):
        argv = [
            'worker',
            '--loglevel=WARNING',
            '--hostname=local',
            '-Ofair',
        ]
        celery.worker_main(argv)


def start_celery():
    global worker_process
    multiprocessing.set_start_method('fork')  # 'spawn' seems to work also
    worker_process = WorkerProcess()
    worker_process.start()


def stop_celery():
    global worker_process
    if worker_process:
        worker_process.terminate()
        worker_process = None


worker_name = 'celery@local'
worker_process = None

celery = celery_module.Celery()
celery.config_from_object('task_mgr.celery_config')
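
A possible addition (just a sketch): nothing above calls stop_celery() automatically, so the worker child can outlive the web process. Registering the cleanup with atexit covers the normal shutdown path:

import atexit

from task_mgr import task_mgr

# Terminate the worker subprocess when the web process exits so it
# doesn't linger as an orphan.
atexit.register(task_mgr.stop_celery)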

My config is pretty simple so far:

task_mgr/celery_config.py:

BROKER_URL = 'amqp://'
CELERY_RESULT_BACKEND = 'amqp://'

CELERY_ACCEPT_CONTENT = ['json']
CELERY_TASK_SERIALIZER = 'json'  # could also be 'pickle'; note datetime objects aren't JSON serializable
CELERY_RESULT_SERIALIZER = 'json'  # could also be 'pickle'; note datetime objects aren't JSON serializable
CELERY_TASK_RESULT_EXPIRES = 18000  # Results hang around for 5 hours

CELERYD_CONCURRENCY = 4
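
Related to the -Ofair note above, a setting you could consider adding here (just a sketch, I haven't relied on it in this setup) is the broker prefetch limit, which is often combined with -Ofair for long-running tasks:

# Reserve roughly one task at a time per worker process. This is the
# Celery 3.x setting name; newer versions call it worker_prefetch_multiplier.
CELERYD_PREFETCH_MULTIPLIER = 1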

Then you can put tasks wherever you need them:

from task_mgr.task_mgr import celery
import time


@celery.task(bind=True)
def error_task(self):
    self.update_state(state='RUNNING')
    time.sleep(10)
    raise KeyError('im an error')


@celery.task(bind=True)
def long_task(self):
    self.update_state(state='RUNNING')
    time.sleep(20)
    return 'long task finished'


@celery.task(bind=True)
def task_with_status(self, wait):
    self.update_state(state='RUNNING')
    for i in range(5):
        time.sleep(wait)
        self.update_state(
            state='PROGRESS',
            meta={
                'current': i + 1,
                'total': 5,
                'status': 'progress',
                'host': self.request.hostname,
            }
        )
    time.sleep(wait)
    return 'finished with wait = ' + str(wait)
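
For reference, a task like task_with_status can also be queued and polled directly through its AsyncResult, without the task_queue helper I describe next; a minimal usage sketch:

result = task_with_status.apply_async(args=(2,))
print(result.id, result.state)   # PENDING at first, then RUNNING / PROGRESS
print(result.get(timeout=120))   # blocks until 'finished with wait = 2'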

I also keep a task queue to hold the async results so I can monitor the tasks:

task_queue = []


def queue_task(task, *args):
    async_result = task.apply_async(args)
    task_queue.append(
        {
            'task_name': task.__name__,
            'task_args': args,
            'async_result': async_result
        }
    )
    return async_result


def get_tasks_info():
    tasks = []

    for task in task_queue:
        task_name = task['task_name']
        task_args = task['task_args']
        async_result = task['async_result']
        task_id = async_result.id
        task_state = async_result.state
        task_result_info = async_result.info
        task_result = async_result.result
        tasks.append(
            {
                'task_name': task_name,
                'task_args': task_args,
                'task_id': task_id,
                'task_state': task_state,
                'task_result.info': task_result_info,
                'task_result': task_result,
            }
        )

    return tasks

And of course, start the tasks where you need to:

from webapp.app import app
from flask import url_for, render_template, redirect
from webapp import tasks
from task_mgr import task_mgr


@app.route('/start_all_tasks')
def start_all_tasks():
    task_mgr.queue_task(tasks.long_task)
    task_mgr.queue_task(tasks.error_task)
    for i in range(1, 9):
        task_mgr.queue_task(tasks.task_with_status, i * 2)

    return redirect(url_for('task_status'))


@app.route('/task_status')
def task_status():
    current_tasks = task_mgr.get_tasks_info()
    return render_template(
        'parse/task_status.html',
        tasks=current_tasks
    )

And that's about it. Let me know if you need any help, though my celery knowledge is still fairly limited.
