
How to configure Celery workers in a distributed Airflow architecture using docker-compose?

I'm setting up a distributed Airflow cluster where everything except the Celery workers runs on one host and processing is done on several other hosts. The Airflow 2.0 setup is configured using the YAML file given in the Airflow documentation, https://airflow.apache.org/docs/apache-airflow/stable/docker-compose.yaml . In my initial tests I got the architecture to work nicely when I ran everything on the same host. The question is: how do I start the Celery workers on the remote hosts?

So far, I have tried to create a trimmed version of the above docker-compose file where I only start the Celery workers on the worker host and nothing else, but I run into issues with the DB connection. In the trimmed version I changed the URLs so that they point to the host that runs the DB and Redis.

The dags, logs and plugins directories and the PostgreSQL DB are located on a shared drive that is visible to all hosts.

How should I do the configuration? Any ideas what to check (connections etc.)? Here is the Celery worker docker-compose configuration:

---
version: '3'
x-airflow-common:
  &airflow-common
  image: ${AIRFLOW_IMAGE_NAME:-apache/airflow:2.1.0}
  environment:
    &airflow-common-env
    AIRFLOW_UID: 50000
    AIRFLOW_GID: 50000
    AIRFLOW__CORE__EXECUTOR: CeleryExecutor
    AIRFLOW__CORE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@airflowhost.example.com:8080/airflow
    AIRFLOW__CELERY__RESULT_BACKEND: db+postgresql://airflow:airflow@airflow@airflowhost.example.com:8080/airflow
    AIRFLOW__CELERY__BROKER_URL: redis://:@airflow@airflowhost.example.com:6380/0
    AIRFLOW__CORE__FERNET_KEY: ''
    AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION: 'true'
    AIRFLOW__CORE__LOAD_EXAMPLES: 'true'
    AIRFLOW__API__AUTH_BACKEND: 'airflow.api.auth.backend.basic_auth'
    REDIS_PORT: 6380
  volumes:
    - /airflow/dev/dags:/opt/airflow/dags
    - /airflow/dev/logs:/opt/airflow/logs
    - /airflow/dev/plugins:/opt/airflow/plugins
  user: "${AIRFLOW_UID:-50000}:${AIRFLOW_GID:-50000}"
services:
  airflow-remote-worker:
    <<: *airflow-common
    command: celery worker
    healthcheck:
      test:
        - "CMD-SHELL"
        - 'celery --app airflow.executors.celery_executor.app inspect ping -d "celery@$${HOSTNAME}"'
      interval: 10s
      timeout: 10s
      retries: 5
    restart: always

EDIT 1: I'm still having some difficulties with the log files. It appears that sharing the log directory doesn't solve the issue of missing log files. I added the extra_hosts definition on the main node as suggested and opened port 8793 on the worker machine. The worker tasks fail with the following log:

*** Log file does not exist: /opt/airflow/logs/tutorial/print_date/2021-07-01T13:57:11.087882+00:00/1.log
*** Fetching from: http://:8793/log/tutorial/print_date/2021-07-01T13:57:11.087882+00:00/1.log
*** Failed to fetch log file from worker. Unsupported URL protocol ''

Far from being the "ultimate set-up", these are some settings that worked for me, using the docker-compose file from Airflow on the core node and on the workers:

Main node:

  • The worker nodes have to be reachable from the main node where the Webserver runs. I found this diagram of the CeleryExecutor architecture to be very helpful to sort things out.

    When trying to read the logs, if they are not found locally, the Webserver will try to retrieve them from the remote worker. Your main node may therefore not know the hostnames of your workers, so you either change how the hostnames are resolved (the hostname_callable setting, which defaults to socket.getfqdn) or you simply add name-resolution capability to the Webserver. The latter can be done by adding the extra_hosts config key in the x-airflow-common definition (a minimal sketch of the hostname_callable alternative is shown after this list):

---
version: "3"
x-airflow-common: &airflow-common
  image: ${AIRFLOW_IMAGE_NAME:-apache/airflow:2.1.0}
  environment: &airflow-common-env
    ...# env vars
  extra_hosts:
    - "worker-01-hostname:worker-01-ip-address" # "worker-01-hostname:192.168.0.11"
    - "worker-02-hostname:worker-02-ip-address"

    Note that in your specific case, since you have a shared drive, I think the logs will be found locally anyway.

  • Define parallelism, DAG concurrency, and scheduler parsing processes. This can be done using env vars:
x-airflow-common: &airflow-common
  image: ${AIRFLOW_IMAGE_NAME:-apache/airflow:2.1.0}
  environment: &airflow-common-env
    AIRFLOW__CORE__PARALLELISM: 64
    AIRFLOW__CORE__DAG_CONCURRENCY: 32
    AIRFLOW__SCHEDULER__PARSING_PROCESSES: 4

Of course, the values to be set depend on your specific case and the available resources. This article has a good overview of the subject. DAG settings can also be overridden at the DAG-definition level.
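Coming back to the first point above: instead of (or in addition to) extra_hosts, the hostname_callable setting could be overridden through an environment variable. This is only a minimal sketch, assuming socket.gethostname returns a name that the Webserver can actually resolve:

x-airflow-common: &airflow-common
  environment: &airflow-common-env
    # default is socket.getfqdn; any importable callable that returns a
    # name reachable from the main node can be used instead
    AIRFLOW__CORE__HOSTNAME_CALLABLE: 'socket.gethostname'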

Worker nodes:

  • Define the worker CELERY__WORKER_CONCURRENCY; a sensible default could be the number of CPUs available on the machine (docs).

  • Define how to reach the services running on the main node. Set an IP or hostname and watch out for matching the exposed ports on the main node:

x-airflow-common: &airflow-common
  image: ${AIRFLOW_IMAGE_NAME:-apache/airflow:2.1.0}
  environment: &airflow-common-env
    AIRFLOW__CORE__EXECUTOR: CeleryExecutor
    AIRFLOW__CELERY__WORKER_CONCURRENCY: 8
    AIRFLOW__CORE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@main_node_ip_or_hostname:5432/airflow # 5432 is the default Postgres port
    AIRFLOW__CELERY__RESULT_BACKEND: db+postgresql://airflow:airflow@main_node_ip_or_hostname:5432/airflow
    AIRFLOW__CELERY__BROKER_URL: redis://:@main_node_ip_or_hostname:6379/0
    AIRFLOW__CORE__FERNET_KEY: ${FERNET_KEY}
    AIRFLOW__WEBSERVER__SECRET_KEY: ${SECRET_KEY}
  env_file:
    - .env

.env file: FERNET_KEY=jvYUaxxxxxxxxxxxxx=
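Since the common environment above also references ${SECRET_KEY}, the .env file presumably needs to provide that value as well, and it must match the Webserver secret key used on the main node. A sketch with placeholder values:

# .env (placeholder values)
FERNET_KEY=jvYUaxxxxxxxxxxxxx=
SECRET_KEY=use-the-same-webserver-secret-key-as-on-the-main-node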

  • It's critical that every node in the cluster (main and workers) has the same settings applied.

  • Define a hostname for the worker service to avoid an autogenerated hostname matching the container ID.

  • Expose port 8793, which is the default port used to fetch the logs from the worker (docs):

services:
  airflow-worker:
    <<: *airflow-common
    hostname: ${HOSTNAME}
    ports:
      - 8793:8793
    command: celery worker
    restart: always
  • Make sure every worker node host runs with the same time configuration; a difference of a few minutes can cause serious execution errors which may not be easy to track down. Consider enabling an NTP service on the host OS.

If you have heavy workloads and high concurrency in general, you may need to tune Postgres settings such as max_connections and shared_buffers. The same applies to host OS network settings such as ip_local_port_range or somaxconn.
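As an illustration only (the numbers below are placeholders, not tuning advice), the Postgres settings could be passed on the command line of the postgres service in the main node's compose file. ip_local_port_range and somaxconn are kernel parameters, so they would be changed with sysctl on the host OS rather than in Compose:

# Main node docker-compose.yml, postgres service (sketch with placeholder values)
services:
  postgres:
    image: postgres:13
    command: postgres -c max_connections=500 -c shared_buffers=1GB
    environment:
      POSTGRES_USER: airflow
      POSTGRES_PASSWORD: airflow
      POSTGRES_DB: airflow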

For any issue I encountered during the initial cluster setup, Flower and the worker execution logs always provided helpful details and error messages, both the task-level logs and the Docker Compose service log, e.g.: docker-compose logs --tail=10000 airflow-worker > worker_logs.log.
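If Flower is not already running on your main node, it is defined in the official docker-compose.yaml roughly as follows (trimmed sketch); its UI is then reachable on port 5555:

services:
  flower:
    <<: *airflow-common
    command: celery flower
    ports:
      - 5555:5555
    restart: always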

Hope that works for you!

The following considerations build on the accepted answer, as I think they might be relevant to any new Airflow Celery setup:

  • Enabling remote logging usually comes in handy in a distributed setup as a way to centralize logs. Airflow supports remote logging natively; see e.g. this or this.
  • Defining worker_autoscale instead of concurrency allows processes to be started/stopped dynamically as the workload increases/decreases.
  • Setting the environment variable DUMB_INIT_SETSID to 0 in the worker's environment allows for warm shutdowns (see the docs). A combined sketch of these first three settings is given after the example below.
  • Adding volumes to the worker in the Docker Compose file pointing at Airflow's base_log_folder allows the worker logs to be safely persisted locally. Example:
# docker-compose.yml

services:
  airflow-worker:
    ...
    volumes:
      - worker_logs:/airflow/logs
    ...
  ...
volumes:
  worker_logs:
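Putting the first three points together, a worker-side sketch could look like this, assuming the airflow-common anchors from the official compose file; the S3 path and connection id are placeholders, and other remote log backends work similarly:

services:
  airflow-worker:
    <<: *airflow-common
    environment:
      <<: *airflow-common-env
      # ship task logs to central storage (placeholder bucket and connection id)
      AIRFLOW__LOGGING__REMOTE_LOGGING: 'true'
      AIRFLOW__LOGGING__REMOTE_BASE_LOG_FOLDER: 's3://my-airflow-logs/logs'
      AIRFLOW__LOGGING__REMOTE_LOG_CONN_ID: 'aws_default'
      # autoscale between 4 and 16 task processes instead of a fixed worker_concurrency
      AIRFLOW__CELERY__WORKER_AUTOSCALE: '16,4'
      # allow warm shutdown so running tasks can finish on SIGTERM
      DUMB_INIT_SETSID: '0'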

I can't solve my problem, can you help me? I used Docker to deploy my Airflow and Celery setup. Could you send me a main node docker-compose.yml and a worker docker-compose.yml? Thanks very much!

*** Log file does not exist: /opt/airflow/logs/dag_id=example_bash_operator/run_id=scheduled__2022-09-23T00:00:00+00:00/task_id=runme_1/attempt=1.log
*** Fetching from: http://eosbak01.zzz.ac.cn:8793/log/dag_id=example_bash_operator/run_id=scheduled__2022-09-23T00:00:00+00:00/task_id=runme_1/attempt=1.log
*** !!!! Please make sure that all your Airflow components (e.g. schedulers, webservers and workers) have the same 'secret_key' configured in 'webserver' section and time is synchronized on all your machines (for example with ntpd) !!!!!
****** See more at https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#secret-key
****** Failed to fetch log file from worker. Client error '403 FORBIDDEN' for url 'http://eosbak01.zzz.ac.cn:8793/log/dag_id=example_bash_operator/run_id=scheduled__2022-09-23T00:00:00+00:00/task_id=runme_1/attempt=1.log'
For more information check: https://httpstatuses.com/403

WorkerNode

[airflow@eosbak01 deploy]$ docker ps
CONTAINER ID   IMAGE                  COMMAND                  CREATED              STATUS                        PORTS      NAMES
2ef39a54de97   apache/airflow:2.3.4   "/usr/bin/dumb-init …"   About a minute ago   Up About a minute (healthy)   8080/tcp   deploy_airflow-worker_

MainNode

[airflow@eosbak02 deploy]$ docker ps
CONTAINER ID   IMAGE                  COMMAND                  CREATED          STATUS                    PORTS                                                  NAMES
1e6a7e50831d   apache/airflow:2.3.4   "/usr/bin/dumb-init …"   26 minutes ago   Up 26 minutes (healthy)   0.0.0.0:5555->5555/tcp, :::5555->5555/tcp, 8080/tcp    deploy_flower_1
9afb5985b9f3   apache/airflow:2.3.4   "/usr/bin/dumb-init …"   27 minutes ago   Up 27 minutes (healthy)   0.0.0.0:8080->8080/tcp, :::8080->8080/tcp              deploy_airflow-webserver_1
80132177ae3d   apache/airflow:2.3.4   "/usr/bin/dumb-init …"   27 minutes ago   Up 27 minutes (healthy)   8080/tcp                                               deploy_airflow-triggerer_1
6ea5a0ed7dec   apache/airflow:2.3.4   "/usr/bin/dumb-init …"   27 minutes ago   Up 27 minutes (healthy)   8080/tcp                                               deploy_airflow-scheduler_1
2787acb189ad   mysql:8.0.27           "docker-entrypoint.s…"   29 minutes ago   Up 29 minutes (healthy)   0.0.0.0:3306->3306/tcp, :::3306->3306/tcp, 33060/tcp   deploy_mysql_1
057af26f6070   redis:latest           "docker-entrypoint.s…"   29 minutes ago   Up 29 minutes (healthy)   0.0.0.0:6379->6379/tcp, :::6379->6379/tcp              deploy_redis_1
