
How to configure celery worker on distributed airflow architecture using docker-compose?

I'm setting up a distributed Airflow cluster where everything except the Celery workers runs on one host and processing is done on several other hosts. The Airflow 2.0 setup is configured using the yaml file given in the Airflow documentation https://airflow.apache.org/docs/apache-airflow/stable/docker-compose.yaml . In my initial tests the architecture worked nicely when I ran everything on the same host. The question is: how do I start the Celery workers on the remote hosts?

So far, I have tried to create a trimmed version of the above docker-compose file in which I only start the Celery workers on the worker host and nothing else, but I run into some issues with the database connection. In the trimmed version I changed the URLs so that they point to the host that runs the database and Redis.

The dags, logs, plugins and the PostgreSQL database are located on a shared drive that is visible to all hosts.

How should I do the configuration? Any ideas what to check? Connections, etc.? Here is the Celery worker docker-compose configuration:

---
version: '3'
x-airflow-common:
  &airflow-common
  image: ${AIRFLOW_IMAGE_NAME:-apache/airflow:2.1.0}
  environment:
    &airflow-common-env
    AIRFLOW_UID: 50000
    AIRFLOW_GID: 50000
    AIRFLOW__CORE__EXECUTOR: CeleryExecutor
    AIRFLOW__CORE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@airflowhost.example.com:8080/airflow
    AIRFLOW__CELERY__RESULT_BACKEND: db+postgresql://airflow:airflow@airflow@airflowhost.example.com:8080/airflow
    AIRFLOW__CELERY__BROKER_URL: redis://:@airflow@airflowhost.example.com:6380/0
    AIRFLOW__CORE__FERNET_KEY: ''
    AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION: 'true'
    AIRFLOW__CORE__LOAD_EXAMPLES: 'true'
    AIRFLOW__API__AUTH_BACKEND: 'airflow.api.auth.backend.basic_auth'
    REDIS_PORT: 6380
  volumes:
    - /airflow/dev/dags:/opt/airflow/dags
    - /airflow/dev/logs:/opt/airflow/logs
    - /airflow/dev/plugins:/opt/airflow/plugins
  user: "${AIRFLOW_UID:-50000}:${AIRFLOW_GID:-50000}"
services:
  airflow-remote-worker:
    <<: *airflow-common
    command: celery worker
    healthcheck:
      test:
        - "CMD-SHELL"
        - 'celery --app airflow.executors.celery_executor.app inspect ping -d "celery@$${HOSTNAME}"'
      interval: 10s
      timeout: 10s
      retries: 5
    restart: always

EDIT 1: I'm still having some difficulties with the log files. It appears that sharing the log directory doesn't solve the issue of missing log files. I added the extra_hosts definition on the main node as suggested and opened port 8793 on the worker machine. The worker tasks fail with the log:

*** Log file does not exist: /opt/airflow/logs/tutorial/print_date/2021-07-01T13:57:11.087882+00:00/1.log
*** Fetching from: http://:8793/log/tutorial/print_date/2021-07-01T13:57:11.087882+00:00/1.log
*** Failed to fetch log file from worker. Unsupported URL protocol ''

Far from being the "ultimate set-up", these are some settings that worked for me, using the docker-compose from Airflow on the core node and on the workers:

Main node:

  • The worker nodes have to be reachable from the main node where the Webserver runs. I found this diagram of the CeleryExecutor architecture very helpful for sorting things out.

    When trying to read the logs, if they are not found locally, the Webserver will try to retrieve them from the remote worker. Your main node may therefore not know the hostnames of your workers, so you either change how the hostnames are resolved (the hostname_callable setting, which defaults to socket.getfqdn) or you simply add name-resolution capability to the Webserver. This could be done by adding the extra_hosts config key in the x-airflow-common definition:

---
version: "3"
x-airflow-common: &airflow-common
  image: ${AIRFLOW_IMAGE_NAME:-apache/airflow:2.1.0}
  environment: &airflow-common-env
    ...# env vars
  extra_hosts:
    - "worker-01-hostname:worker-01-ip-address" # "worker-01-hostname:192.168.0.11"
    - "worker-02-hostname:worker-02-ip-address"

* Note that in your specific case you have a shared drive, so I think the logs will be found locally.
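
As an alternative to extra_hosts, a minimal sketch of changing the hostname resolution itself, via the hostname_callable setting mentioned above, could look like the following (switching to socket.gethostname is an assumption on my part; it only helps if the short hostname is resolvable from the main node):

x-airflow-common: &airflow-common
  environment: &airflow-common-env
    # assumption: resolve worker hostnames with socket.gethostname instead of the
    # default socket.getfqdn, so the name reported by the worker is one the
    # Webserver can actually reach
    AIRFLOW__CORE__HOSTNAME_CALLABLE: 'socket.gethostname'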

  • Define parallelism, DAG concurrency, and scheduler parsing processes. This could be done by using env vars:
x-airflow-common: &airflow-common
  image: ${AIRFLOW_IMAGE_NAME:-apache/airflow:2.1.0}
  environment: &airflow-common-env
    AIRFLOW__CORE__PARALLELISM: 64
    AIRFLOW__CORE__DAG_CONCURRENCY: 32
    AIRFLOW__SCHEDULER__PARSING_PROCESSES: 4

Of course, the values to be set depend on your specific case and the available resources. This article has a good overview of the subject. DAG settings can also be overridden at the DAG definition.

Worker nodes:

  • Define the worker CELERY__WORKER_CONCURRENCY; the default could be the number of CPUs available on the machine (docs).

  • Define how to reach the services running in the main node. Set an IP or hostname and watch out for matching exposed ports in the main node:

x-airflow-common: &airflow-common
  image: ${AIRFLOW_IMAGE_NAME:-apache/airflow:2.1.0}
  environment: &airflow-common-env
    AIRFLOW__CORE__EXECUTOR: CeleryExecutor
    AIRFLOW__CELERY__WORKER_CONCURRENCY: 8
    AIRFLOW__CORE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@main_node_ip_or_hostname:5432/airflow # 5432 is the default Postgres port
    AIRFLOW__CELERY__RESULT_BACKEND: db+postgresql://airflow:airflow@main_node_ip_or_hostname:5432/airflow
    AIRFLOW__CELERY__BROKER_URL: redis://:@main_node_ip_or_hostname:6379/0
    AIRFLOW__CORE__FERNET_KEY: ${FERNET_KEY}
    AIRFLOW__WEBSERVER__SECRET_KEY: ${SECRET_KEY}
  env_file:
    - .env

.env file: FERNET_KEY=jvYUaxxxxxxxxxxxxx=
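
A minimal sketch of such a .env file, shared by the main node and every worker so the keys match cluster-wide (the placeholder values and the generation commands in the comments are assumptions, not part of the answer itself):

# .env
# A Fernet key can be generated e.g. with:
#   python -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())"
FERNET_KEY=<your-fernet-key>
# The webserver secret key can be any random string, e.g. from: openssl rand -hex 30
SECRET_KEY=<your-webserver-secret-key>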

  • It is critical that every node in the cluster (main and workers) has the same settings applied.

  • Define a hostname for the worker service to avoid an autogenerated hostname that matches the container ID.

  • Expose port 8793, which is the default port used to fetch the logs from the worker (docs):

services:
  airflow-worker:
    <<: *airflow-common
    hostname: ${HOSTNAME}
    ports:
      - 8793:8793
    command: celery worker
    restart: always
  • Make sure every worker node host is running with the same time configuration; a few minutes' difference can cause serious execution errors which may not be easy to find. Consider enabling the NTP service on the host OS.

If you have heavy workloads and high concurrency in general, you may need to tune Postgres settings such as max_connections and shared_buffers. The same applies to host OS network settings such as ip_local_port_range or somaxconn.
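
As an illustration, a hedged sketch of such Postgres tuning in the main node's compose file, assuming the stock postgres:13 service from the official docker-compose (the values are placeholders, not recommendations; size them to your RAM and workload):

services:
  postgres:
    image: postgres:13
    # pass server settings on the command line instead of editing postgresql.conf
    command: postgres -c max_connections=500 -c shared_buffers=1GB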

For any issues I encountered during the initial cluster setup, Flower and the worker execution logs always provided helpful details and error messages, both the task-level logs and the Docker Compose service log, i.e.: docker-compose logs --tail=10000 airflow-worker > worker_logs.log

Hope that works for you!

The following considerations build on the accepted answer, as I think they might be relevant to any new Airflow Celery setup:

  • Enabling remote logging usually comes in handy in a distributed setup as a way to centralize logs. Airflow supports remote logging natively, see e.g. this or this.
  • Defining worker_autoscale instead of concurrency will allow processes to be started/stopped dynamically when the workload increases/decreases.
  • Setting the environment variable DUMB_INIT_SETSID to 0 in the worker's environment allows for warm shutdowns (see the docs). A combined sketch of these three settings follows the volume example below.
  • Adding volumes to the worker in the Docker Compose file, pointing at Airflow's base_log_folder, allows the worker logs to be safely persisted locally. Example:
# docker-compose.yml

services:
  airflow-worker:
    ...
    volumes:
      - worker_logs:/airflow/logs
    ...
  ...

volumes:
  worker_logs:
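
As announced above, a hedged sketch of the first three points expressed as environment variables (the S3 bucket, the connection id and the autoscale bounds are placeholder assumptions, and remote logging additionally requires the matching provider package and connection to be configured):

x-airflow-common: &airflow-common
  environment: &airflow-common-env
    # remote logging to object storage; bucket and connection id are placeholders
    AIRFLOW__LOGGING__REMOTE_LOGGING: 'true'
    AIRFLOW__LOGGING__REMOTE_BASE_LOG_FOLDER: 's3://your-bucket/airflow-logs'
    AIRFLOW__LOGGING__REMOTE_LOG_CONN_ID: 'aws_default'
    # autoscale between 2 and 16 worker processes instead of a fixed concurrency
    AIRFLOW__CELERY__WORKER_AUTOSCALE: '16,2'
    # allow warm shutdown of the Celery worker
    DUMB_INIT_SETSID: '0'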

I can't solve my problem, can you help me? I deployed my Airflow and Celery with Docker. Could you send me a main-node docker-compose.yml and a worker docker-compose.yml? Thanks very much!

*** Log file does not exist: /opt/airflow/logs/dag_id=example_bash_operator/run_id=scheduled__2022-09-23T00:00:00+00:00/task_id=runme_1/attempt=1.log
*** Fetching from: http://eosbak01.zzz.ac.cn:8793/log/dag_id=example_bash_operator/run_id=scheduled__2022-09-23T00:00:00+00:00/task_id=runme_1/attempt=1.log
*** !!!! Please make sure that all your Airflow components (e.g. schedulers, webservers and workers) have the same 'secret_key' configured in 'webserver' section and time is synchronized on all your machines (for example with ntpd) !!!!!
****** See more at https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#secret-key
****** Failed to fetch log file from worker. Client error '403 FORBIDDEN' for url 'http://eosbak01.zzz.ac.cn:8793/log/dag_id=example_bash_operator/run_id=scheduled__2022-09-23T00:00:00+00:00/task_id=runme_1/attempt=1.log'
For more information check: https://httpstatuses.com/403

Worker node:

[airflow@eosbak01 deploy]$ docker ps
CONTAINER ID   IMAGE                  COMMAND                  CREATED              STATUS                        PORTS      NAMES
2ef39a54de97   apache/airflow:2.3.4   "/usr/bin/dumb-init …"   About a minute ago   Up About a minute (healthy)   8080/tcp   deploy_airflow-worker_

Main node:

[airflow@eosbak02 deploy]$ docker ps
CONTAINER ID   IMAGE                  COMMAND                  CREATED          STATUS                    PORTS                                                  NAMES
1e6a7e50831d   apache/airflow:2.3.4   "/usr/bin/dumb-init …"   26 minutes ago   Up 26 minutes (healthy)   0.0.0.0:5555->5555/tcp, :::5555->5555/tcp, 8080/tcp    deploy_flower_1
9afb5985b9f3   apache/airflow:2.3.4   "/usr/bin/dumb-init …"   27 minutes ago   Up 27 minutes (healthy)   0.0.0.0:8080->8080/tcp, :::8080->8080/tcp              deploy_airflow-webserver_1
80132177ae3d   apache/airflow:2.3.4   "/usr/bin/dumb-init …"   27 minutes ago   Up 27 minutes (healthy)   8080/tcp                                               deploy_airflow-triggerer_1
6ea5a0ed7dec   apache/airflow:2.3.4   "/usr/bin/dumb-init …"   27 minutes ago   Up 27 minutes (healthy)   8080/tcp                                               deploy_airflow-scheduler_1
2787acb189ad   mysql:8.0.27           "docker-entrypoint.s…"   29 minutes ago   Up 29 minutes (healthy)   0.0.0.0:3306->3306/tcp, :::3306->3306/tcp, 33060/tcp   deploy_mysql_1
057af26f6070   redis:latest           "docker-entrypoint.s…"   29 minutes ago   Up 29 minutes (healthy)   0.0.0.0:6379->6379/tcp, :::6379->6379/tcp              deploy_redis_1
