How to configure celery worker on distributed airflow architecture using docker-compose?
I'm setting up a distributed Airflow cluster where everything except the celery workers runs on one host and processing is done on several hosts. The airflow2.0 setup is configured using the yaml file given in the Airflow documentation https://airflow.apache.org/docs/apache-airflow/stable/docker-compose.yaml . In my initial tests the architecture worked nicely when I ran everything on the same host. The question is: how do I start the celery workers on the remote hosts?

So far, I have tried to create a trimmed version of the above docker-compose where I only start the celery workers on the worker host and nothing else, but I run into some issues with the db connection. In the trimmed version I changed the URLs so that they point to the host that runs the db and redis. dags, logs, plugins and the postgresql db are located on a shared drive that is visible to all hosts.

How should I do the configuration? Any ideas what to check? Connections etc.?

Celery worker docker-compose configuration:
---
version: '3'
x-airflow-common:
  &airflow-common
  image: ${AIRFLOW_IMAGE_NAME:-apache/airflow:2.1.0}
  environment:
    &airflow-common-env
    AIRFLOW_UID: 50000
    AIRFLOW_GID: 50000
    AIRFLOW__CORE__EXECUTOR: CeleryExecutor
    AIRFLOW__CORE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@airflowhost.example.com:8080/airflow
    AIRFLOW__CELERY__RESULT_BACKEND: db+postgresql://airflow:airflow@airflow@airflowhost.example.com:8080/airflow
    AIRFLOW__CELERY__BROKER_URL: redis://:@airflow@airflowhost.example.com:6380/0
    AIRFLOW__CORE__FERNET_KEY: ''
    AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION: 'true'
    AIRFLOW__CORE__LOAD_EXAMPLES: 'true'
    AIRFLOW__API__AUTH_BACKEND: 'airflow.api.auth.backend.basic_auth'
    REDIS_PORT: 6380
  volumes:
    - /airflow/dev/dags:/opt/airflow/dags
    - /airflow/dev/logs:/opt/airflow/logs
    - /airflow/dev/plugins:/opt/airflow/plugins
  user: "${AIRFLOW_UID:-50000}:${AIRFLOW_GID:-50000}"
services:
  airflow-remote-worker:
    <<: *airflow-common
    command: celery worker
    healthcheck:
      test:
        - "CMD-SHELL"
        - 'celery --app airflow.executors.celery_executor.app inspect ping -d "celery@$${HOSTNAME}"'
      interval: 10s
      timeout: 10s
      retries: 5
    restart: always
EDIT 1: I'm still having some difficulties with the log files. It appears that sharing the log directory doesn't solve the issue of missing log files. I added the extra_hosts definition on main like suggested and opened port 8793 on the worker machine. The worker tasks fail with this log:

*** Log file does not exist:
/opt/airflow/logs/tutorial/print_date/2021-07-01T13:57:11.087882+00:00/1.log
*** Fetching from: http://:8793/log/tutorial/print_date/2021-07-01T13:57:11.087882+00:00/1.log
*** Failed to fetch log file from worker. Unsupported URL protocol ''
Far from being the "ultimate set-up", these are some settings that worked for me using the docker-compose from Airflow in the core node and the workers:

The worker nodes have to be reachable from the main node where the Webserver runs. I found this diagram of the CeleryExecutor architecture to be very helpful to sort things out.
When trying to read the logs, if they are not found locally, Airflow will try to retrieve them from the remote worker. Thus your main node may not know the hostname of your workers, so you either change how the hostnames are being resolved (the hostname_callable setting, which defaults to socket.getfqdn) or you simply add name resolution capability to the Webserver. This could be done by adding the extra_hosts config key in the x-airflow-common definition:
---
version: "3"
x-airflow-common: &airflow-common
  image: ${AIRFLOW_IMAGE_NAME:-apache/airflow:2.1.0}
  environment: &airflow-common-env
    ... # env vars
  extra_hosts:
    - "worker-01-hostname:worker-01-ip-address" # e.g. "worker-01-hostname:192.168.0.11"
    - "worker-02-hostname:worker-02-ip-address"
* Note that in your specific case, since you have a shared drive, I think the logs will be found locally.
Adjust the parallelism and concurrency settings according to your workloads, for example:

x-airflow-common: &airflow-common
  image: ${AIRFLOW_IMAGE_NAME:-apache/airflow:2.1.0}
  environment: &airflow-common-env
    AIRFLOW__CORE__PARALLELISM: 64
    AIRFLOW__CORE__DAG_CONCURRENCY: 32
    AIRFLOW__SCHEDULER__PARSING_PROCESSES: 4
Of course, the values to be set depend on your specific case and available resources. This article has a good overview of the subject. These settings could also be overridden at the DAG definition level.
Define the worker CELERY__WORKER_CONCURRENCY; the default could be the number of CPUs available on the machine ( docs ).
Define how to reach the services running in the main node. Set an IP or hostname and watch out for matching exposed ports in the main node:
x-airflow-common: &airflow-common
  image: ${AIRFLOW_IMAGE_NAME:-apache/airflow:2.1.0}
  environment: &airflow-common-env
    AIRFLOW__CORE__EXECUTOR: CeleryExecutor
    AIRFLOW__CELERY__WORKER_CONCURRENCY: 8
    AIRFLOW__CORE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@main_node_ip_or_hostname:5432/airflow # 5432 is the default postgres port
    AIRFLOW__CELERY__RESULT_BACKEND: db+postgresql://airflow:airflow@main_node_ip_or_hostname:5432/airflow
    AIRFLOW__CELERY__BROKER_URL: redis://:@main_node_ip_or_hostname:6379/0
Share the same fernet key and webserver secret key across all nodes, for example through an env file:

environment: &airflow-common-env
  AIRFLOW__CORE__FERNET_KEY: ${FERNET_KEY}
  AIRFLOW__WEBSERVER__SECRET_KEY: ${SECRET_KEY}
env_file:
  - .env

.env file:

FERNET_KEY=jvYUaxxxxxxxxxxxxx=
It's critical that every node in the cluster (main and workers) has the same settings applied.
Define a hostname for the worker service to avoid an autogenerated one matching the container id.
Expose port 8793, which is the default port used to fetch the logs from the worker ( docs ):
services:
  airflow-worker:
    <<: *airflow-common
    hostname: ${HOSTNAME}
    ports:
      - 8793:8793
    command: celery worker
    restart: always
If you have heavy workloads and high concurrency in general, you may need to tune Postgres settings such as max_connections and shared_buffers. The same applies to host OS network settings such as ip_local_port_range or somaxconn.
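As a rough illustration of that kind of tuning, here is a sketch only; the values below are placeholders, not recommendations, and the service name assumes the standard compose layout:

```yaml
# Raising Postgres limits by overriding the container command:
services:
  postgres:
    image: postgres:13
    command: postgres -c max_connections=500 -c shared_buffers=1GB

# Host OS network settings are applied on the host itself, not in compose, e.g.:
#   sysctl -w net.ipv4.ip_local_port_range="15000 65000"
#   sysctl -w net.core.somaxconn=4096
```

The `-c` flags are standard Postgres server options, so this avoids maintaining a custom postgresql.conf inside the container.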
For any issues I encountered during the initial cluster setup, Flower and the worker execution logs always provided helpful details and error messages, both the task-level logs and the Docker Compose service log, i.e.: docker-compose logs --tail=10000 airflow-worker > worker_logs.log
Hope that works for you!
The following considerations build on the accepted answer, as I think they might be relevant to any new Airflow Celery setup:
* Setting worker_autoscale instead of concurrency will allow to dynamically start/stop new processes when the workload increases/decreases.
* Setting DUMB_INIT_SETSID to 0 in the worker's environment allows for warm shutdowns (see the docs ).
* Mounting a volume to the worker's base_log_folder allows to safely persist the worker logs locally. Example:

# docker-compose.yml
services:
  airflow-worker:
    ...
    volumes:
      - worker_logs:/airflow/logs
    ...
  ...
volumes:
  worker_logs:
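The first two settings could be combined in the worker service along these lines (a sketch assuming the x-airflow-common anchor from the accepted answer; the autoscale bounds are arbitrary examples):

```yaml
services:
  airflow-worker:
    <<: *airflow-common
    environment:
      <<: *airflow-common-env
      # "max,min" number of worker processes; replaces a fixed worker_concurrency
      AIRFLOW__CELERY__WORKER_AUTOSCALE: "16,4"
      # let the shutdown signal reach celery directly so running tasks can
      # finish before the worker exits (warm shutdown)
      DUMB_INIT_SETSID: "0"
    command: celery worker
```

AIRFLOW__CELERY__WORKER_AUTOSCALE is the environment-variable form of the [celery] worker_autoscale option, so no airflow.cfg edit is needed.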
I can't solve my problem, can you help me? I used docker to deploy my airflow and celery. Could you send me a MainNode docker-compose.yml and a Worker docker-compose.yml? Thanks very much!!!
*** Log file does not exist: /opt/airflow/logs/dag_id=example_bash_operator/run_id=scheduled__2022-09-23T00:00:00+00:00/task_id=runme_1/attempt=1.log
*** Fetching from: http://eosbak01.zzz.ac.cn:8793/log/dag_id=example_bash_operator/run_id=scheduled__2022-09-23T00:00:00+00:00/task_id=runme_1/attempt=1.log
*** !!!! Please make sure that all your Airflow components (e.g. schedulers, webservers and workers) have the same 'secret_key' configured in 'webserver' section and time is synchronized on all your machines (for example with ntpd) !!!!!
****** See more at https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#secret-key
****** Failed to fetch log file from worker. Client error '403 FORBIDDEN' for url 'http://eosbak01.zzz.ac.cn:8793/log/dag_id=example_bash_operator/run_id=scheduled__2022-09-23T00:00:00+00:00/task_id=runme_1/attempt=1.log'
For more information check: https://httpstatuses.com/403
[airflow@eosbak01 deploy]$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
2ef39a54de97 apache/airflow:2.3.4 "/usr/bin/dumb-init …" About a minute ago Up About a minute (healthy) 8080/tcp deploy_airflow-worker_
[airflow@eosbak02 deploy]$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
1e6a7e50831d apache/airflow:2.3.4 "/usr/bin/dumb-init …" 26 minutes ago Up 26 minutes (healthy) 0.0.0.0:5555->5555/tcp, :::5555->5555/tcp, 8080/tcp deploy_flower_1
9afb5985b9f3 apache/airflow:2.3.4 "/usr/bin/dumb-init …" 27 minutes ago Up 27 minutes (healthy) 0.0.0.0:8080->8080/tcp, :::8080->8080/tcp deploy_airflow-webserver_1
80132177ae3d apache/airflow:2.3.4 "/usr/bin/dumb-init …" 27 minutes ago Up 27 minutes (healthy) 8080/tcp deploy_airflow-triggerer_1
6ea5a0ed7dec apache/airflow:2.3.4 "/usr/bin/dumb-init …" 27 minutes ago Up 27 minutes (healthy) 8080/tcp deploy_airflow-scheduler_1
2787acb189ad mysql:8.0.27 "docker-entrypoint.s…" 29 minutes ago Up 29 minutes (healthy) 0.0.0.0:3306->3306/tcp, :::3306->3306/tcp, 33060/tcp deploy_mysql_1
057af26f6070 redis:latest "docker-entrypoint.s…" 29 minutes ago Up 29 minutes (healthy) 0.0.0.0:6379->6379/tcp, :::6379->6379/tcp deploy_redis_1
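As the log above suggests, the 403 typically means the webserver and the worker disagree on the webserver secret_key (or their clocks are out of sync). A hedged sketch of aligning the key, reusing the env-file pattern from the accepted answer; the SECRET_KEY variable name is an assumption:

```yaml
# Both the main node's and the worker's compose files carry the same value,
# loaded from an identical .env present on each host:
x-airflow-common: &airflow-common
  environment: &airflow-common-env
    AIRFLOW__WEBSERVER__SECRET_KEY: ${SECRET_KEY}
  env_file:
    - .env
```

Also make sure time is synchronized on all machines (for example with ntpd), as the error message says, since the log-fetch token is time-based.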