Airflow ExternalTaskSensor operator causing MySQL InnoDB deadlock
I use the ExternalTaskSensor operator in Airflow to manage dependencies between DAGs. My ExternalTaskSensor code looks like this:
from datetime import datetime, timedelta

from airflow import DAG
from airflow.sensors.external_task import ExternalTaskSensor

dag = DAG(
    dag_id='sushi.batch.load.application.detail.1d',
    default_args=InitConf.getArgs(start_date=datetime(2021, 12, 9)),
    description='Load Application Detail Data',
    schedule_interval='00 */3 * * *',
    tags=['sushi', 'develop']
)
monitor_handleApplicationData = ExternalTaskSensor(
    task_id='wait_for_application_handle_end_detail',
    execution_date_fn=lambda dt: dt + timedelta(minutes=35),
    external_dag_id='sushi.batch.handle.application.1d',
    external_task_id='application_handle_end',
    timeout=7200,
    allowed_states=['success'],
    mode='reschedule',
    poke_interval=60,
    check_existence=True,
    dag=dag,
)
The sensor's running mode is reschedule: the sensor takes up a worker slot only while it is checking, and sleeps for a set duration between checks.
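As a side-by-side sketch of the two sensor modes (the DAG and task names here are illustrative placeholders, not from my DAGs, and the Airflow 2.x import path is assumed):

```python
from airflow.sensors.external_task import ExternalTaskSensor

# Shared settings; the external_* names below are hypothetical.
common = dict(
    external_dag_id='some_upstream_dag',
    external_task_id='some_upstream_task',
    poke_interval=60,
    timeout=7200,
    allowed_states=['success'],
)

# mode='poke': the sensor holds its worker slot for the whole wait,
# sleeping in-process for poke_interval seconds between checks.
wait_poke = ExternalTaskSensor(task_id='wait_poke', mode='poke', **common)

# mode='reschedule': the slot is released between checks; the scheduler
# re-queues the task for each check, which also means a task_instance
# state UPDATE in the metadata DB roughly every poke_interval seconds.
wait_resched = ExternalTaskSensor(
    task_id='wait_resched', mode='reschedule', **common
)
```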
But I found that the Airflow scheduler sometimes crashed because of a MySQL InnoDB deadlock, so I had to restart the scheduler often. Here are some logs I collected from the Airflow scheduler Docker container:
sqlalchemy.exc.OperationalError: (MySQLdb._exceptions.OperationalError) (1213, 'Deadlock found when trying to get lock; try restarting transaction')
[SQL: UPDATE task_instance SET external_executor_id=%s WHERE task_instance.task_id = %s AND task_instance.dag_id = %s AND task_instance.execution_date = %s]
[parameters: (('2b14b7a2-46ef-4ec1-b16b-5f6b1f0610d2', 'wait_for_application_handle_end_detail', 'sushi.batch.load.application.detail.1d', datetime.datetime(2022, 5, 20, 0, 0)), ('4e878253-f0dd-4465-a0d1-39dbc444b882', 'wait_for_application_handle_end_dict', 'sushi.batch.application.dict.handle.1d', datetime.datetime(2022, 5, 20, 0, 0)), ('4bafb4a2-c614-41e0-bd1b-5c47dd5334aa', 'wait_for_application_handle_end_dict_test', 'sushi.batch.application.dict.handle.test.1d', datetime.datetime(2022, 5, 20, 0, 0)))]
They show that one UPDATE statement caused the deadlock; I'll call it SQL 1:
UPDATE task_instance SET external_executor_id='2b14b7a2-46ef-4ec1-b16b-5f6b1f0610d2'
WHERE task_instance.task_id = 'wait_for_application_handle_end_detail'
AND task_instance.dag_id = 'sushi.batch.load.application.detail.1d'
AND task_instance.execution_date = datetime.datetime(2022, 5, 20, 0, 0)
Here's the MySQL task_instance table schema:
[screenshot of the task_instance table schema]
The primary key is composite: task_id, dag_id, execution_date. On an UPDATE, the InnoDB engine first locks the rows that satisfy the task_id condition, so a deadlock is indeed possible when two tasks in two different DAGs share the same task_id. But my dag_id and task_id values are unique across all DAGs and tasks, so there is no reason for a deadlock. I then checked the MySQL transaction log and found another UPDATE statement, which I'll call SQL 2:
UPDATE task_instance SET state='scheduled'
WHERE task_instance.dag_id='sushi.batch.load.application.detail.1d'
AND task_instance.execution_date='2022-05-20 00:00:00'
AND task_instance.task_id IN ('wait_for_application_handle_end_detail')
Now I think I know why the deadlock happened: SQL 1 and SQL 2 may execute at the same time, and the task_id in both is wait_for_application_handle_end_detail. I know why SQL 2 was executed: my ExternalTaskSensor runs in reschedule mode with a poke interval of 60s, so SQL 2 executes every 60 seconds to change the task's current state. But I don't know why SQL 1 was executed. What is external_executor_id used for?
I know that changing the ExternalTaskSensor running mode to poke might solve this problem, but then the sensor takes up a worker slot for its entire runtime. Is there any other solution?
Both of those UPDATEs will run faster with this composite index:
INDEX(dag_id, execution_date, task_id)
By being indexed and running faster, most (or maybe all) of the deadlocks will be prevented.
Even so, you should replay the query if it does encounter a deadlock.
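A replay can be as simple as catching the deadlock error and retrying with a short backoff. A minimal sketch, using a stand-in exception (in a real deployment you would catch MySQLdb's OperationalError and check for error code 1213 instead):

```python
import random
import time

class DeadlockError(Exception):
    """Stand-in for MySQLdb OperationalError with code 1213."""

def run_with_retry(execute, max_retries=3, base_delay=0.05):
    """Replay a statement when it is rolled back as a deadlock victim.

    `execute` is any callable that issues the statement and returns
    its result; it is re-invoked after each deadlock, up to max_retries
    times, with exponential backoff plus jitter between attempts.
    """
    for attempt in range(max_retries + 1):
        try:
            return execute()
        except DeadlockError:
            if attempt == max_retries:
                raise
            # back off briefly so the replays do not collide again
            time.sleep(base_delay * (2 ** attempt) * (1 + random.random()))

# Usage sketch: fail twice with a "deadlock", then succeed on the replay.
calls = {'n': 0}
def fake_update():
    calls['n'] += 1
    if calls['n'] < 3:
        raise DeadlockError
    return 'ok'

result = run_with_retry(fake_update)
print(result)  # ok
```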
Do you have any "transactions"? (E.g., with BEGIN and COMMIT?)