Airflow using ExternalTaskSensor Operator caused MySQL InnoDB deadlock

I use the ExternalTaskSensor operator in Airflow to manage dependencies between DAGs. My ExternalTaskSensor code looks like this:

from datetime import datetime, timedelta

from airflow import DAG
from airflow.sensors.external_task import ExternalTaskSensor  # Airflow 2.x import path

dag = DAG(
    dag_id='sushi.batch.load.application.detail.1d',
    default_args=InitConf.getArgs(start_date=datetime(2021, 12, 9)),  # InitConf is a project-specific helper
    description='Load Application Detail Data',
    schedule_interval='00 */3 * * *',
    tags=['sushi', 'develop']
)

monitor_handleApplicationData = ExternalTaskSensor(
    task_id='wait_for_application_handle_end_detail',
    execution_date_fn=lambda dt: dt + timedelta(minutes=35),
    external_dag_id='sushi.batch.handle.application.1d',
    external_task_id='application_handle_end',
    timeout=7200,
    allowed_states=['success'],
    mode='reschedule',
    poke_interval=60,
    check_existence=True,
    dag=dag,
)

The sensor's running mode is reschedule: the sensor takes up a worker slot only while it is checking, and sleeps for the configured interval between checks.

But I found that the Airflow scheduler sometimes crashed because of a MySQL InnoDB deadlock, so I had to restart the scheduler often. Here is some log output I collected from the Airflow scheduler's docker container:

sqlalchemy.exc.OperationalError: (MySQLdb._exceptions.OperationalError) (1213, 'Deadlock found when trying to get lock; try restarting transaction')
[SQL: UPDATE task_instance SET external_executor_id=%s WHERE task_instance.task_id = %s AND task_instance.dag_id = %s AND task_instance.execution_date = %s]
[parameters: (('2b14b7a2-46ef-4ec1-b16b-5f6b1f0610d2', 'wait_for_application_handle_end_detail', 'sushi.batch.load.application.detail.1d', datetime.datetime(2022, 5, 20, 0, 0)), ('4e878253-f0dd-4465-a0d1-39dbc444b882', 'wait_for_application_handle_end_dict', 'sushi.batch.application.dict.handle.1d', datetime.datetime(2022, 5, 20, 0, 0)), ('4bafb4a2-c614-41e0-bd1b-5c47dd5334aa', 'wait_for_application_handle_end_dict_test', 'sushi.batch.application.dict.handle.test.1d', datetime.datetime(2022, 5, 20, 0, 0)))]

It shows that one UPDATE statement caused the deadlock; I call it SQL 1:

UPDATE task_instance SET external_executor_id='2b14b7a2-46ef-4ec1-b16b-5f6b1f0610d2'
       WHERE task_instance.task_id = 'wait_for_application_handle_end_detail'
       AND task_instance.dag_id = 'sushi.batch.load.application.detail.1d'
       AND task_instance.execution_date = datetime.datetime(2022, 5, 20, 0, 0)

Here's the relevant part of the MySQL task_instance table schema: the primary key is the composite (task_id, dag_id, execution_date).

When updating, the InnoDB engine first locks the rows that satisfy the task_id condition, so a deadlock is indeed possible if two tasks in two different DAGs share the same task_id. But my dag_id and task_id values are unique across all DAGs and tasks, so there was no obvious reason for a deadlock. So I checked the MySQL transaction log and found another UPDATE statement, which I call SQL 2:

UPDATE task_instance SET state='scheduled'
       WHERE task_instance.dag_id = 'sushi.batch.load.application.detail.1d'
       AND task_instance.execution_date = '2022-05-20 00:00:00'
       AND task_instance.task_id IN ('wait_for_application_handle_end_detail')

Now I think I know why the deadlock happened: SQL 1 and SQL 2 might execute at the same time, and the task_id in both is wait_for_application_handle_end_detail. I know why SQL 2 was executed: my ExternalTaskSensor's running mode is reschedule and its poke interval is 60s, so SQL 2 runs every 60 seconds to change the task's current state. But I don't know why SQL 1 was executed. What is external_executor_id used for?
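
As an illustration of that suspicion, here is a minimal standalone sketch of the classic InnoDB deadlock, where two transactions lock the same rows in opposite order and each ends up waiting on the other. The table t and the connection credentials are hypothetical and none of this is Airflow code; one of the two threads gets exactly the 1213 error shown above.

import threading
import time

import MySQLdb

def worker(first_id, second_id):
    # Hypothetical local test database; table created with:
    #   CREATE TABLE t (id INT PRIMARY KEY, v INT);
    #   INSERT INTO t VALUES (1, 0), (2, 0);
    conn = MySQLdb.connect(host="localhost", user="root", passwd="secret", db="test")
    cur = conn.cursor()
    # autocommit is off by default, so this UPDATE opens a transaction
    # and holds an exclusive record lock on row `first_id` until commit
    cur.execute("UPDATE t SET v = v + 1 WHERE id = %s", (first_id,))
    time.sleep(1)  # give the other thread time to lock its first row
    try:
        # each thread now waits for the lock the other one holds
        cur.execute("UPDATE t SET v = v + 1 WHERE id = %s", (second_id,))
        conn.commit()
    except MySQLdb.OperationalError as exc:
        # InnoDB picks one transaction as the victim and raises:
        # (1213, 'Deadlock found when trying to get lock; try restarting transaction')
        print("deadlock victim got:", exc)
        conn.rollback()
    finally:
        conn.close()

a = threading.Thread(target=worker, args=(1, 2))
b = threading.Thread(target=worker, args=(2, 1))
a.start(); b.start()
a.join(); b.join()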

I know that changing the running mode of the ExternalTaskSensor to poke might solve this problem, but then the sensor takes up a worker slot for its entire runtime. Is there any other solution?

Both of those UPDATEs will run faster with this composite index:

ALTER TABLE task_instance ADD INDEX (dag_id, execution_date, task_id);

By being indexed and running faster, most (or maybe all) of the deadlocks will be prevented.

Even so, you should replay the query if it does encounter a deadlock.
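
For example, a minimal retry wrapper could look like the sketch below. The engine DSN, the retry limit, and the backoff are assumptions for illustration.

import time

from sqlalchemy import create_engine, text
from sqlalchemy.exc import OperationalError

engine = create_engine("mysql+mysqldb://user:pass@localhost/airflow")  # hypothetical DSN

MAX_RETRIES = 3

def execute_with_deadlock_retry(sql, params):
    """Run one statement per transaction, replaying it if MySQL reports a deadlock (error 1213)."""
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            with engine.begin() as conn:  # commits on success, rolls back on error
                conn.execute(text(sql), params)
            return
        except OperationalError as exc:
            # MySQLdb puts the server error code in exc.orig.args[0]
            if exc.orig.args[0] == 1213 and attempt < MAX_RETRIES:
                time.sleep(0.5 * attempt)  # brief backoff before replaying
                continue
            raise

# e.g. replaying the question's SQL 1:
execute_with_deadlock_retry(
    "UPDATE task_instance SET external_executor_id = :eid "
    "WHERE task_id = :tid AND dag_id = :did AND execution_date = :ed",
    {"eid": "2b14b7a2-46ef-4ec1-b16b-5f6b1f0610d2",
     "tid": "wait_for_application_handle_end_detail",
     "did": "sushi.batch.load.application.detail.1d",
     "ed": "2022-05-20 00:00:00"},
)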

Do you have any "transactions"? (E.g., with BEGIN and COMMIT?)
