Airflow using ExternalTaskSensor Operator caused MySQL InnoDB deadlock

I use the ExternalTaskSensor operator in Airflow to manage dependencies between DAGs. My ExternalTaskSensor code looks like this:

from datetime import datetime, timedelta

from airflow import DAG
from airflow.sensors.external_task import ExternalTaskSensor  # Airflow 2.x import path

dag = DAG(
    dag_id='sushi.batch.load.application.detail.1d',
    default_args=InitConf.getArgs(start_date=datetime(2021, 12, 9)),  # InitConf is a project-specific helper
    description='Load Application Detail Data',
    schedule_interval='00 */3 * * *',
    tags=['sushi', 'develop']
)

monitor_handleApplicationData = ExternalTaskSensor(
    task_id='wait_for_application_handle_end_detail',
    execution_date_fn=lambda dt: dt + timedelta(minutes=35),
    external_dag_id='sushi.batch.handle.application.1d',
    external_task_id='application_handle_end',
    timeout=7200,
    allowed_states=['success'],
    mode='reschedule',
    poke_interval=60,
    check_existence=True,
    dag=dag,
)

The sensor's running mode is reschedule: the sensor takes up a worker slot only while it is checking, and sleeps for the configured interval between checks.

But I found that the Airflow scheduler sometimes crashed because of a MySQL InnoDB deadlock, so I had to restart the scheduler often. Here is some log output I collected from the Airflow scheduler's docker container:

sqlalchemy.exc.OperationalError: (MySQLdb._exceptions.OperationalError) (1213, 'Deadlock found when trying to get lock; try restarting transaction')
[SQL: UPDATE task_instance SET external_executor_id=%s WHERE task_instance.task_id = %s AND task_instance.dag_id = %s AND task_instance.execution_date = %s]
[parameters: (('2b14b7a2-46ef-4ec1-b16b-5f6b1f0610d2', 'wait_for_application_handle_end_detail', 'sushi.batch.load.application.detail.1d', datetime.datetime(2022, 5, 20, 0, 0)), ('4e878253-f0dd-4465-a0d1-39dbc444b882', 'wait_for_application_handle_end_dict', 'sushi.batch.application.dict.handle.1d', datetime.datetime(2022, 5, 20, 0, 0)), ('4bafb4a2-c614-41e0-bd1b-5c47dd5334aa', 'wait_for_application_handle_end_dict_test', 'sushi.batch.application.dict.handle.test.1d', datetime.datetime(2022, 5, 20, 0, 0)))]

It shows that one UPDATE statement caused the deadlock; I call it SQL 1:

UPDATE task_instance SET external_executor_id='2b14b7a2-46ef-4ec1-b16b-5f6b1f0610d2'
       WHERE task_instance.task_id = 'wait_for_application_handle_end_detail'
       AND task_instance.dag_id = 'sushi.batch.load.application.detail.1d'
       AND task_instance.execution_date = datetime.datetime(2022, 5, 20, 0, 0)

Here's the relevant part of the MySQL task_instance table schema: the primary key is the composite (task_id, dag_id, execution_date).

When updating, the InnoDB engine first locks the rows that satisfy the task_id condition, so a deadlock is indeed possible if two tasks in two different DAGs share the same task_id. But my dag_id and task_id values are unique across all DAGs and tasks, so there was no obvious reason for a deadlock. So I checked the MySQL transaction log and found another UPDATE statement, which I call SQL 2:

UPDATE task_instance SET state='scheduled'
       WHERE task_instance.dag_id = 'sushi.batch.load.application.detail.1d'
       AND task_instance.execution_date = '2022-05-20 00:00:00'
       AND task_instance.task_id IN ('wait_for_application_handle_end_detail')

Now I think I know why the deadlock happened: SQL 1 and SQL 2 might execute at the same time, and the task_id in both is wait_for_application_handle_end_detail. I know why SQL 2 was executed: my ExternalTaskSensor's running mode is reschedule and its poke interval is 60s, so SQL 2 runs every 60 seconds to change the task's current state. But I don't know why SQL 1 was executed. What is external_executor_id used for?
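
As an illustration of that suspicion, here is a minimal standalone sketch of the classic InnoDB deadlock, where two transactions lock the same rows in opposite order and each ends up waiting on the other. The table t and the connection credentials are hypothetical and none of this is Airflow code; one of the two threads gets exactly the 1213 error shown above.

import threading
import time

import MySQLdb

def worker(first_id, second_id):
    # Hypothetical local test database; table created with:
    #   CREATE TABLE t (id INT PRIMARY KEY, v INT);
    #   INSERT INTO t VALUES (1, 0), (2, 0);
    conn = MySQLdb.connect(host="localhost", user="root", passwd="secret", db="test")
    cur = conn.cursor()
    # autocommit is off by default, so this UPDATE opens a transaction
    # and holds an exclusive record lock on row `first_id` until commit
    cur.execute("UPDATE t SET v = v + 1 WHERE id = %s", (first_id,))
    time.sleep(1)  # give the other thread time to lock its first row
    try:
        # each thread now waits for the lock the other one holds
        cur.execute("UPDATE t SET v = v + 1 WHERE id = %s", (second_id,))
        conn.commit()
    except MySQLdb.OperationalError as exc:
        # InnoDB picks one transaction as the victim and raises:
        # (1213, 'Deadlock found when trying to get lock; try restarting transaction')
        print("deadlock victim got:", exc)
        conn.rollback()
    finally:
        conn.close()

a = threading.Thread(target=worker, args=(1, 2))
b = threading.Thread(target=worker, args=(2, 1))
a.start(); b.start()
a.join(); b.join()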

I know that changing the running mode of the ExternalTaskSensor to poke might solve this problem, but then the sensor takes up a worker slot for its entire runtime. Is there any other solution?

Both of those UPDATEs will run faster with this composite index:

ALTER TABLE task_instance ADD INDEX (dag_id, execution_date, task_id);

By being indexed and running faster, most (or maybe all) of the deadlocks will be prevented.

Even so, you should replay the query if it does encounter a deadlock.
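
For example, a minimal retry wrapper could look like the sketch below. The engine DSN, the retry limit, and the backoff are assumptions for illustration.

import time

from sqlalchemy import create_engine, text
from sqlalchemy.exc import OperationalError

engine = create_engine("mysql+mysqldb://user:pass@localhost/airflow")  # hypothetical DSN

MAX_RETRIES = 3

def execute_with_deadlock_retry(sql, params):
    """Run one statement per transaction, replaying it if MySQL reports a deadlock (error 1213)."""
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            with engine.begin() as conn:  # commits on success, rolls back on error
                conn.execute(text(sql), params)
            return
        except OperationalError as exc:
            # MySQLdb puts the server error code in exc.orig.args[0]
            if exc.orig.args[0] == 1213 and attempt < MAX_RETRIES:
                time.sleep(0.5 * attempt)  # brief backoff before replaying
                continue
            raise

# e.g. replaying the question's SQL 1:
execute_with_deadlock_retry(
    "UPDATE task_instance SET external_executor_id = :eid "
    "WHERE task_id = :tid AND dag_id = :did AND execution_date = :ed",
    {"eid": "2b14b7a2-46ef-4ec1-b16b-5f6b1f0610d2",
     "tid": "wait_for_application_handle_end_detail",
     "did": "sushi.batch.load.application.detail.1d",
     "ed": "2022-05-20 00:00:00"},
)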

Do you have any "transactions"? (E.g., with BEGIN and COMMIT?)
