In Python's Airflow, how can I stop a task from running after a certain time?
I'm trying to use Python's Airflow library to scrape a web page periodically.
The issue I'm having is that if my start_date is several days in the past, the scheduler will backfill from the start_date up to today as soon as I start it. For example:
Assume today is the 20th of the month.
Assume the start_date is the 15th of this month.
If I start the scheduler on the 20th, it will scrape the page 5 times on the 20th. It will see that a DAG instance was supposed to run on the 15th, and will run that DAG instance (the one for the 15th) on the 20th. Then it will run the DAG instance for the 16th on the 20th, and so on.
In short, Airflow will try to "catch up", but this doesn't make sense for web scraping, because every backfilled run would fetch the same current version of the page.
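The arithmetic behind the "5 times" can be sketched in plain Python. This is only an illustration of the example above, not Airflow's actual scheduling code: it assumes a daily schedule, and the helper name `pending_execution_dates` as well as the month and year are made up for the demonstration.

```python
from datetime import date, timedelta


def pending_execution_dates(start_date, today, interval=timedelta(days=1)):
    """List the schedule dates from start_date up to, but not including, today."""
    dates = []
    current = start_date
    while current < today:
        dates.append(current)
        current += interval
    return dates


# Only the day numbers (15th and 20th) come from the question;
# the month and year are arbitrary placeholders.
backlog = pending_execution_dates(date(2016, 3, 15), date(2016, 3, 20))

print(len(backlog))  # 5
for run_date in backlog:
    print(run_date.isoformat())  # 2016-03-15 through 2016-03-19
```

Each date in `backlog` corresponds to one DAG instance that would be run on the 20th.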
Is there any way to make Airflow consider a DAG instance failed after a certain time?
This feature is on the roadmap for Airflow, but does not currently exist.
See: Issue #1155
You may be able to hack together a solution using BranchPythonOperator. As the documentation says, make sure depends_on_past=False is set (this is the default). I do not have Airflow set up, so I can't test this or provide example code at this time.
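As a rough idea of what the branching decision could look like, here is a minimal sketch of a function that might serve as the python_callable for a BranchPythonOperator, which follows whichever task_id its callable returns. It has not been run against an Airflow installation either. The task ids `scrape_page` and `skip_scrape`, the `MAX_AGE` threshold, and the function name are all hypothetical, and the wiring that passes the run's execution date in from the task context is not shown.

```python
from datetime import datetime, timedelta

# Hypothetical threshold. A daily run normally starts about one day after
# its execution date, so two days leaves room for the regular run.
MAX_AGE = timedelta(days=2)


def choose_branch(execution_date, now, max_age=MAX_AGE):
    """Return the task_id to follow: scrape for recent runs, skip for stale ones."""
    if now - execution_date > max_age:
        return "skip_scrape"  # hypothetical id of a no-op task
    return "scrape_page"      # hypothetical id of the real scraping task


# Scheduler started shortly after midnight on the 20th (placeholder month/year).
now = datetime(2016, 3, 20, 0, 5)
print(choose_branch(datetime(2016, 3, 15), now))  # skip_scrape
print(choose_branch(datetime(2016, 3, 19), now))  # scrape_page
```

With this approach the backfilled DAG instances still run, but they take the skip branch instead of scraping the page again.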
Airflow was designed with backfilling in mind, so the roadmap item goes against its primary logic.
For now you can update the start_date for this specific task or for the whole DAG.
Every operator has a start_date: http://pythonhosted.org/airflow/code.html#baseoperator
The scheduler is not meant to be stopped. If you run it today, you can set your task's start_date to today, which seems logical to me.
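A minimal sketch of that suggestion, assuming the common pattern of collecting operator arguments in a `default_args` dictionary that is then passed to the DAG. The date and the name `DEPLOY_DAY` are placeholders for the day the scheduler is started. A fixed date is used here rather than something like datetime.now(), on the assumption that a value which changes every time the DAG file is parsed would confuse the scheduler.

```python
from datetime import datetime

# Placeholder: the day the scheduler is first started for this DAG.
DEPLOY_DAY = datetime(2016, 3, 20)

# Arguments applied to every task in the DAG unless a task overrides them.
default_args = {
    "owner": "airflow",
    "start_date": DEPLOY_DAY,   # no earlier dates, so nothing to backfill
    "depends_on_past": False,   # the default, stated explicitly
}

print(default_args["start_date"].date())  # 2016-03-20
```

A single task can still be given its own start_date to override the shared value.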