
In Python's Airflow, how can I stop a task from running after a certain time?

I'm trying to use Python's Airflow library. I want it to scrape a web page periodically.

The issue I'm having is that if my start_date is several days ago, when I start the scheduler it will backfill from the start_date to today. For example:

Assume today is the 20th of the month.

Assume the start_date is the 15th of this month.

If I start the scheduler on the 20th, it will scrape the page 5 times on the 20th. It will see that a DAG instance was supposed to run on the 15th, and will run that DAG instance (the one for the 15th) on the 20th. Then it will run the DAG instance for the 16th on the 20th, and so on.

In short, Airflow will try to "catch up", but this doesn't make sense for web scraping.
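To make the catch-up arithmetic concrete, here is a small sketch in plain Python (no Airflow required); `daily_runs` is a hypothetical helper, not an Airflow API, and it mirrors Airflow's convention that the run for a given day only executes once that day's interval has ended, so the 20th's own run is not yet in the backlog:

```python
from datetime import date, timedelta

def daily_runs(start_date, today):
    """All dates a daily DAG is scheduled for between start_date
    and today (exclusive) -- i.e. the backlog of runs the scheduler
    will try to "catch up" on when it is first started."""
    runs = []
    d = start_date
    while d < today:
        runs.append(d)
        d += timedelta(days=1)
    return runs

# start_date on the 15th, scheduler first started on the 20th:
backlog = daily_runs(date(2024, 1, 15), date(2024, 1, 20))
# five scheduled dates (15th through 19th), all executed on the 20th
```

With `start_date` equal to today the backlog is empty, which is why the answers below suggest moving `start_date` forward.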

Is there any way to make Airflow consider a DAG instance failed after a certain time?

This feature is on the roadmap for Airflow, but does not currently exist.

See: Issue #1155

You may be able to hack together a solution using BranchPythonOperator. As the documentation says, make sure you have depends_on_past=False set (this is the default). I do not have Airflow set up, so I can't test and provide you example code at this time.
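A sketch of the branching callable such a hack might use. Only the date comparison is real logic; the task ids, the MAX_DELAY cutoff, and the wiring comment at the bottom are assumptions I have not verified against a live Airflow install:

```python
from datetime import datetime, timedelta

# Hypothetical cutoff: a scheduled run older than this is "too late".
MAX_DELAY = timedelta(hours=12)

def choose_branch(execution_date, now=None, **_):
    """Pick the downstream task id: scrape only when the scheduled
    run is recent, otherwise route to a do-nothing task so stale
    catch-up runs are effectively skipped.  Extra **_ swallows the
    context kwargs Airflow passes with provide_context=True."""
    now = now or datetime.utcnow()
    if now - execution_date > MAX_DELAY:
        return "skip_scrape"  # stale catch-up run: do nothing
    return "do_scrape"        # fresh run: actually scrape

# Inside the DAG this would be wired up roughly as:
#   branch = BranchPythonOperator(task_id="branch",
#                                 python_callable=choose_branch,
#                                 provide_context=True)
#   branch >> [do_scrape_task, skip_scrape_task]
```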

Airflow was designed with backfilling in mind, so that roadmap item goes against its primary logic.

For now you can update the start_date for this specific task, or for the whole DAG.

Every operator has a start_date: http://pythonhosted.org/airflow/code.html#baseoperator

The scheduler is not made to be stopped and restarted. If you are starting it today, setting your task's start_date to today seems logical to me.
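As a config-style sketch of that suggestion, a DAG file along these lines (Airflow 1.x import paths; the dag_id, schedule, and scrape_page body are placeholders) starts fresh with essentially no backlog:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def scrape_page():
    # placeholder for the actual scraping logic
    pass

dag = DAG(
    dag_id="scrape_daily",
    start_date=datetime(2024, 1, 20),  # set this to today's date when deploying
    schedule_interval="@daily",
)

scrape = PythonOperator(
    task_id="scrape",
    python_callable=scrape_page,
    dag=dag,
)
```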
