I am currently working on a DAG that performs the same tasks for different datasets within a single DAG definition. The list of datasets and their schemas needs to be read from some configuration. I have ideas about where to store the configuration, but I can't figure out how to read it and then iterate over tasks based on the result of this configuration.
My code currently looks like this:
from airflow.models import Variable
from airflow.operators.python import PythonOperator

# Configuration read logic will be implemented here; working with Variable during the test phase.
def _read_config(ti):
    Variable.set("table_list", ["first", "second", "third"], serialize_json=True)

# The actual processing logic will be implemented here.
def _processing(table):
    print("Current table:")
    print(table)

def processing(table):
    return PythonOperator(
        task_id=f"processing_{table}",
        python_callable=_processing,
        op_kwargs={
            "table": table
        }
    )

def _scheduling():
    for table in Variable.get("table_list", deserialize_json=True):
        processing(table)

read_config = PythonOperator(
    task_id='read_config',
    python_callable=_read_config
)

scheduling = PythonOperator(
    task_id='scheduling',
    python_callable=_scheduling
)

read_config >> scheduling
But that results in the following graph:
What I want to achieve is this:
Is there a proper way to do this in Airflow? I am open to other suggestions; the only important thing is to perform these 3 steps properly.
Generating tasks dynamically in this way is not recommended in Airflow. Instead, since version 2.3.0 Airflow provides a clean way to do it: Dynamic Task Mapping.
from airflow import DAG
from airflow.decorators import task

with DAG(dag_id="dag id", start_date=...) as dag:

    @task
    def read_config():
        # here load the config from variables, db, S3/GCS file, ...
        return ["first", "second", "third"]

    @task
    def processing(table: str):
        # implement your processing function
        print(f"Current table: {table}")

    config = read_config()
    processing.expand(table=config)
Airflow will store the list returned by the first task as an XCom; the second task will pull it and run one mapped instance per element of the list. In the UI you will see a single task with a number of mapped instances, but don't worry: if you have multiple workers, the instances will be executed in parallel on different workers (depending on your parallelism configuration).