Airflow DAG to apply to multiple BigQuery tables in the same dataset
I have a BigQuery dataset with multiple tables; call them base or source tables. An external application appends data to these base tables, some periodically, some sporadically.

I want an Airflow DAG that queries each source_table daily and inserts the result of that query into its counterpart bulk table (named source_table + '_bulk') within the same BigQuery dataset, based on a formula that applies universally to all of them. The SQL file contains a fixed query with a placeholder for the source_table.
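For illustration, such a templated sql_file.sql could rely on Airflow's built-in Jinja templating rather than sed-style substitution; the params names and the WHERE clause below are hypothetical stand-ins for the actual formula:

```sql
-- sql_file.sql (hypothetical sketch; Airflow fills {{ params.source_table }}
-- when the operator is created with params={'source_table': ...})
SELECT *
FROM `{{ params.project_id }}.{{ params.dataset }}.{{ params.source_table }}`
WHERE DATE(ingest_time) = '{{ ds }}'
```

The sql argument of BigQueryOperator is a templated field, so passing the .sql filename renders these placeholders at run time.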
My DAG looks like this:
projectId = os.environ["GCP_PROJECT"]
dataset = <target-dataset>

dag = DAG(...)

selectInsertOp = BigQueryOperator(
    ...
    sql=<sed_the_source_table_placeholder('sql_file.sql')>,
    ...
    destination_dataset_table=source_table + '_bulk',
    create_disposition='CREATE_IF_NEEDED',
    write_disposition='WRITE_APPEND',
    dag=dag
)

selectInsertOp
As the source_tables are numerous (hundreds of them), how can I implement this without repeating the DAG file (and the corresponding SQL file)? I want this single DAG file to create multiple BigQueryOperator tasks.
You can wrap the operator in a for loop like this, provided you have the list of tables:
for source_table in table_list:
    selectInsertOp = BigQueryOperator(
        ...
        sql=<sed_the_source_table_placeholder('sql_file.sql')>,
        ...
        task_id='select_insert_' + source_table,  # task_ids must be unique per table
        destination_dataset_table=source_table + '_bulk',
        create_disposition='CREATE_IF_NEEDED',
        write_disposition='WRITE_APPEND',
        dag=dag
    )
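To make the fan-out concrete, here is a minimal Airflow-free sketch of the per-table logic. The helper name build_task_specs, the table names, and the inline SQL template are all hypothetical; in the real DAG, each resulting spec would supply the keyword arguments for one BigQueryOperator inside the loop above:

```python
# Sketch: compute per-table task ids, rendered SQL, and destination tables.
# Each returned dict maps onto BigQueryOperator keyword arguments.

SQL_TEMPLATE = (
    "SELECT * FROM `{project}.{dataset}.{source_table}` "
    "WHERE DATE(ingest_time) = @run_date"  # stand-in for the universal formula
)

def build_task_specs(project, dataset, table_list):
    """Return one operator-ready spec per source table."""
    specs = []
    for source_table in table_list:
        specs.append({
            # Unique task_id per table, as Airflow requires.
            "task_id": "select_insert_" + source_table,
            "sql": SQL_TEMPLATE.format(
                project=project, dataset=dataset, source_table=source_table
            ),
            "destination_dataset_table": (
                f"{project}.{dataset}.{source_table}_bulk"
            ),
        })
    return specs

specs = build_task_specs("my-project", "my_dataset", ["orders", "events"])
print(specs[0]["destination_dataset_table"])
# my-project.my_dataset.orders_bulk
```

Because the table list drives everything, adding a new source table to table_list (or fetching the list dynamically from the dataset) creates its bulk-load task automatically, with no duplicated DAG or SQL files.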