
Airflow XCOM pull and push for a BigQueryInsertJobOperator and BigQueryOperator

I am very new to Airflow and I am trying to create a DAG based on the requirement below.

  • Task 1 - Run a BigQuery query to get a value which I need to push to the 2nd task in the DAG
  • Task 2 - Use the value from the above query to run another query and export the data into a Google Cloud Storage bucket.

I have read other answers related to this, and I understand we cannot use xcom_pull or xcom_push in a BigQuery operator in Airflow. So what I am doing is using a PythonOperator, where I can use Jinja template variables by passing "provide_context=True".

Below is a snippet of my code. It is just task 1, where I want to do "task_instance.xcom_push" in order to see the value in Airflow under the task's XCom logs.

def get_bq_operator(dag, task_id, configuration, table_params=None, trigger_rule='all_success'):
    bq_operator = BigQueryInsertJobOperator(
        task_id=task_id,
        configuration=configuration,
        gcp_conn_id=gcp_connection_id,
        dag=dag,
        params=table_params,
        trigger_rule=trigger_rule,
        task_instance.xcom_push(key='yr_wk', value=yr_wk),  # <-- the highlighted line
    )
    return bq_operator


def get_bq_wm_yr_wk():
    get_bq_operator(dag,app_name,bigquery_util.get_bq_job_configuration(
                                             bq_query,
                                             query_params=None))

get_wm_yr_wk = PythonOperator(task_id='get_wm_yr_wk',
                                        python_callable=get_bq_wm_yr_wk,
                                        provide_context=True,
                                        on_failure_callback=failure_callback,
                                        on_retry_callback=failure_callback,
                                        dag=dag)

"bq_query" is the one I am passing the sql file which has my query and the query returns the value of yr_wk which I need to use in my 2nd task. “bq_query”是我传递 sql 文件的那个,它有我的查询,查询返回我需要在第二个任务中使用的 yr_wk 的值。

The highlighted task_instance.xcom_push(key='yr_wk', value=yr_wk) in get_bq_operator is failing, and the error I am getting is below:

raise KeyError(f'Variable {key} does not exist')

KeyError: 'Variable ei_migration_hour does not exist'

If I comment out the line above, the DAG runs fine. However, how do I validate the value of yr_wk? I want to push it so that I can view the value in the logs.
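
For context, xcom_push is a method on the TaskInstance object that Airflow passes into the python_callable at runtime, so the call has to happen inside the callable rather than in the operator's constructor, where no task instance exists yet. A minimal sketch of that pattern (the yr_wk value here is a hard-coded placeholder):

def get_bq_wm_yr_wk(**context):
    # the BigQuery query would run here; a placeholder stands in for its result
    yr_wk = '202240'
    # push the value so it is visible under the task's XCom tab and in the logs
    context['task_instance'].xcom_push(key='yr_wk', value=yr_wk)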

I do not fully understand your code :), but if you want to do something with the results of a BigQuery query, then a far better way to approach it is to use BigQueryHook in your Python callable.

Operators in Airflow are usually thin wrappers around hooks that provide a "complete" task (for example, you can use one to run an update operation), but if you want to do something with the result, and you are already doing it via a PythonOperator, it is far better to use the hook directly, as you do not take on all the assumptions that operators make in their execute method.

In your case it should be something like the following (I am using the new TaskFlow syntax here, which is the preferred way to do this kind of operation; see https://airflow.apache.org/docs/apache-airflow/stable/tutorial_taskflow_api.html for the TaskFlow API tutorial. Especially in Airflow 2, it became the de-facto default way of writing tasks):

@task(.....)
def my_task():
    hook = BigQueryHook(....)  # initialize it with the right parameters
    result = hook.run(sql='YOUR_QUERY', ...)  # add other necessary params
    processed_result = process_result(result)  # do something with the result
    return processed_result
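
To make the sketch concrete, a runnable version of the first task might look like this (the connection id, table, and column names are placeholders; get_first comes from the common DB-API hook interface that BigQueryHook implements):

from airflow.decorators import task
from airflow.providers.google.cloud.hooks.bigquery import BigQueryHook

@task
def get_yr_wk():
    # use_legacy_sql=False so the query runs as standard SQL
    hook = BigQueryHook(gcp_conn_id='google_cloud_default', use_legacy_sql=False)
    # get_first returns the first row of the result set as a tuple (or None)
    row = hook.get_first('SELECT yr_wk FROM `my-project.my_dataset.calendar` LIMIT 1')
    return row[0] if row else None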

This way you do not even have to run xcom_push (the TaskFlow API will do it for you automatically), and other tasks will be able to use the value by just doing:

@task
def next_task(input):
    pass

And then:

result = my_task()
next_task(result)

Then all the XCom push/pull will be handled for you automatically via TaskFlow.
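
Putting both requirements together, an end-to-end sketch could look like the following. It assumes the second query is materialized into a staging table and then exported to GCS via a BigQuery extract job submitted through BigQueryHook.insert_job; all project, dataset, table, and bucket names are placeholders:

import pendulum

from airflow.decorators import dag, task
from airflow.providers.google.cloud.hooks.bigquery import BigQueryHook

@dag(schedule_interval=None, start_date=pendulum.datetime(2022, 1, 1), catchup=False)
def yr_wk_export():

    @task
    def get_yr_wk():
        hook = BigQueryHook(gcp_conn_id='google_cloud_default', use_legacy_sql=False)
        row = hook.get_first('SELECT yr_wk FROM `my-project.my_dataset.calendar` LIMIT 1')
        return row[0]

    @task
    def export_to_gcs(yr_wk):
        hook = BigQueryHook(gcp_conn_id='google_cloud_default', use_legacy_sql=False)
        # materialize the second query into a staging table
        hook.insert_job(configuration={
            'query': {
                'query': f"SELECT * FROM `my-project.my_dataset.sales` WHERE yr_wk = '{yr_wk}'",
                'useLegacySql': False,
                'destinationTable': {
                    'projectId': 'my-project',
                    'datasetId': 'my_dataset',
                    'tableId': 'sales_export_staging',
                },
                'writeDisposition': 'WRITE_TRUNCATE',
            },
        })
        # export the staging table to the bucket as CSV
        hook.insert_job(configuration={
            'extract': {
                'sourceTable': {
                    'projectId': 'my-project',
                    'datasetId': 'my_dataset',
                    'tableId': 'sales_export_staging',
                },
                'destinationUris': [f'gs://my-bucket/sales_{yr_wk}.csv'],
                'destinationFormat': 'CSV',
            },
        })

    # passing the return value wires the XCom pull automatically
    export_to_gcs(get_yr_wk())

dag = yr_wk_export()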
