
Airflow - how can I get data from a BigQuery table and use it as a list?

I'm trying to get a column from a BigQuery table, then use its values to create file names.

I've tried the following, which should create a CSV named after the first value in the specified column. The list is empty, though, when I try to use it:

bq_data = []
get_data = BigQueryGetDataOperator(
    task_id='get_data_from_bq',
    dataset_id='SK22',
    table_id='current_times',
    max_results='100',
    selected_fields='current_timestamps',
)


def process_data_from_bq(**kwargs):
    ti = kwargs['ti']
    global bq_data
    bq_data = ti.xcom_pull(task_ids='get_data_from_bq')


process_data = PythonOperator(
        task_id='process_data_from_bq',
        python_callable=process_data_from_bq,
        provide_context=True)
run_export = BigQueryToCloudStorageOperator(
        task_id=f"save_data_on_storage{str(bq_data[0])}",
        source_project_dataset_table="a-data-set",
        destination_cloud_storage_uris=[f"gs://europe-west1-airflow-bucket/data/test{bq_data[0]}.csv"],
        export_format="CSV",
        field_delimiter=",",
        print_header=False,
        dag=dag,
    )

get_data >> process_data >> run_export

I think there is no need to use a PythonOperator between BigQueryGetDataOperator and BigQueryToCloudStorageOperator; you can use an XCom pull directly in BigQueryToCloudStorageOperator:

get_data = BigQueryGetDataOperator(
    task_id='get_data_from_bq',
    dataset_id='SK22',
    table_id='current_times',
    max_results='100',
    selected_fields='current_timestamps',
)

run_export = BigQueryToCloudStorageOperator(
        task_id="save_data_on_storage",
        source_project_dataset_table="a-data-set",
        destination_cloud_storage_uris=[f"gs://europe-west1-airflow-bucket/data/test" + "{{ ti.xcom_pull(task_ids='get_data_from_bq')[0] }}" + ".csv"],
        export_format="CSV",
        field_delimiter=",",
        print_header=False,
        dag=dag,
    )

get_data >> run_export

destination_cloud_storage_uris is a templated param, so you can pass Jinja template syntax inside it.
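
If you want to confirm that for the version you have installed, a quick check (a sketch, assuming the provider-package operator) is to inspect the operator's template_fields:

from airflow.providers.google.cloud.transfers.bigquery_to_gcs import BigQueryToGCSOperator

# 'destination_cloud_storage_uris' should appear in this tuple, which means
# Jinja expressions in that param are rendered at task run time.
print(BigQueryToGCSOperator.template_fields)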

I haven't tested the syntax, but it should work.

I also don't recommend using a global variable like bq_data to pass data between operators, because it doesn't work: top-level DAG code runs when the scheduler parses the file, before any task has executed, so bq_data is still empty when run_export is instantiated. You need to use XCom directly in the operator (through a Jinja template, or through the operator's current Context).
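
To illustrate the parse-time vs run-time difference (a minimal sketch, assuming Airflow 2-style task context injection):

bq_data = []  # evaluated when the scheduler parses the DAG file, before any task runs

def process_data_from_bq(ti):
    # xcom_pull only returns data while the task is running; writing it to a
    # module-level variable cannot affect operators already built at parse time.
    rows = ti.xcom_pull(task_ids='get_data_from_bq')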

I also noticed that you are not using the latest Airflow operators; the contrib-era BigQueryToCloudStorageOperator, for example, has been superseded by BigQueryToGCSOperator in the Google provider package.
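
The current import paths (assuming the apache-airflow-providers-google package is installed) would be:

from airflow.providers.google.cloud.operators.bigquery import BigQueryGetDataOperator
from airflow.providers.google.cloud.transfers.bigquery_to_gcs import BigQueryToGCSOperator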

If you want to use the whole list provided by the BigQueryGetDataOperator and compute a list of destination URIs from it, I propose another solution:

from typing import List, Dict

from airflow.providers.google.cloud.transfers.bigquery_to_gcs import BigQueryToGCSOperator

class CustomBigQueryToGCSOperator(BigQueryToGCSOperator):

    def __init__(self, **kwargs) -> None:
        super().__init__(**kwargs)

    def execute(self, context):
        # At run time, pull the rows returned by the upstream
        # BigQueryGetDataOperator task from XCom.
        task_instance = context['task_instance']
        data_from_bq: List[Dict] = task_instance.xcom_pull('get_data_from_bq')

        # Compute one destination GCS URI per row.
        destination_cloud_storage_uris: List[str] = list(map(self.to_destination_cloud_storage_uris, data_from_bq))

        # Override the field before delegating to the parent operator,
        # which performs the actual export.
        self.destination_cloud_storage_uris = destination_cloud_storage_uris

        super().execute(context)

    def to_destination_cloud_storage_uris(self, data_from_bq: Dict) -> str:
        return f"gs://europe-west1-airflow-bucket/data/test{data_from_bq['your_field']}.csv"

Some explanations:

  • I created a custom operator that extends BigQueryToGCSOperator
  • In the execute method, I have access to the current context of the operator
  • From the context, I can retrieve the list from BQ provided by the BigQueryGetDataOperator. I assume it's a list of Dicts, but you have to confirm this
  • I compute a list of destination GCS URIs from this list of Dicts (see the sketch after this list)
  • I assign the computed destination GCS URIs to the corresponding field in the operator
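
As an illustration of that mapping (the row shape and the 'your_field' key are assumptions you would confirm against the real XCom value):

data_from_bq = [{'your_field': 'a'}, {'your_field': 'b'}]  # assumed shape
uris = [f"gs://europe-west1-airflow-bucket/data/test{row['your_field']}.csv"
        for row in data_from_bq]
# uris == ['gs://europe-west1-airflow-bucket/data/testa.csv',
#          'gs://europe-west1-airflow-bucket/data/testb.csv']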

The pro of this solution is that you have more flexibility to apply logic based on the XCom value.

The con is that it's a little verbose.
