
Airflow - how can I get data from a BigQuery table and use it as a list?

I'm trying to get a column from a BigQuery table and use its values to create file names.

I've tried the following, which should create a CSV named after the first value in the specified column. However, when I try to use the list, it is reported as empty.

bq_data = []
get_data = BigQueryGetDataOperator(
    task_id='get_data_from_bq',
    dataset_id='SK22',
    table_id='current_times',
    max_results='100',
    selected_fields='current_timestamps',
)


def process_data_from_bq(**kwargs):
    ti = kwargs['ti']
    global bq_data
    bq_data = ti.xcom_pull(task_ids='get_data_from_bq')


process_data = PythonOperator(
        task_id='process_data_from_bq',
        python_callable=process_data_from_bq,
        provide_context=True)
run_export = BigQueryToCloudStorageOperator(
        task_id=f"save_data_on_storage{str(bq_data[0])}",
        source_project_dataset_table="a-data-set",
        destination_cloud_storage_uris=[f"gs://europe-west1-airflow-bucket/data/test{bq_data[0]}.csv"],
        export_format="CSV",
        field_delimiter=",",
        print_header=False,
        dag=dag,
    )

get_data >> process_data >> run_export

I don't think you need a PythonOperator between BigQueryGetDataOperator and BigQueryToCloudStorageOperator; you can pull from XCom directly in BigQueryToCloudStorageOperator:

get_data = BigQueryGetDataOperator(
    task_id='get_data_from_bq',
    dataset_id='SK22',
    table_id='current_times',
    max_results='100',
    selected_fields='current_timestamps',
)

run_export = BigQueryToCloudStorageOperator(
        task_id="save_data_on_storage",
        source_project_dataset_table="a-data-set",
        destination_cloud_storage_uris=["gs://europe-west1-airflow-bucket/data/test{{ ti.xcom_pull(task_ids='get_data_from_bq')[0] }}.csv"],
        export_format="CSV",
        field_delimiter=",",
        print_header=False,
        dag=dag,
    )

get_data >> run_export

destination_cloud_storage_uris is a templated parameter, so you can pass Jinja template syntax in it.

I haven't tested the syntax, but it should work.
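
One thing worth checking when you test it is the shape of the pulled value. The sketch below is plain Python with no Airflow import, and it assumes (please confirm on your version) that BigQueryGetDataOperator pushes a list of rows where each row is itself a list of field values. The sample timestamps are made up. Under that assumption, indexing once gives a whole row rather than a single value:

```python
# Hypothetical sample of what 'get_data_from_bq' might push to XCom:
# one inner list per row, one element per selected field.
rows = [["2023-01-01 10:00:00"], ["2023-01-02 10:00:00"]]

prefix = "gs://europe-west1-airflow-bucket/data/test"

# Indexing once yields the whole first row (a list), whose string form
# includes the brackets and quotes.
uri_from_row = f"{prefix}{rows[0]}.csv"

# Indexing twice yields the first field of the first row.
uri_from_value = f"{prefix}{rows[0][0]}.csv"

print(uri_from_row)
print(uri_from_value)
```

If your rows do have that shape, the template would probably need `[0][0]` instead of `[0]` to produce a clean file name.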

I also don't recommend using a global variable like bq_data to pass data between operators, because it doesn't work. You need to use XCom directly in the operator, either through a Jinja template or by accessing the operator's current context.
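
To illustrate why the global fails, here is a minimal stand-in written in plain Python, with no Airflow involved. The function build_task_id is hypothetical and only mimics the f-string from the question. The point is ordering: the operator arguments are evaluated while the DAG file is being parsed, whereas the callable that fills bq_data only runs later as a task (and typically in a different process):

```python
bq_data = []  # module-level value, still empty while the DAG file is parsed


def build_task_id():
    # Mirrors f"save_data_on_storage{bq_data[0]}", evaluated at parse time.
    return f"save_data_on_storage{bq_data[0]}"


def process_data_from_bq(pulled_rows):
    # Would run later, inside a task. Even then, the assignment does not
    # reach the code that already constructed the operators.
    global bq_data
    bq_data = pulled_rows


try:
    task_id = build_task_id()
except IndexError:
    # The list is empty at the moment the operators are constructed.
    task_id = None

print(task_id)
```

Note that process_data_from_bq is never called before build_task_id here, which matches what happens when the DAG file is parsed.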

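I also noticed that you are not using the latest Airflow operators. For example, BigQueryToGCSOperator from airflow.providers.google.cloud.transfers.bigquery_to_gcs is the newer counterpart of BigQueryToCloudStorageOperator.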

If you want to use the whole list returned by BigQueryGetDataOperator and compute a list of destination URIs from it, here is another solution:

from typing import Dict, List

from airflow.providers.google.cloud.transfers.bigquery_to_gcs import BigQueryToGCSOperator


class CustomBigQueryToGCSOperator(BigQueryToGCSOperator):
    """BigQueryToGCSOperator that derives its destination URIs from XCom."""

    def execute(self, context):
        # Pull the rows pushed by the upstream 'get_data_from_bq' task.
        task_instance = context['task_instance']
        data_from_bq: List[Dict] = task_instance.xcom_pull(task_ids='get_data_from_bq')

        # Build one destination URI per row and override the operator's field
        # before delegating to the parent implementation.
        self.destination_cloud_storage_uris = [
            self.to_destination_cloud_storage_uris(row) for row in data_from_bq
        ]

        return super().execute(context)

    def to_destination_cloud_storage_uris(self, data_from_bq: Dict) -> str:
        # 'your_field' is a placeholder for the column name you selected.
        return f"gs://europe-west1-airflow-bucket/data/test{data_from_bq['your_field']}.csv"

Some explanations:

  • I created a custom operator that extends BigQueryToGCSOperator
  • In the execute method, I have access to the current context of the operator
  • From the context, I can retrieve the list from BQ provided by the BigQueryGetDataOperator . I assume it's a list of Dict but you have to confirm this
  • I calculate a list of destination GCS URIs from this list of Dict
  • I assign the calculated destination GCS URIs to the corresponding field in the operator
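
Since the row type is an assumption, the URI-building step could be written so that it tolerates both shapes. This is only a sketch: row_to_uri, rows_to_uris, and the position argument are hypothetical names, and the default field name is taken from the selected_fields value in the question.

```python
from typing import Any, Dict, List, Sequence, Union

# A row may be a dict keyed by field name or a plain list of values.
Row = Union[Dict[str, Any], Sequence[Any]]


def row_to_uri(row: Row, field: str = "current_timestamps", position: int = 0) -> str:
    """Build one destination URI from a row that is either a dict or a list."""
    value = row[field] if isinstance(row, dict) else row[position]
    return f"gs://europe-west1-airflow-bucket/data/test{value}.csv"


def rows_to_uris(rows: List[Row]) -> List[str]:
    """Build one destination URI per row."""
    return [row_to_uri(row) for row in rows]
```

Inside the custom operator, the list comprehension in execute could call such a helper instead of indexing the row by key directly.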

The advantage of this solution is that you have more flexibility to apply logic based on the XCom value.

The downside is that it is a little verbose.
