
Transfer data from BigQuery to Amazon S3 using Airflow

How do I send data from BigQuery to Amazon S3 using Airflow operators?

What are the operators that I need to use? I am stuck in the middle of the process.

Here is my code so far

from airflow.providers.google.cloud.operators.bigquery import (
    BigQueryCheckOperator,
    BigQueryGetDataOperator,
)

bq_check_date = BigQueryCheckOperator(
    task_id='bq_check_date',
    sql='''
    SELECT
    *
    FROM
    `myproject.test.test_table`
    ''',
    use_legacy_sql=False,
    bigquery_conn_id=GCP_CONN_ID,
    dag=dag,
)
get_data = BigQueryGetDataOperator(
    task_id="get_data",
    dataset_id=BQ_DATASET,
    table_id=BQ_TABLE,
    location=LOCATION,
    bigquery_conn_id=GCP_CONN_ID,
    dag=dag,
)

What next? Any ideas? Thank you very much!

There are usually two ways:

  1. You find an existing service transferring the data and use it. For example, GCP has such a service, called BigQuery Data Transfer Service, which can transfer (for example) S3 to BigQuery (but not the other way round), and you can orchestrate it from Airflow using existing operators or BashOperator + gcloud. I am not aware of any AWS service like that (usually you'd look for such services delivered by the "target" of your data transfer, as they are more interested in getting more data in than in sending it out). If you find such a service, you might write your custom operator using it. Writing an operator is surprisingly easy if there is no ready operator available, or you can write a Bash or Python operator script doing so.

  2. In case there is no "service" you can tap into, you need a so-called "Transfer" operator. It usually works by creating two Hooks (for source and target) and passing data between them - either via streaming, or by extracting data from the source and storing it locally as a first step and uploading it in a second step. Those two Hooks need to be used in a single Operator execution so that they can share the local data or stream. I do not think there is an existing BigQuery to S3 operator, but - again - you can easily write your own custom operator that will use BigQueryHook and S3Hook and pass the data between those two (see the sketch after this list).
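
As a minimal sketch of what such a custom transfer operator could look like - assuming the BigQueryHook and S3Hook from the Google and Amazon provider packages; the operator name, constructor arguments and CSV serialisation below are illustrative, this is not an existing Airflow operator:

    from airflow.models import BaseOperator
    from airflow.providers.amazon.aws.hooks.s3 import S3Hook
    from airflow.providers.google.cloud.hooks.bigquery import BigQueryHook


    class BigQueryToS3Operator(BaseOperator):
        """Illustrative transfer operator: BigQuery query results -> one S3 key."""

        def __init__(self, *, sql, s3_bucket, s3_key,
                     gcp_conn_id="google_cloud_default",
                     aws_conn_id="aws_default", **kwargs):
            super().__init__(**kwargs)
            self.sql = sql
            self.s3_bucket = s3_bucket
            self.s3_key = s3_key
            self.gcp_conn_id = gcp_conn_id
            self.aws_conn_id = aws_conn_id

        def execute(self, context):
            # Source hook: run the query and pull the result into memory.
            bq_hook = BigQueryHook(gcp_conn_id=self.gcp_conn_id, use_legacy_sql=False)
            df = bq_hook.get_pandas_df(sql=self.sql)

            # Target hook: upload the serialised result to S3.
            s3_hook = S3Hook(aws_conn_id=self.aws_conn_id)
            s3_hook.load_string(
                string_data=df.to_csv(index=False),
                key=self.s3_key,
                bucket_name=self.s3_bucket,
                replace=True,
            )

Both hooks live inside a single execute(), which is exactly the pattern described above; for large results you would stream or export in chunks instead of materialising the whole result as a DataFrame in memory.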

For the second case, you can take some existing operators in Airflow as an example (they are usually in "providers", and the transfer operators are usually in a "transfers" package). Example here: https://github.com/apache/airflow/blob/main/airflow/providers/amazon/aws/transfers/gcs_to_s3.py . If you look there, there is a lot of documentation, some boilerplate code and handling of multiple files, but the gist of it is:

    file_bytes = hook.download(object_name=file, bucket_name=self.bucket)
    s3_hook.load_bytes(cast(bytes, file_bytes), key=dest_key, replace=self.replace, acl_policy=self.s3_acl_policy)

As you see, in this case the GCS file is downloaded locally (to memory in this case) and uploaded to S3 using hooks.

Hooks are a very important entity in Airflow - they encapsulate authentication and provide a simple API to services, so writing operators when you already have hooks is very easy.

I know it's been a long time since the question was asked, but this may help others in the future.

In the same context as Jarek described, you can start from the SqlToS3Operator, which is really close to what you want to achieve, in order to build your own operator.

You need to change the _get_hook method to return a BigQueryHook, just by initialising a BigQueryHook object directly (don't use the get_connection and get_hook methods). Also, for the get_pandas_df call inside the execute method, you may need to adjust the dialect used for your queries: I had problems with the default legacy SQL and needed to change it to standard. A rough sketch of this approach is shown below.
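
Assuming the SqlToS3Operator from the Amazon provider package, a minimal illustrative subclass could look like this - the class name is made up, and _get_hook / sql_conn_id are used as described above:

    from airflow.providers.amazon.aws.transfers.sql_to_s3 import SqlToS3Operator
    from airflow.providers.google.cloud.hooks.bigquery import BigQueryHook


    class BigQueryToS3SqlOperator(SqlToS3Operator):
        def _get_hook(self):
            # Build the BigQueryHook directly (no get_connection()/get_hook())
            # and switch off legacy SQL so standard-SQL queries work.
            return BigQueryHook(gcp_conn_id=self.sql_conn_id, use_legacy_sql=False)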
