
Google Cloud Platform Data ETL Batch Processing: Cloud Function Dataflow

I recently joined a new company as a data engineer, and I am building batch ETL pipelines on Google Cloud Platform (GCP). My team's data scientist recently passed me a data model (a .py file written in Python 3.6).

The data model has a main function that I can call to get a dataframe as output, and I intend to append this dataframe to a BigQuery table. Is there any way I can simply import this main function and integrate it into a pipeline using Apache Beam (Dataflow), without having to re-code the data model as a PTransform? Or would I be better off using Cloud Scheduler and Cloud Functions to achieve what I want?

I am a complete beginner with Dataflow and Apache Beam, so any help or links to guides would be greatly appreciated!

Instead of converting to a CSV file and loading it into BigQuery (which is a longer and potentially costlier process), you could take advantage of BigQuery's built-in integration with pandas:

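import pandas

# Small example dataframe; in practice this would be the output of the data model's main function.
df = pandas.DataFrame(
    {
        'my_string': ['a', 'b', 'c'],
        'my_int64': [1, 2, 3],
        'my_float64': [4.0, 5.0, 6.0],
    }
)
full_table_id = 'my_dataset.new_table'  # destination in dataset.table form
project_id = 'my-project-id'

# Requires the pandas-gbq package to be installed.
df.to_gbq(full_table_id, project_id=project_id)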

In terms of pipeline orchestration, I personally like Apache Airflow, which is available on GCP as a managed service through Cloud Composer.

Edit: check out the extensive docs for df.to_gbq for more control over how you are loading your dataframe.
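Since the question is about appending to an existing table, one option worth knowing is if_exists: its default is 'fail', which raises an error when the table already exists. Below is a minimal sketch, assuming the pandas-gbq package is installed and credentials are configured. It calls pandas_gbq.to_gbq directly, because recent pandas releases deprecate DataFrame.to_gbq in favour of it. The function name append_dataframe and its writer parameter are hypothetical; writer exists only so the function can be exercised without touching BigQuery.

```python
def append_dataframe(df, full_table_id, project_id, writer=None):
    """Append df to an existing BigQuery table (hypothetical helper)."""
    if writer is None:
        # Imported lazily so this module loads even where pandas-gbq is absent.
        import pandas_gbq
        writer = pandas_gbq.to_gbq
    # if_exists="append" adds rows to the table instead of failing when it exists.
    writer(df, full_table_id, project_id=project_id, if_exists="append")
```

With the dataframe from the example above, the call would be append_dataframe(df, 'my_dataset.new_table', 'my-project-id').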

If you have a dataframe, the easiest way is to convert it to CSV and then load it into BigQuery with a load job.

Don't forget to set job_config.write_disposition = 'WRITE_APPEND' so the data is appended to the existing table. The other load parameters, such as schema autodetection, the CSV separator, and the number of leading rows to skip, can also help you get the load right.

Run it in a Cloud Function or on Cloud Run, triggered by Cloud Scheduler. It works well!
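The load job described here could look roughly like the sketch below, assuming the google-cloud-bigquery package is installed and the CSV text was produced with df.to_csv(index=False), so that it has a single header row. The names CSV_APPEND_OPTIONS and load_csv_append are hypothetical, and the option values are illustrative rather than required.

```python
import io

# Load options mentioned in the answer, kept as plain data (hypothetical name).
CSV_APPEND_OPTIONS = {
    "source_format": "CSV",
    "write_disposition": "WRITE_APPEND",  # append to the existing table
    "autodetect": True,                   # let BigQuery infer the schema
    "skip_leading_rows": 1,               # skip the header row written by to_csv
    "field_delimiter": ",",               # CSV separator
}


def load_csv_append(csv_text, table_id):
    """Run a CSV load job that appends csv_text to table_id; returns rows loaded."""
    # Imported lazily; requires the google-cloud-bigquery package.
    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(**CSV_APPEND_OPTIONS)
    job = client.load_table_from_file(
        io.BytesIO(csv_text.encode("utf-8")), table_id, job_config=job_config
    )
    job.result()  # block until the load job finishes, raising on failure
    return job.output_rows
```

A call might look like load_csv_append(df.to_csv(index=False), 'my-project-id.my_dataset.new_table').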
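As a rough sketch of the scheduled-function approach, the HTTP entry point below could be deployed as a Cloud Function and called by a Cloud Scheduler job. Several names are assumptions: data_model is a hypothetical module name for the data scientist's .py file, and handle_request, run_pipeline, and TABLE_ID are made up for illustration. For brevity it uses load_table_from_dataframe from google-cloud-bigquery (which needs pyarrow) rather than an intermediate CSV; it is still a load job with WRITE_APPEND.

```python
TABLE_ID = "my-project-id.my_dataset.new_table"  # hypothetical destination


def run_pipeline(build_dataframe, write_dataframe):
    """Call the model, hand its dataframe to the writer, return the row count."""
    df = build_dataframe()
    write_dataframe(df)
    return len(df)


def handle_request(request):
    """HTTP entry point; Cloud Scheduler triggers it with an HTTP call."""
    # Imported lazily; data_model is a hypothetical name for the model's module.
    from data_model import main
    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(write_disposition="WRITE_APPEND")

    def write(df):
        # Load job straight from the dataframe; wait for it to complete.
        client.load_table_from_dataframe(df, TABLE_ID, job_config=job_config).result()

    rows = run_pipeline(main, write)
    return f"appended {rows} rows", 200
```

Keeping run_pipeline free of GCP imports means the model-plus-writer wiring can be checked locally with stand-in functions before deploying.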
