Can I use Google Dataflow with native Python?

Original 2021-08-23 13:51:26 5 2 python/ google-cloud-dataflow/ apache-beam/ serverless

I'm trying to build a Python ETL pipeline on Google Cloud, and Google Cloud Dataflow seemed like a good option. From the documentation and the developer guides, I see that Apache Beam is always tied to Dataflow, since Dataflow is based on it. I'm concerned that I may run into issues processing my dataframes in Apache Beam.

My questions are:

If I want to build my ETL script in native Python with Dataflow, is that possible? Or is it necessary to use Apache Beam for my ETL?
If Dataflow was built solely for use with Apache Beam, is there any other serverless Google Cloud tool for building a Python ETL? (Google Cloud Functions has a 9-minute execution limit, which may cause issues for my pipeline; I want to avoid any execution limit.)

My pipeline reads data from BigQuery, processes it, and saves it back to a BigQuery table. I may call some external APIs inside my script.

2 answers

Concerning your first question, it looks like Dataflow was primarily written to be used with the Apache Beam SDK, as can be checked in the official Google Cloud documentation on Dataflow. So using Apache Beam for your ETL is likely a requirement.

Regarding your second question, this tutorial gives guidance on how to build your own ETL pipeline with Python and Google Cloud Functions, which are serverless. Could you please confirm whether this link has helped you?

Regarding your first question, Dataflow requires Apache Beam. In fact, before Apache Beam there was something called the Dataflow SDK, which was proprietary to Google and was later open-sourced as Apache Beam.

The Python Beam SDK is fairly easy once you put a bit of effort into it, and the main processing operations you'd need are very close to native Python.
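To illustrate how close this is to native Python, here is a minimal sketch. The record fields (`name`, `amount`) and the helper names are made up for the example. The processing logic is ordinary Python functions; Beam only wraps them with `beam.Map` and `beam.Filter`. It assumes `apache-beam` is installed and runs locally on the default DirectRunner.

```python
def clean_row(row: dict) -> dict:
    """Plain Python: normalise one record (field names are hypothetical)."""
    return {
        "name": row["name"].strip().title(),
        "amount": float(row["amount"]),
    }


def is_valid(row: dict) -> bool:
    """Plain Python: keep only records with a non-negative amount."""
    return row["amount"] >= 0


def run_with_beam(rows):
    """Apply the same functions inside a Beam pipeline.

    Beam is imported lazily so the functions above stay usable
    (and unit-testable) without Beam installed.
    """
    import apache_beam as beam

    with beam.Pipeline() as pipeline:  # DirectRunner unless options say otherwise
        (
            pipeline
            | "Create" >> beam.Create(rows)
            | "Clean" >> beam.Map(clean_row)
            | "KeepValid" >> beam.Filter(is_valid)
            | "Print" >> beam.Map(print)
        )


if __name__ == "__main__":
    run_with_beam([
        {"name": "  alice ", "amount": "3.5"},
        {"name": "bob", "amount": "-1"},
    ])
```

Because `clean_row` and `is_valid` know nothing about Beam, you can test them like any other Python code and reuse them outside the pipeline.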

If your end goal is to read, process and write to BQ, I'd say Beam + Dataflow is a good match.
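As a rough sketch of that read, process, write flow (not a definitive implementation): the project, region, bucket, table names, columns and the `add_tax` step below are all placeholders I made up, so swap in your own. It uses `beam.io.ReadFromBigQuery` and `beam.io.WriteToBigQuery` from the Python SDK and assumes `apache-beam[gcp]` is installed and your credentials are set up.

```python
def add_tax(row: dict, rate: float = 0.2) -> dict:
    """Plain-Python processing step (hypothetical business logic)."""
    out = dict(row)
    out["amount_with_tax"] = round(row["amount"] * (1 + rate), 2)
    return out


def dataflow_flags(project: str, region: str, bucket: str) -> list:
    """Build the command-line style flags that select the Dataflow runner."""
    return [
        "--runner=DataflowRunner",
        f"--project={project}",
        f"--region={region}",
        f"--temp_location=gs://{bucket}/tmp",
    ]


def run(project, region, bucket, source_table, sink_table):
    """Read from BigQuery, transform each row, write back to BigQuery.

    source_table: 'project.dataset.table' (standard SQL form, placeholder)
    sink_table:   'project:dataset.table' (placeholder)
    """
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(dataflow_flags(project, region, bucket))
    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | "ReadFromBQ" >> beam.io.ReadFromBigQuery(
                query=f"SELECT name, amount FROM `{source_table}`",
                use_standard_sql=True,
            )
            | "AddTax" >> beam.Map(add_tax)  # rows arrive as Python dicts
            | "WriteToBQ" >> beam.io.WriteToBigQuery(
                sink_table,
                schema="name:STRING,amount:FLOAT,amount_with_tax:FLOAT",
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
                write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
            )
        )
```

A call to an external API would go in a function like `add_tax` (or in a `DoFn` if you need per-worker setup), since each step receives and returns ordinary Python objects.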
