
Use Google Cloud SQL or MongoDB as an input for Dataflow/Dataproc

I am planning to build a serverless data pipeline on Google Cloud Platform. My plan is to use Dataflow or Dataproc for batch processing of data from three different sources.

My input sources are:

  1. Cloud SQL (MySQL)
  2. Cloud SQL (PostgreSQL)
  3. MongoDB

But after reading their documentation, I found that neither offers a built-in input connector for Cloud SQL or MongoDB.

I have also checked their custom-source section, but custom sources are only supported in Java, and I am planning to use Python.

Does anyone have an idea of how I can ingest these three different sources with Dataflow/Dataproc?

In your situation, I think the best option is Dataproc, since your workload is batch processing.

This way you can use Hadoop or Spark, and you have more control over the workflow.

You can use Python code with Spark. {1}
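On Dataproc, a Python script is just a PySpark job you submit to the cluster. As a rough sketch (the cluster name, region, and script name below are placeholders, not values from your setup):

```shell
# Submit a PySpark script to an existing Dataproc cluster.
# "my-cluster", "us-central1", and pipeline.py are placeholders.
gcloud dataproc jobs submit pyspark pipeline.py \
    --cluster=my-cluster \
    --region=us-central1
```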

You can do SQL queries with Spark. {2}
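Because Spark reads both MySQL and PostgreSQL over JDBC, one pattern covers both of your Cloud SQL instances. A minimal PySpark sketch, assuming placeholder hosts, credentials, and table names (the matching JDBC driver jar must be available on the cluster):

```python
# Sketch: load a Cloud SQL (MySQL or PostgreSQL) table into a Spark
# DataFrame over JDBC. All hosts/credentials below are placeholders.

def cloudsql_jdbc_url(host, port, database, engine="mysql"):
    """Build the JDBC URL for a Cloud SQL instance.

    `engine` is "mysql" or "postgresql", matching the driver on the cluster.
    """
    return f"jdbc:{engine}://{host}:{port}/{database}"

def read_cloudsql_table(spark, url, table, user, password):
    """Load one table as a DataFrame; `spark` is a pyspark.sql.SparkSession."""
    return (spark.read.format("jdbc")
            .option("url", url)
            .option("dbtable", table)
            .option("user", user)
            .option("password", password)
            .load())
```

Once loaded, you can register the DataFrame as a temporary view (`df.createOrReplaceTempView("orders")`) and run plain SQL on it with `spark.sql(...)`.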

There is also a connector for MongoDB and Spark. {3}
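With that connector on Spark's classpath, reading MongoDB looks much like the JDBC case. A hedged sketch with placeholder host, database, and collection names (the exact data-source short name can vary by connector version):

```python
# Sketch: read a MongoDB collection with the MongoDB Spark connector.
# The connector jar must be available to Spark; names are placeholders.

def mongo_uri(host, database, collection, port=27017):
    """Build a mongodb:// connection URI pointing at one collection."""
    return f"mongodb://{host}:{port}/{database}.{collection}"

def read_mongo_collection(spark, uri):
    """Load the collection as a DataFrame; `spark` is a SparkSession
    created with the MongoDB Spark connector on its classpath.
    ("mongo" is the connector's short data-source name; check your
    connector version's docs for the exact format string.)"""
    return (spark.read.format("mongo")
            .option("uri", uri)
            .load())
```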

And a connector for MongoDB and Hadoop. {4}

{1}: https://spark.apache.org/docs/0.9.0/python-programming-guide.html

{2}: https://spark.apache.org/docs/latest/sql-programming-guide.html

{3}: https://docs.mongodb.com/spark-connector/master/

{4}: https://docs.mongodb.com/ecosystem/tools/hadoop/
