
Use Google Cloud SQL or MongoDB as an input for Dataflow/Dataproc

I am planning to build a serverless data pipeline on Google Cloud Platform. My plan is to use Dataflow or Dataproc for batch processing of data from three different sources.

My input sources are:

  1. Cloud SQL (MySQL)
  2. Cloud SQL (PostgreSQL)
  3. MongoDB

But after reading their documentation, I found that neither offers a built-in input connector for Cloud SQL or MongoDB.

I have also checked their custom-source section, but custom sources are only supported in Java, and I am planning to use Python.

Does anyone have an idea of how I can ingest these three different sources with Dataflow/Dataproc?

In your situation, I think the best option is Dataproc, since your workload is batch processing.

This way you can use Hadoop or Spark, and you have more control over the workflow.

You can use Python code with Spark. {1}
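On Dataproc, a Python script is just a PySpark job you submit to the cluster. As a rough sketch (the cluster name, region, and script name below are placeholders, not values from your setup):

```shell
# Submit a PySpark script to an existing Dataproc cluster.
# "my-cluster", "us-central1", and pipeline.py are placeholders.
gcloud dataproc jobs submit pyspark pipeline.py \
    --cluster=my-cluster \
    --region=us-central1
```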

You can do SQL queries with Spark. {2}
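Because Spark reads both MySQL and PostgreSQL over JDBC, one pattern covers both of your Cloud SQL instances. A minimal PySpark sketch, assuming placeholder hosts, credentials, and table names (the matching JDBC driver jar must be available on the cluster):

```python
# Sketch: load a Cloud SQL (MySQL or PostgreSQL) table into a Spark
# DataFrame over JDBC. All hosts/credentials below are placeholders.

def cloudsql_jdbc_url(host, port, database, engine="mysql"):
    """Build the JDBC URL for a Cloud SQL instance.

    `engine` is "mysql" or "postgresql", matching the driver on the cluster.
    """
    return f"jdbc:{engine}://{host}:{port}/{database}"

def read_cloudsql_table(spark, url, table, user, password):
    """Load one table as a DataFrame; `spark` is a pyspark.sql.SparkSession."""
    return (spark.read.format("jdbc")
            .option("url", url)
            .option("dbtable", table)
            .option("user", user)
            .option("password", password)
            .load())
```

Once loaded, you can register the DataFrame as a temporary view (`df.createOrReplaceTempView("orders")`) and run plain SQL on it with `spark.sql(...)`.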

There is also a connector for MongoDB and Spark. {3}
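With that connector on Spark's classpath, reading MongoDB looks much like the JDBC case. A hedged sketch with placeholder host, database, and collection names (the exact data-source short name can vary by connector version):

```python
# Sketch: read a MongoDB collection with the MongoDB Spark connector.
# The connector jar must be available to Spark; names are placeholders.

def mongo_uri(host, database, collection, port=27017):
    """Build a mongodb:// connection URI pointing at one collection."""
    return f"mongodb://{host}:{port}/{database}.{collection}"

def read_mongo_collection(spark, uri):
    """Load the collection as a DataFrame; `spark` is a SparkSession
    created with the MongoDB Spark connector on its classpath.
    ("mongo" is the connector's short data-source name; check your
    connector version's docs for the exact format string.)"""
    return (spark.read.format("mongo")
            .option("uri", uri)
            .load())
```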

And a connector for MongoDB and Hadoop. {4}

{1}: https://spark.apache.org/docs/0.9.0/python-programming-guide.html

{2}: https://spark.apache.org/docs/latest/sql-programming-guide.html

{3}: https://docs.mongodb.com/spark-connector/master/

{4}: https://docs.mongodb.com/ecosystem/tools/hadoop/
