I am planning to build a serverless data pipeline on Google Cloud Platform. My plan is to use Dataflow or Dataproc for batch processing of data from three different sources.
My input sources are:
However, after reading the documentation, I found that they don't have a built-in input connector for Cloud SQL or MongoDB.
I also checked the custom driver section, but it only covers Java, and I am planning to use Python.
Does anyone have an idea of how I can ingest these three different sources with Dataflow/Dataproc?
In your situation, I think the best option is Dataproc, since this is batch processing. That way you can use Hadoop or Spark, and you have more control over the workflow.
You can use Python code with Spark. {1}
You can do SQL queries with Spark. {2}
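As a rough sketch of both points, the PySpark job below reads a Cloud SQL (MySQL) table over JDBC and then runs a plain SQL query on it with Spark SQL. The host, database, table name, column names, and credentials are all placeholders you would replace with your own; the MySQL JDBC driver jar also has to be available to the cluster (e.g. passed via `--jars` to `spark-submit`). Treat this as a connection/configuration sketch rather than a runnable job, since it needs a live Cloud SQL instance.

```python
# Sketch: Cloud SQL (MySQL) -> Spark DataFrame via JDBC, then Spark SQL.
# All connection values and table/column names below are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("cloudsql-batch")  # any app name
    .getOrCreate()
)

# Read one table over JDBC. Requires the MySQL driver jar on the
# Spark classpath (e.g. spark-submit --jars mysql-connector-java.jar).
orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://<CLOUD_SQL_IP>:3306/<DATABASE>")  # placeholder
    .option("dbtable", "orders")        # placeholder table name
    .option("user", "<USER>")
    .option("password", "<PASSWORD>")
    .load()
)

# Register the DataFrame as a temp view and query it with plain SQL.
orders.createOrReplaceTempView("orders")
daily_counts = spark.sql(
    "SELECT order_date, COUNT(*) AS n FROM orders GROUP BY order_date"
)
daily_counts.show()
```

On Dataproc you would submit this with `gcloud dataproc jobs submit pyspark`, pointing `--jars` at the JDBC driver.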
There is also a connector for MongoDB and Spark. {3}
And a connector for MongoDB and Hadoop. {4}
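For the MongoDB side, here is a minimal sketch using the MongoDB Spark connector from PySpark. The URI, database, and collection are placeholders, and the connector itself has to be supplied to Spark (e.g. via `--packages org.mongodb.spark:mongo-spark-connector_2.12:<version>` on `spark-submit`); the `spark.mongodb.input.uri` option and `mongo` format name follow the connector's documented configuration, but check the version you install.

```python
# Sketch: MongoDB collection -> Spark DataFrame via the MongoDB Spark
# connector. Connection URI, database, and collection are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("mongo-batch")
    # Tells the connector which collection to read by default.
    .config(
        "spark.mongodb.input.uri",
        "mongodb://<HOST>:27017/<DATABASE>.<COLLECTION>",  # placeholder
    )
    .getOrCreate()
)

# Read the collection; the schema is inferred from sampled documents.
events = spark.read.format("mongo").load()
events.printSchema()
```

Once loaded, the DataFrame can be joined with the Cloud SQL data or queried with Spark SQL like any other source.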
{1}: https://spark.apache.org/docs/0.9.0/python-programming-guide.html
{2}: https://spark.apache.org/docs/latest/sql-programming-guide.html