Google Cloud Dataflow - Java SDK vs Python SDK

I'm starting to use Google Cloud Dataflow for batch and streaming processing. The jobs being developed mostly ingest data from different sources (MySQL, Kafka, and file systems), cleanse it, do some streaming and batch aggregation, and write the results back to Google Cloud Storage.

Given these tasks, are there any recommendations on whether to use the Java SDK or the Python SDK for writing the jobs? Are there any noticeable differences between them in terms of performance and features?

For example, I noticed that the Java SDK has a built-in I/O PTransform, JdbcIO, which reads and writes data over JDBC. This is not available in the Python SDK (so far). Is it possible to use the Java SDK only for the pipeline that reads from a MySQL database and writes to Google Cloud Storage, and use a different SDK (e.g. Python) for the other pipelines?

Thanks in advance for your time!

I would go ahead with the Java SDK, since it has more features and more external connectors. The Python SDK is catching up, though.

As for performance: when we submit a Beam job to Dataflow, the job steps are sent to the Google Cloud Dataflow service in an API call, and the service runs the job. So I don't think there is a significant performance difference between the SDKs as far as Dataflow is concerned.
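To make the submission model concrete, here is a minimal sketch of a Python batch pipeline that cleanses and aggregates text records. The record layout (`user,amount` CSV lines) and the names `clean_record` and `run` are assumptions made for illustration, not something from the question. The same code runs locally or on Dataflow depending only on the pipeline options passed in.

```python
def clean_record(line):
    """Parse a 'user,amount' CSV line (hypothetical layout).

    Returns (user, amount), or None if the line is malformed.
    """
    parts = [p.strip() for p in line.split(",")]
    if len(parts) != 2 or not parts[0]:
        return None
    try:
        return parts[0].lower(), float(parts[1])
    except ValueError:
        return None


def run(input_path, output_path, pipeline_args=None):
    # Imported here so clean_record can be unit-tested without Beam installed.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # The runner (local or Dataflow) is chosen by the flags in pipeline_args.
    options = PipelineOptions(pipeline_args)
    with beam.Pipeline(options=options) as p:
        (
            p
            | "Read" >> beam.io.ReadFromText(input_path)
            | "Clean" >> beam.Map(clean_record)
            | "DropMalformed" >> beam.Filter(lambda kv: kv is not None)
            | "SumPerUser" >> beam.CombinePerKey(sum)
            | "Format" >> beam.MapTuple(lambda user, total: f"{user},{total}")
            | "Write" >> beam.io.WriteToText(output_path)
        )
```

To submit this to Dataflow rather than run it locally, `pipeline_args` would carry flags such as `--runner=DataflowRunner`, `--project`, `--region`, and `--temp_location` (a `gs://` path), with `input_path` and `output_path` pointing at Cloud Storage.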

I've been using the Python SDK for development. While the Java SDK has the built-in JdbcIO PTransform, there are community packages for Python, such as beam-nuggets, that can read from and write to MySQL. This is what I have been using to develop ETLs.

The link for the package: https://pypi.org/project/beam-nuggets/
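As a rough sketch of what this can look like, the example below reads a MySQL table with beam-nuggets and writes each row as a JSON line to Cloud Storage. It assumes the `relational_db` module as documented in the beam-nuggets README and a MySQL driver such as PyMySQL being installed. The host, credentials, database, table name, and the helper `row_to_json` are all hypothetical placeholders.

```python
import json


def row_to_json(row):
    """Serialize one DB row (a dict) to a JSON line.

    default=str covers values json cannot encode natively, e.g. dates.
    """
    return json.dumps(row, sort_keys=True, default=str)


def run(output_prefix, pipeline_args=None):
    # Imported here so row_to_json can be tested without Beam installed.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions
    from beam_nuggets.io import relational_db

    # All connection values below are placeholders (assumptions).
    source_config = relational_db.SourceConfiguration(
        drivername="mysql+pymysql",
        host="localhost",
        port=3306,
        username="etl_user",
        password="change-me",
        database="shop",
    )

    with beam.Pipeline(options=PipelineOptions(pipeline_args)) as p:
        (
            p
            | "ReadMySQL" >> relational_db.ReadFromDB(
                source_config=source_config,
                table_name="orders",  # hypothetical table
            )
            | "ToJson" >> beam.Map(row_to_json)
            | "WriteGCS" >> beam.io.WriteToText(
                output_prefix,  # e.g. a gs://bucket/prefix path
                file_name_suffix=".jsonl",
            )
        )
```

When running on Dataflow, the workers also need beam-nuggets and the database driver installed, for example through a requirements file passed in the pipeline options.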

Overall, there are more features in the Java SDK.

If you are more comfortable with Python, you can definitely write the pipelines that need Java-only features in Java, and the rest in Python.
