
Google Cloud Dataflow - Java SDK vs Python SDK

Posted 2020-07-07 15:27:39. Tags: java, python, google-cloud-platform, google-cloud-dataflow, apache-beam

I'm starting to use Google Cloud Dataflow for batch and streaming processing. The jobs being developed are mostly for ingesting data from different sources (MySQL, Kafka, and file systems), cleansing it, doing some streaming and batch aggregation, and writing the results back to Google Cloud Storage.

Given these tasks, are there any recommendations on whether to use the Java SDK or the Python SDK for writing the jobs? Are there any noticeable differences between them in terms of performance and features?

For example, I noticed that the Java SDK has a built-in I/O PTransform, JdbcIO. This PTransform reads and writes data over JDBC, and it is not available in the Python SDK (so far). Is it possible to use the Java SDK only for the pipeline that reads from a MySQL database and writes to Google Cloud Storage, while using a different SDK (e.g. Python) for the other pipelines?

Thanks in advance for your time!

2 Answers

I would go with the Java SDK, as it has more features and external connectors. But the Python SDK is catching up.

As for performance: when we submit a Beam job to Dataflow, the job steps are sent to the Google Cloud Dataflow service in an API call. Hence, I think there is no significant performance difference between the SDKs as far as Dataflow is concerned.
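To make the submission step concrete, here is a minimal Python sketch of pointing a Beam pipeline at the Dataflow runner. The flag names (`--runner`, `--project`, `--region`, `--temp_location`, `--job_name`, `--streaming`) are standard Beam pipeline options; the helper names `dataflow_flags` and `submit_example_job`, and every project, region, and bucket value, are placeholders assumed for illustration only.

```python
def dataflow_flags(project, region, temp_location, job_name, streaming=False):
    """Build the command-line style flags that select the Dataflow runner."""
    flags = [
        "--runner=DataflowRunner",
        f"--project={project}",
        f"--region={region}",
        f"--temp_location={temp_location}",
        f"--job_name={job_name}",
    ]
    if streaming:
        # Only needed for unbounded (streaming) pipelines.
        flags.append("--streaming")
    return flags


def submit_example_job():
    """Submit a tiny counting pipeline; needs apache-beam[gcp] and GCP credentials."""
    # Imported here so dataflow_flags() can be used without Beam installed.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # All values below are placeholders, not real resources.
    options = PipelineOptions(
        dataflow_flags(
            project="example-project",
            region="us-central1",
            temp_location="gs://example-bucket/tmp",
            job_name="example-count-job",
        )
    )
    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | "Create" >> beam.Create(["a", "b", "a"])
            | "Count" >> beam.combiners.Count.PerElement()
            | "Format" >> beam.MapTuple(lambda word, count: f"{word},{count}")
            | "Write" >> beam.io.WriteToText("gs://example-bucket/output/counts")
        )
```

Calling `submit_example_job()` would build the pipeline locally and hand it to the Dataflow service, which is where the steps actually run.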

I've been using the Python SDK for development. While the Java SDK has the built-in PTransform JdbcIO, there are community packages for Python, such as beam-nuggets, which can be used for reading from and writing to MySQL. This is what I have been using to develop ETLs.

The link for the package: https://pypi.org/project/beam-nuggets/
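As a rough sketch of what a MySQL-to-GCS pipeline with this package could look like: the example below uses `relational_db.SourceConfiguration` and `relational_db.ReadFromDB` from beam-nuggets, which emit each row as a dict. The `mysql+pymysql` driver name assumes the PyMySQL driver is installed, and the `cleanse` rule, the `run` helper, and the host, credentials, table, and bucket names are all placeholders made up for illustration.

```python
def cleanse(row):
    """Trim string fields and drop keys whose value is None (illustrative rule only)."""
    return {
        key: value.strip() if isinstance(value, str) else value
        for key, value in row.items()
        if value is not None
    }


def run(argv=None):
    """Read a MySQL table, cleanse each row, and write JSON lines to GCS."""
    # Imported here so cleanse() can be tested without Beam installed.
    import json

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions
    from beam_nuggets.io import relational_db

    # All connection values are placeholders, not real settings.
    source_config = relational_db.SourceConfiguration(
        drivername="mysql+pymysql",
        host="localhost",
        port=3306,
        username="user",
        password="password",
        database="example_db",
    )

    with beam.Pipeline(options=PipelineOptions(argv)) as pipeline:
        (
            pipeline
            | "ReadFromMySQL" >> relational_db.ReadFromDB(
                source_config=source_config,
                table_name="example_table",
            )
            | "Cleanse" >> beam.Map(cleanse)
            # default=str makes dates and decimals serializable.
            | "ToJson" >> beam.Map(json.dumps, default=str)
            | "WriteToGCS" >> beam.io.WriteToText(
                "gs://example-bucket/output/rows",
                file_name_suffix=".json",
            )
        )
```

Calling `run()` with Dataflow flags in `argv` would submit the job; without them Beam falls back to its local runner, which is handy for trying the cleansing step on a small table first.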

Overall, there are more features in the Java SDK.

If you are more comfortable with Python, you can write in Java only those pipelines that need features unique to the Java SDK, and write the rest in Python.