Google Cloud Dataflow - Java SDK vs Python SDK
I'm starting to use Google Cloud Dataflow for batch and streaming processing. The jobs being developed are mostly for ingesting data from different sources (MySQL, Kafka, and file systems), cleansing the data, performing some streaming and batch aggregations, and writing the results back to Google Cloud Storage.

Given these tasks, are there any recommendations on whether to use the Java SDK or the Python SDK for writing the jobs? Are there any noticeable differences in performance or features between them?
For example, I noticed that for the Java SDK, the built-in I/O PTransform JdbcIO is available. This PTransform reads and writes data over JDBC, and it is not available in the Python SDK (so far). Is it possible to use the Java SDK only for the pipeline that reads from a MySQL database and writes to Google Cloud Storage, while using a different SDK (e.g., Python) for the other pipelines?
Thanks in advance for your time!
I would go ahead with the Java SDK, as more features and external connectors are available in Java. But the Python SDK is also catching up.
As far as performance considerations are concerned, when we submit a Beam job to Dataflow, the job steps are sent in an API call to the Google Cloud Dataflow service. Hence, I think there is no significant difference in performance as far as Dataflow is concerned.
I've been using the Python SDK for development. While the built-in PTransform JdbcIO exists only in the Java SDK, there are community packages for Python, such as beam-nuggets, which can be used for reading from and writing to MySQL. This is what I have been using to develop ETLs.

The link for the package: https://pypi.org/project/beam-nuggets/
Overall, there are more features in the Java SDK. If you are more comfortable with Python, you can definitely write in Java only the pipelines that require certain Java-only features, and write the rest in Python.