
Apache Beam Performance Between Python Vs Java Running on GCP Dataflow

We have Beam data pipelines running on GCP Dataflow, written in both Python and Java. In the beginning we had some simple, straightforward Python Beam jobs that worked very well, so recently we decided to convert more Java Beam jobs to Python. With more complicated jobs, especially jobs requiring windowing, we noticed that the Python jobs are significantly slower than the Java jobs; they end up using more CPU and memory and cost much more.

Some sample Python code looks like:

    step1 = (
        read_from_pub_sub
        | "MapKey" >> beam.Map(lambda elem: (elem.data[key], elem))
        | "WindowResults"
        >> beam.WindowInto(
            beam.window.SlidingWindows(360, 90),
            allowed_lateness=args.allowed_lateness,
        )
        | "GroupById" >> beam.GroupByKey()
    )

And the Java code is like:

    PCollection<DataStructure> step1 =
        message
            .apply(
                "MapKey",
                MapElements.into(
                        TypeDescriptors.kvs(
                            TypeDescriptors.strings(), TypeDescriptor.of(DataStructure.class)))
                    .via(event -> KV.of(event.key, event)))
            .apply(
                "WindowResults",
                Window.<KV<String, DataStructure>>into(
                        SlidingWindows.of(Duration.standardSeconds(360))
                            .every(Duration.standardSeconds(90)))
                    .withAllowedLateness(Duration.standardSeconds(this.allowedLateness))
                    .discardingFiredPanes())
            .apply("GroupById", GroupByKey.<String, DataStructure>create());

We noticed that the Python pipeline always uses about 3 times more CPU and memory than the Java one needs. We ran some experimental tests with just JSON input and JSON output, with the same results. We are not sure whether that is simply because Python is in general slower than Java, or whether GCP Dataflow executes Beam Python and Java differently. Any similar experience, tests, and explanations of why this happens would be appreciated.

Yes, this is a very normal performance factor between Python and Java. In fact, for many programs the factor can be 10x or much more.

The details of the program can radically change the relative performance. Here are some things to consider:

If you prefer Python for its concise syntax or library ecosystem, the way to achieve speed is to use optimized C libraries or Cython for the core processing, for example using pandas/numpy/etc. If you use Beam's new Pandas-compatible DataFrame API, you will automatically get this benefit.
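To make the point concrete, here is a minimal sketch of that idea outside of Beam (the function names and sample data are illustrative, not from the question): the same per-key aggregation computed element-by-element in the Python interpreter versus vectorised through NumPy's C kernels. In a real pipeline the vectorised path typically runs far fewer Python-level operations per element, which is where much of the Python SDK's overhead comes from.

```python
import numpy as np

def mean_per_key_python(pairs):
    # Pure-Python grouping and averaging: every element crosses the
    # interpreter loop, which is the main source of per-element overhead.
    sums, counts = {}, {}
    for key, value in pairs:
        sums[key] = sums.get(key, 0.0) + value
        counts[key] = counts.get(key, 0) + 1
    return {k: sums[k] / counts[k] for k in sums}

def mean_per_key_numpy(keys, values):
    # Vectorised version: the per-element work happens inside NumPy's
    # C kernels, so only a handful of Python-level calls are made.
    keys = np.asarray(keys)
    values = np.asarray(values, dtype=float)
    uniq, inverse = np.unique(keys, return_inverse=True)
    sums = np.bincount(inverse, weights=values)
    counts = np.bincount(inverse)
    return dict(zip(uniq.tolist(), (sums / counts).tolist()))

keys = ["a", "b", "a", "b", "a"]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
print(mean_per_key_python(zip(keys, values)))  # {'a': 3.0, 'b': 3.0}
print(mean_per_key_numpy(keys, values))        # {'a': 3.0, 'b': 3.0}
```

Beam's DataFrame API (and transforms backed by pandas/numpy) moves you from the first pattern toward the second without changing the pipeline's shape.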

