Apache Beam Performance Between Python Vs Java Running on GCP Dataflow
We run Beam data pipelines on GCP Dataflow written in both Python and Java. In the beginning, we had some simple, straightforward Python Beam jobs that worked very well, so more recently we decided to convert more Java Beam jobs to Python. When the jobs became more complicated, especially jobs requiring windowing, we noticed that the Python jobs are significantly slower than the Java jobs, ending up using more CPU and memory and costing much more.

Some sample Python code looks like:
step1 = (
    read_from_pub_sub
    | "MapKey" >> beam.Map(lambda elem: (elem.data[key], elem))
    | "WindowResults"
    >> beam.WindowInto(
        beam.window.SlidingWindows(360, 90),
        allowed_lateness=args.allowed_lateness,
    )
    | "GroupById" >> beam.GroupByKey()
)
And the Java code is like:
PCollection<KV<String, Iterable<DataStructure>>> step1 =
    message
        .apply(
            "MapKey",
            MapElements.into(
                    TypeDescriptors.kvs(
                        TypeDescriptors.strings(), TypeDescriptor.of(DataStructure.class)))
                .via(event -> KV.of(event.key, event)))
        .apply(
            "WindowResults",
            Window.<KV<String, DataStructure>>into(
                    SlidingWindows.of(Duration.standardSeconds(360))
                        .every(Duration.standardSeconds(90)))
                .withAllowedLateness(Duration.standardSeconds(this.allowedLateness))
                .discardingFiredPanes())
        .apply("GroupById", GroupByKey.<String, DataStructure>create());
We noticed Python always uses roughly 3x more CPU and memory than Java needs. We ran some experimental tests that just did JSON input and JSON output, with the same results.

We are not sure whether that is just because Python in general is slower than Java, or because of how GCP Dataflow executes Beam Python versus Beam Java. Any similar experiences, tests, and explanations of why this happens would be appreciated.
Yes, this is a very normal performance factor between Python and Java. In fact, for many programs the factor can be 10x or much more.

The details of the program can radically change the relative performance. Here are some things to consider:
If you prefer Python for its concise syntax or library ecosystem, the way to get speed is to use optimized C libraries or Cython for the core processing, for example pandas/numpy/etc. If you use Beam's new Pandas-compatible DataFrame API, you will automatically get this benefit.
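To illustrate why C-backed libraries close the gap, here is a minimal standalone sketch (plain NumPy, independent of Beam, not from the original post) contrasting a per-element Python loop with the equivalent vectorized call. The same idea applies inside a DoFn: batch elements and hand the batch to numpy/pandas rather than looping in interpreted Python.

```python
import numpy as np

data = list(range(1_000_000))

# Pure-Python per-element loop: every iteration pays interpreter overhead,
# which is the cost the Python SDK worker pays for element-by-element DoFns.
loop_result = [x * 2 + 1 for x in data]

# Vectorized NumPy: the same arithmetic runs once in optimized C over the
# whole batch, amortizing the interpreter overhead away.
vec_result = (np.asarray(data) * 2 + 1).tolist()

# Both approaches compute identical values; only the per-element cost differs.
assert loop_result == vec_result
```

Beam's DataFrame API arranges this kind of batching for you, which is why it tends to recover much of the performance lost to per-element Python processing.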