
Low parallelism when running Apache Beam wordcount pipeline on Spark with Python SDK

I am quite experienced with Spark cluster configuration and running PySpark pipelines, but I'm just starting with Beam. So, I am trying to do an apples-to-apples comparison between PySpark and the Beam Python SDK on a Spark PortableRunner (running on top of the same small Spark cluster: 4 workers, each with 4 cores and 8GB RAM), and I've settled on a wordcount job for a reasonably large dataset, storing the results in a Parquet table.

I have thus downloaded 50GB of Wikipedia text files, split across about 100 uncompressed files, and stored them in the directory /mnt/nfs_drive/wiki_files/ (/mnt/nfs_drive is an NFS drive mounted on all workers).

First, I am running the following PySpark wordcount script:

from pyspark.sql import SparkSession, Row
from operator import add
wiki_files = '/mnt/nfs_drive/wiki_files/*'

spark = SparkSession.builder.appName("WordCountSpark").getOrCreate()

spark_counts = spark.read.text(wiki_files).rdd.map(lambda r: r['value']) \
    .flatMap(lambda x: x.split(' ')) \
    .map(lambda x: (x, 1)) \
    .reduceByKey(add) \
    .map(lambda x: Row(word=x[0], count=x[1]))

spark.createDataFrame(spark_counts).write.parquet(path='/mnt/nfs_drive/spark_output', mode='overwrite')

The script runs perfectly well and outputs the Parquet files at the desired location in about 8 minutes. The main stage (reading and splitting tokens) is divided into a reasonable number of tasks, so that the cluster is used efficiently:

[Spark UI screenshot: the main stage split across many tasks]
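A quick way to confirm this on the PySpark side is to inspect the number of input partitions directly; the following is a minimal sketch, assuming the same spark session and wiki_files glob as in the script above:

# Illustrative check: count the input partitions Spark creates for the text files.
# With spark.sql.files.maxPartitionBytes at its 128MB default, ~50GB of input
# should yield several hundred partitions (and thus tasks) for the read stage.
num_partitions = spark.read.text(wiki_files).rdd.getNumPartitions()
print(f"Input split into {num_partitions} partitions")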

I am now trying to achieve the same with Beam and the portable runner. First, I have started the Spark job server (on the Spark master node) with the following command:

docker run --rm --net=host -e SPARK_EXECUTOR_MEMORY=8g apache/beam_spark_job_server:2.25.0 --spark-master-url=spark://localhost:7077

Then, on the master and worker nodes, I am running the SDK Harness as follows:

 docker run --net=host -d --rm -v /mnt/nfs_drive:/mnt/nfs_drive apache/beam_python3.6_sdk:2.25.0 --worker_pool

Now that the Spark cluster is set up to run Beam pipelines, I can submit the following script:

import apache_beam as beam
import pyarrow
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.io import fileio

options = PipelineOptions([
    "--runner=PortableRunner",
    "--job_endpoint=localhost:8099",
    "--environment_type=EXTERNAL",
    "--environment_config=localhost:50000",
    "--job_name=WordCountBeam"
])


wiki_files = '/mnt/nfs_drive/wiki_files/*'

p = beam.Pipeline(options=options)
beam_counts = (
    p
    | fileio.MatchFiles(wiki_files)
    | beam.Map(lambda x: x.path)
    | beam.io.ReadAllFromText()
    | 'ExtractWords' >> beam.FlatMap(lambda x: x.split(' '))
    | beam.combiners.Count.PerElement()
    | beam.Map(lambda x: {'word': x[0], 'count': x[1]})
)


_ = beam_counts | 'Write' >> beam.io.WriteToParquet('/mnt/nfs_drive/beam_output',
      pyarrow.schema(
          [('word', pyarrow.binary()), ('count', pyarrow.int64())]
      )
)

result = p.run().wait_until_finish()

The code is submitted successfully; I can see the job on the Spark UI and the workers are executing it. However, even when left running for more than 1 hour, it does not produce any output!

I thus wanted to make sure that there is not a problem with my setup, so I've run the exact same script on a smaller dataset (just 1 Wiki file). This completes successfully in about 3.5 minutes (the Spark wordcount on the same dataset takes 16s!).

I wondered how Beam could be that much slower, so I started looking at the DAG submitted to Spark by the Beam pipeline via the job server. I noticed that the Spark job spends most of its time in the following stage:

[Spark UI screenshot: the stage where most of the time is spent]

This stage is split into just 2 tasks, as shown here:

[Spark UI screenshot: the stage's task list, showing only 2 tasks]

Printing debug lines shows that this stage is where the "heavy lifting" (i.e. reading lines from the wiki files and splitting tokens) is performed. However, since this happens in only 2 tasks, the work will be distributed to at most 2 workers. What's also interesting is that running on the large 50GB dataset results in exactly the same DAG with exactly the same number of tasks.

I am quite unsure how to proceed further. It seems like the Beam pipeline has reduced parallelism, but I'm not sure whether this is due to a sub-optimal translation of the pipeline by the job server, or whether I should specify my PTransforms in some other way to increase the parallelism on Spark.

Any suggestions appreciated!

The file IO part of the pipeline can be simplified by using apache_beam.io.textio.ReadFromText(file_pattern='/mnt/nfs_drive/wiki_files/*').

Fusion is another reason that could prevent parallelism. The solution is to throw in an apache_beam.transforms.util.Reshuffle after reading in all the files, as in the sketch below.
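For example, here is a minimal sketch of the question's pipeline with both suggestions applied; treat it as illustrative rather than a tested drop-in, and assume options is built exactly as in the question:

import apache_beam as beam

p = beam.Pipeline(options=options)
beam_counts = (
    p
    | beam.io.ReadFromText('/mnt/nfs_drive/wiki_files/*')    # replaces MatchFiles + Map + ReadAllFromText
    | 'BreakFusion' >> beam.Reshuffle()                       # forces a shuffle so downstream steps are not fused with the read
    | 'ExtractWords' >> beam.FlatMap(lambda x: x.split(' '))
    | beam.combiners.Count.PerElement()
    | beam.Map(lambda x: {'word': x[0], 'count': x[1]})
)
# The WriteToParquet sink and p.run() stay the same as in the question.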

It took a while, but I figured out what the issue is and a workaround.

The underlying problem is in Beam's portable runner, specifically where the Beam job is translated into a Spark job.

The translation code (executed by the job server) splits stages into tasks based on calls to sparkContext().defaultParallelism(). The job server does not configure the default parallelism explicitly (and does not allow the user to set it through pipeline options), hence it falls back, in theory, to configuring the parallelism based on the number of executors (see the explanation at https://spark.apache.org/docs/latest/configuration.html#execution-behavior ). This seems to be the goal of the translation code when calling defaultParallelism().

Now, in practice, it is well known that, when relying on this fallback mechanism, calling sparkContext().defaultParallelism() too soon can result in lower numbers than expected, since executors might not have registered with the context yet. In particular, calling defaultParallelism() too soon will give 2 as the result, and stages will be split into only 2 tasks.
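This race is easy to reproduce outside of Beam. The following is a minimal PySpark sketch (illustrative only; the exact numbers depend on the cluster) showing how the value changes once executors register; on a standalone cluster, Spark reports max(total registered executor cores, 2):

from pyspark import SparkConf, SparkContext
import time

# Illustrative probe: defaultParallelism right after the context is created vs. a few seconds later.
conf = SparkConf().setMaster("spark://localhost:7077").setAppName("ParallelismProbe")
sc = SparkContext(conf=conf)
print(sc.defaultParallelism)   # often 2 here, because no executors have registered yet
time.sleep(3)                  # give the executors time to register (same trick as the workaround below)
print(sc.defaultParallelism)   # now reflects the total number of registered executor cores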

My "dirty hack" workaround thus consists of modifying the source code of the job server by simply adding a delay of 3 seconds after instantiating the SparkContext and before doing anything else:

$ git diff v2.25.0
diff --git a/runners/spark/src/main/java/org/apache/beam/runners/spark/translation/SparkContextFactory.java b/runners/spark/src/main/java/org/apache/beam/runners/spark/translation/SparkContextFactory.java
index aa12192..faaa4d3 100644
--- a/runners/spark/src/main/java/org/apache/beam/runners/spark/translation/SparkContextFactory.java
+++ b/runners/spark/src/main/java/org/apache/beam/runners/spark/translation/SparkContextFactory.java
@@ -95,7 +95,13 @@ public final class SparkContextFactory {
       conf.setAppName(contextOptions.getAppName());
       // register immutable collections serializers because the SDK uses them.
       conf.set("spark.kryo.registrator", SparkRunnerKryoRegistrator.class.getName());
-      return new JavaSparkContext(conf);
+      JavaSparkContext jsc = new JavaSparkContext(conf);
+      try {
+        Thread.sleep(3000);
+      } catch (InterruptedException e) {
+      }
+      return jsc;
     }
   }
 }

After recompiling the job server and launching it with this change, all the calls to defaultParallelism() are made after the executors have registered, and the stages are nicely split into 16 tasks (the same as the number of executors). As expected, the job now completes much faster since there are many more workers doing the work (it is, however, still 3 times slower than the pure Spark wordcount).

While this works, it is of course not a great solution. A much better solution would be one of the following:

  • change the translation engine so that it can deduce the number of tasks based on the number of available executors in a more robust way;
  • allow the user to configure, via pipeline options, the default parallelism to be used by the job server when translating jobs (this is what the Flink portable runner does; see the sketch below).
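For reference, this is roughly what that looks like with the Flink portable runner, which exposes a parallelism pipeline option; a minimal sketch, assuming a Flink job server listening on the same ports as the Spark one above (the endpoints are illustrative), whereas the 2.25.0 Spark job server has no equivalent knob:

from apache_beam.options.pipeline_options import PipelineOptions

# Sketch only: the Flink portable runner honours --parallelism, so stages are
# split into the requested number of tasks instead of relying on a runtime default.
flink_options = PipelineOptions([
    "--runner=PortableRunner",
    "--job_endpoint=localhost:8099",       # Flink job server endpoint (illustrative)
    "--environment_type=EXTERNAL",
    "--environment_config=localhost:50000",
    "--parallelism=16",                    # desired parallelism, e.g. one task per executor core
])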

Until a better solution is in place, this clearly prevents any use of the Beam Spark job server in a production cluster. I will post the issue to Beam's ticket queue so that a better solution can be implemented (hopefully soon).
