

Spark Job stuck between stages after join

I have a Spark job that joins two datasets, performs some transformations, and reduces the data to produce output. The input is currently quite small (two 200 MB datasets), but after the join, as you can see in the DAG, the job gets stuck and never proceeds to stage 4. I tried waiting for hours; it eventually hit an OOM and showed failed tasks for stage 4.

[Screenshots: Spark UI DAG and task summary]

  1. Why doesn't Spark show stage 4 (the data transformation stage) as active after stage 3 (the join stage)? Is it stuck in the shuffle between stages 3 and 4?
  2. What can I do to improve the performance of this Spark job? I tried increasing the shuffle partitions, but got the same result.

Job code:


joinedDataset.groupBy("group_field")
    .agg(collect_set("name").as("names"))
    .select("names").as[List[String]]
      .rdd                                      // converting to RDD since I need to use reduceByKey
      .flatMap(entry => generatePairs(entry))   // this line generates pairs of words out of the input text, so data size increases here
      .map(pair => ((pair._1, pair._2), 1))
      .reduceByKey(_ + _)
      .sortBy(entry => entry._2, ascending = false)
      .coalesce(1)
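The blow-up noted in the `flatMap` comment can be modelled on plain Scala collections. This is a minimal sketch, not the author's actual `generatePairs` (whose implementation isn't shown); it assumes the helper emits all unordered 2-combinations of a group's names, so a group of n names produces n·(n-1)/2 pairs, which is why the shuffle input explodes:

```scala
// Hypothetical stand-in for the user's generatePairs: all 2-combinations.
def generatePairs(names: List[String]): Seq[(String, String)] =
  names.combinations(2).map { case List(a, b) => (a, b) }.toSeq

// Two toy groups, as collect_set("name") per group_field might produce.
val groups = List(List("a", "b", "c"), List("a", "b"))

val counts = groups
  .flatMap(generatePairs)                       // quadratic blow-up happens here
  .map(pair => (pair, 1))
  .groupBy(_._1)                                // stands in for reduceByKey's shuffle
  .map { case (k, vs) => (k, vs.map(_._2).sum) }
  .toList
  .sortBy(-_._2)                                // descending, like sortBy(..., ascending = false)
```

With these inputs, the pair ("a", "b") appears in both groups and tops the sorted result with a count of 2.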

FYI, my cluster has 3 worker nodes with 16 cores and 100 GB RAM each, and 3 executors with 16 cores each (a 1:1 ratio with the machines for simplicity) and 64 GB of memory allocated.

UPDATE: It turns out the data generated in my job is pretty huge. I did some optimisations (strategically reduced the input data and removed some duplicated strings from processing), and now the job finishes within 3 hours. The input to stage 4 is 200 MB and its output is 200 GB. The job uses parallelism properly but performs badly during the shuffle: my shuffle spill during this job was 1825 GB (memory) and 181 GB (disk). Can someone help me reduce the shuffle spill and the job duration? Thanks.
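A common lever for spill of this magnitude is raising the shuffle parallelism so that each reduce task's partition fits in execution memory, and deferring the `coalesce(1)` until after the heavy aggregation. The configuration below is a sketch with assumed starting values for this cluster (3 × 16 cores), not a verified fix, and `your-job.jar` is a placeholder:

```
# Sketch only: partition counts and memory fraction are assumptions to tune.
spark-submit \
  --conf spark.sql.shuffle.partitions=480 \
  --conf spark.default.parallelism=480 \
  --conf spark.memory.fraction=0.8 \
  your-job.jar
```

`spark.default.parallelism` sets the default partition count for RDD operations such as `reduceByKey` when no explicit count is passed; `reduceByKey(_ + _, numPartitions)` also accepts one directly.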

Try an initial sort on the executors, then reduce and sort again:

joinedDataset.groupBy("group_field")
    .agg(collect_set("name").as("names"))
    .select("names").as[List[String]]
      .rdd                                      // converting to RDD since I need to use reduceByKey
      .flatMap(entry => generatePairs(entry))   // this line generates pairs of words out of the input text, so data size increases here
      .map(pair => ((pair._1, pair._2), 1))
      .sortBy(entry => entry._2, ascending = false) // do an initial sort on the executors
      .reduceByKey(_ + _)
      .sortBy(entry => entry._2, ascending = false)
      .coalesce(1)
