
Spark standalone application implements PCA, then hangs for 10-12 minutes and only then removes the RDD from memory

I have a 16-node cluster where every node has Spark and Cassandra installed, with a replication factor of 3, spark.sql.shuffle.partitions set to 96, and Spark-Cassandra-Connector 3.1.0. I perform a Spark join (broadcastHashJoin) between a dataset and a Cassandra table and then run a PCA from the Spark ML library. In between, I persist a dataset and unpersist it only after the PCA computation has finished. According to the Stages tab of the Spark UI, everything finishes in less than 10 minutes, and generally no executor is doing anything:

[screenshot: Spark UI Stages tab, all stages completed]

but the persisted dataset is still persisted and stays that way for another 10-12 minutes, as shown below in the Storage tab of the Spark UI:

[screenshot: Spark UI Storage tab, dataset still cached]

These are the last lines of stderr from one of the nodes, where you can see a gap of about 10 minutes between the last two lines:

22/09/15 11:41:09 INFO MemoryStore: Block taskresult_1436 stored as bytes in memory (estimated size 89.3 MiB, free 11.8 GiB)
22/09/15 11:41:09 INFO Executor: Finished task 3.0 in stage 33.0 (TID 1436). 93681153 bytes result sent via BlockManager)
22/09/15 11:51:49 INFO BlockManager: Removing RDD 20
22/09/15 12:00:24 INFO BlockManager: Removing RDD 20

while in the main console where the application runs, I only get:

1806703 [dispatcher-BlockManagerMaster] INFO  org.apache.spark.storage.BlockManagerInfo  - Removed broadcast_1_piece0 on 192.168.100.237:46523 in memory (size: 243.7 KiB, free: 12.1 GiB)
1806737 [block-manager-storage-async-thread-pool-75] INFO  org.apache.spark.storage.BlockManager  - Removing RDD 20

If I try to print the dataset after the PCA is complete and before I unpersist it, it still takes ~20 minutes; only then does it print the dataset and unpersist it. Why? Could this have to do with the query and the Cassandra table?

I have not enabled MLlib Linear Algebra Acceleration because I am on Ubuntu 20.04, which has incompatibility issues with libgfortran5, etc., but I am also not sure it would help. I don't know where to look, or for what, in order to reduce these 20 minutes to 10. Any ideas what might be happening? Let me know if you need any more information.

It turns out that activating the Linear Algebra Acceleration libraries of Apache Spark ML does make a difference: it reduced the PCA calculation time by 10 minutes, so Spark no longer hangs!
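For anyone hitting the same issue, a rough sketch of what enabling the acceleration involves on an Ubuntu cluster follows. The package and flag names are taken from Spark's "MLlib Linear Algebra Acceleration Guide" for Spark 3.x (which routes BLAS/LAPACK calls through dev.ludovic.netlib); verify them against your Spark version before relying on them:

```shell
# Install a native BLAS/LAPACK implementation on EVERY node
# (Ubuntu 20.04 package names; assumption, adjust for your distro):
sudo apt-get install libopenblas-base liblapack3

# Point netlib at the native library when submitting the application.
# Flag names per the Spark 3.x acceleration guide; check your version's docs:
spark-submit \
  --conf "spark.driver.extraJavaOptions=-Ddev.ludovic.netlib.blas.nativeLib=libopenblas.so" \
  --conf "spark.executor.extraJavaOptions=-Ddev.ludovic.netlib.blas.nativeLib=libopenblas.so" \
  ...
```

If the native library cannot be loaded, Spark silently falls back to the pure-JVM implementation, so it is worth checking the executor logs for a netlib warning to confirm the acceleration is actually active.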
