

Why is the RDD not persisted in memory for every iteration in Spark?

I use Spark for a machine learning application. Spark and Hadoop share the same computer cluster, without any resource manager such as YARN. We can run Hadoop jobs while running Spark tasks.

But the machine learning application runs very slowly. I found that for every iteration, some workers need to add some RDD blocks back into memory, like this:

243413 14/07/23 13:30:07 INFO BlockManagerMasterActor$BlockManagerInfo: Added rdd_2_17 in memory on XXX:48238 (size: 118.3 MB, free: 16.2 GB)
243414 14/07/23 13:30:07 INFO BlockManagerMasterActor$BlockManagerInfo: Added rdd_2_17 in memory on XXX:48238 (size: 118.3 MB, free: 16.2 GB)
243415 14/07/23 13:30:08 INFO BlockManagerMasterActor$BlockManagerInfo: Added rdd_2_19 in memory on TS-XXX:48238 (size: 119.0 MB, free: 16.1 GB)

So I think the recomputation needed to reload the RDD is what makes the application so slow.

My question is: why is the RDD not kept persisted in memory when there is enough free memory? Is it because of the Hadoop jobs?


I added the following JVM parameters: -Xmx10g -Xms10g
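(For context, the same heap size can also be requested through Spark's own configuration rather than raw JVM flags. Below is a minimal sketch assuming a standalone cluster; the app name and master URL are placeholders.)

    import org.apache.spark.{SparkConf, SparkContext}

    // Minimal sketch: request executor memory through Spark's configuration.
    // "10g" mirrors the -Xmx10g above; the app name and master URL are placeholders.
    val conf = new SparkConf()
      .setAppName("ml-iterations")
      .setMaster("spark://master:7077")
      .set("spark.executor.memory", "10g")   // heap each executor can use for cached RDD blocks

    val sc = new SparkContext(conf)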

I found that there were fewer RDD add actions than before, and the task run time was shorter than before. But the total time for one stage is still too long. From the web UI, I found that:

For every stage, not all the workers started at the same time. For example, when worker_1 had finished 10 tasks, worker_2 appeared on the web UI and started its tasks. This leads to a long stage time.


Our Spark cluster works in standalone mode.

It is hard to say what's wrong with your job, but here are some hints.

First, you can try to call persist() on intermediate RDDs to mark that you want them cached. Second, Spark automatically stores the results of shuffle operations on disk at each node, so maybe the problem is not in caching at all.
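For example, a minimal sketch of the first hint, assuming sc is an existing SparkContext; the input path and the per-iteration computation are placeholders:

    import org.apache.spark.storage.StorageLevel

    // Parse the training data once and ask Spark to keep the blocks in memory.
    val points = sc.textFile("hdfs:///path/to/training-data")   // placeholder path
      .map(_.split(" ").map(_.toDouble))                        // one feature vector per line
      .persist(StorageLevel.MEMORY_ONLY)

    var weight = 0.0
    for (i <- 1 to 10) {
      // Without persist(), every iteration would re-read and re-parse the input file.
      weight += 0.01 * points.map(_.sum).reduce(_ + _)          // placeholder update step
    }

    points.unpersist()                                          // release the cached blocks when finished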

You can find some additional information here:
