
Why is the RDD not kept in memory across iterations in Spark?

I use Spark for a machine learning application. Spark and Hadoop share the same cluster without any resource manager such as YARN, so Hadoop jobs can run while Spark tasks are running.

But the machine learning application runs very slowly. I found that in every iteration, some workers have to add RDD blocks to memory again, like this:

243413 14/07/23 13:30:07 INFO BlockManagerMasterActor$BlockManagerInfo: Added rdd_2_17 in memory on XXX:48238 (size: 118.3 MB, free: 16.2 GB)
243414 14/07/23 13:30:07 INFO BlockManagerMasterActor$BlockManagerInfo: Added rdd_2_17 in memory on XXX:48238 (size: 118.3 MB, free: 16.2 GB)
243415 14/07/23 13:30:08 INFO BlockManagerMasterActor$BlockManagerInfo: Added rdd_2_19 in memory on TS-XXX:48238 (size: 119.0 MB, free: 16.1 GB)

So I think that recomputing the RDD in order to reload it is what makes the application so slow.

My question is: why was the RDD not kept in memory when there was enough free memory? Is it because of the Hadoop jobs?
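One way to check this suspicion is to count how often each block is reported as added in the log. The following is a minimal sketch, assuming the log line format shown in the excerpt above; the function name `count_block_adds` and the `sample_log` list are hypothetical, introduced only for illustration.

```python
import re
from collections import Counter

# Matches BlockManager lines of the form shown in the excerpt:
#   "... Added rdd_<rddId>_<partition> in memory on <host>:<port> ..."
ADDED_RE = re.compile(r"Added (rdd_\d+_\d+) in memory on (\S+)")


def count_block_adds(lines):
    """Return a Counter mapping block id -> number of 'Added ... in memory' events."""
    counts = Counter()
    for line in lines:
        match = ADDED_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# The three log lines quoted in the question (leading log line numbers dropped).
sample_log = [
    "14/07/23 13:30:07 INFO BlockManagerMasterActor$BlockManagerInfo: "
    "Added rdd_2_17 in memory on XXX:48238 (size: 118.3 MB, free: 16.2 GB)",
    "14/07/23 13:30:07 INFO BlockManagerMasterActor$BlockManagerInfo: "
    "Added rdd_2_17 in memory on XXX:48238 (size: 118.3 MB, free: 16.2 GB)",
    "14/07/23 13:30:08 INFO BlockManagerMasterActor$BlockManagerInfo: "
    "Added rdd_2_19 in memory on TS-XXX:48238 (size: 119.0 MB, free: 16.1 GB)",
]

counts = count_block_adds(sample_log)
# Blocks added more than once are candidates for having been recomputed.
repeated = {block: n for block, n in counts.items() if n > 1}
print(repeated)  # {'rdd_2_17': 2}
```

A block that shows up more than once over the whole run suggests it was dropped and rebuilt, although repeated lines alone do not prove recomputation.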


I added the following JVM parameters: -Xmx10g -Xms10g

After that there were fewer RDD add actions than before, and the task run time was shorter. But the total time for one stage is still too long. From the web UI, I found that:

In every stage, the workers do not all start at the same time. For example, only after worker_1 has finished 10 tasks does worker_2 appear in the web UI and start its tasks. This leads to long stages.


Our Spark cluster runs in standalone mode.

It is hard to say what's wrong with your job, but here are some hints.

First, you can try calling persist() on intermediate RDDs to mark that you want them cached. Second, Spark automatically stores the results of shuffle operations on disk at each node, so the problem may not be in caching at all.
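The first hint could look roughly like the sketch below. It assumes a PySpark-style RDD object and uses only the `persist()`, `count()`, `map()`, `sum()` and `unpersist()` methods; the function `run_iterations` and the toy per-iteration computation are hypothetical and stand in for the real machine learning step.

```python
def run_iterations(rdd, num_iterations):
    """Run a toy iterative job that reuses one cached copy of `rdd`.

    `rdd` is assumed to be a PySpark RDD of numbers (hypothetical input).
    """
    # persist() only marks the RDD for caching; nothing is computed yet.
    rdd.persist()
    # Run one action so the partitions are actually computed and stored.
    rdd.count()

    results = []
    for i in range(1, num_iterations + 1):
        # Each iteration reads the cached partitions instead of rebuilding them.
        # k=i binds the current iteration number into the lambda.
        results.append(rdd.map(lambda x, k=i: x * k).sum())

    # Release the cached blocks once the loop no longer needs them.
    rdd.unpersist()
    return results
```

With a live SparkContext `sc`, this could be called as `run_iterations(sc.parallelize(range(1000)), 5)`. If the "Added rdd_..." lines still repeat for the same blocks after this, the cached partitions are probably being dropped or lost rather than never cached.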

You can find some additional information here:
