Spark non-cached RDD Eviction from memory

Question

From this post How long does RDD remain in memory? , I Would like to know based on the below:

An RDD is an object just like any other. If you don't persist/cache it, it will act as any other object under a managed language would and be collected once there are no alive root objects pointing to it?

What is meant exactly by once there are no alive root objects pointing to it ?

Eg when the Action has been completed?
Or if the transforms have been executed successfully?

I read as much as I could find, but find there is always an open issue in my mind. The well-known expert's response leave a lingering doubt in my mind that I am unable to evict.

The When does a RDD lineage is created? How to find lineage graph? example is great, re-presented here:

val nums = sc.parallelize(0 to 9)
scala> nums.toDebugString
res0: String = (8) ParallelCollectionRDD[0] at parallelize at <console>:24 []

val doubles = nums.map(_ * 2)
scala> doubles.toDebugString
res1: String =
(8) MapPartitionsRDD[1] at map at <console>:25 []
 |  ParallelCollectionRDD[0] at parallelize at <console>:24 []

val groups = doubles.groupBy(_ < 10)
scala> groups.toDebugString
res2: String =
(8) ShuffledRDD[3] at groupBy at <console>:25 []
 +-(8) MapPartitionsRDD[2] at groupBy at <console>:25 []
    |  MapPartitionsRDD[1] at map at <console>:25 []
    |  ParallelCollectionRDD[0] at parallelize at <console>:24 []

Assuming that each transform takes a lengthy period for actual execution, then when can ... RDD[0] be evicted? The earliest point in time, that is. The point is that ...RDD[0] is a parent to ...RDD[1..N] or a parent to all such objects? I state this as I found such a statement elsewhere.

I do not think it is a duplicate it is seeking a clarification on the statement indicated.

My interpretation is that the term root object implies that RDD[0] cannot be subject to garbage collection until an Action has occurred or a cache or checkpoint in the Action DAG path has taken place. Seeking verification on this. The sentence for me on what the root object is, is now unclear. I would have thought the root objects are the earlier RDDs in the chain.

Answer 1

An RDD have memory footprints of different kinds:

1) it consumes memory on a driver (as a regular object)

2) information about this RDD is allocated on workers

3) if an RDD is cached it may allocate additional space on workers

When an RDD becomes unreachable in terms of (1) it cleanup of (2) and (3) is triggered via ContextCleaner . So we are talking only about (1).

It doesn't matter at all whether the RDD is cached or not. Performing actions like count / collect doesn't matter as well. RDD just dies as a regular java object when you leave the scope where this RDD is visible.

In your particular example RDD1 depends on RDD0 so the latter will not be evicted unless the former is evicted. And RDD1 will be evicted only after RDD2 which will be evicted only after RDD3 . And to unlock RDD3 for the garbage collector you must (vaguely speaking) leave the method where you use it.

Spark non-cached RDD Eviction from memory

Question

1 answers

solution1
4 ACCPTED 2019-04-28 12:26:12

Spark non-cached RDD Eviction from memory

Question

1 answers

solution1 4 ACCPTED 2019-04-28 12:26:12

solution1
4 ACCPTED 2019-04-28 12:26:12