Spark - Why is it necessary to collect() to the driver node before printing an RDD? Can it not be done in parallel?

I was reading about how to print RDDs in Spark (I'm using Java), and it seems like most people just collect() (if the RDD is small enough) and use something like forEach(println). Is it not possible to print in parallel? Why do we have to collect the data onto the driver node in order to print?

I was thinking maybe it's because we can't use System.out in parallel, but I feel like that's not it. Furthermore, I'm not quite sure how one would even distribute the data and print in parallel, in terms of code. One approach I was thinking of was a mapPartitions call that does nothing useful as a mapping, but iterates through each partition and prints its contents.

When you call the collect() method, you return all the results to the driver node. You end up with a List instead of an RDD. Let's look at an example in local mode. Suppose you have an RDD of Integer:

JavaRDD<Integer> rdd = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10));
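
Here sc is a JavaSparkContext. For completeness, a minimal sketch of the setup the snippets assume (the app name and the local[*] master are my own illustrative choices, not from the original answer):

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

// local[*] runs Spark in-process, using all available cores as worker threads
SparkConf conf = new SparkConf().setAppName("print-rdd-example").setMaster("local[*]");
JavaSparkContext sc = new JavaSparkContext(conf);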

If you call the forEach method on the collected result (stream().forEach() in Java), the driver node prints all the elements of the RDD in the same order in which you created them.

rdd.collect().stream().forEach(x -> System.out.println(x));

Output: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10

If you want to print the results on each worker, you have to call the foreach method of the RDD itself. It returns nothing to the driver; it just performs the computation you pass to foreach on each worker node.

rdd.foreach(x -> System.out.println(x));

If you look at the console (local mode), you will notice that System.out.println(x) has been executed in separate threads, since the output doesn't respect the original order:

Output: 6, 3, 2, 1, 8, 9, 10, 4, 5, 7

So if you execute it in distributed mode, each executor will print the output of its System.out.println calls to its own log files (stdout), not to the driver's console.
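
If you do need some of the values on the driver's console in distributed mode, and the RDD may be too large to collect(), one common pattern (not from the original answer, but part of the standard Spark API) is to bring back only a bounded sample with take:

// take(n) is an action that returns at most n elements to the driver as a java.util.List
rdd.take(5).forEach(x -> System.out.println(x));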

You also mentioned the mapPartitions method. In your case, I don't find it more useful than calling foreach directly on the RDD; it may just be helpful for controlling the workers' workload.

rdd.repartition(5).mapPartitions(it -> {
    while (it.hasNext()) {
        Integer i = it.next();
        System.out.println(i);
    }
    return it; // the iterator is exhausted at this point, so the resulting RDD is empty
}).count(); // count() is just an action to force execution (mapPartitions is lazy and doesn't run until an action is called)
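
For completeness, a more direct way to get the same per-partition behavior (not used in the original answer, but part of the standard Spark Java API) is foreachPartition. It is an action itself, so no dummy count() is needed:

rdd.repartition(5).foreachPartition(it -> {
    // runs once per partition, on the worker that holds it
    while (it.hasNext()) {
        System.out.println(it.next());
    }
});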

Hope it helps!
