
Apache Spark take Action on Executors in fully distributed mode

I am new to Spark, and I have a basic idea of how transformations and actions work ( guide ). I am trying some NLP operations on each line (basically paragraphs) in a text file. After processing, the result should be sent to a server (REST API) for storage. The program is run as a Spark job (submitted using spark-submit) on a cluster of 10 nodes in YARN mode. This is what I have done so far.

...
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> lines = sc.textFile("data.txt");
JavaRDD<String> processedLines = lines
    .map(line -> {
        // processed here
        return result;
    });
processedLines.foreach(line -> {
    // Send to server
});

This works, but the foreach loop seems sequential; it looks like it is not running in distributed mode on the worker nodes. Am I correct?

I tried the following code, but it doesn't work. Error: java: incompatible types: inferred type does not conform to upper bound(s) . Obviously it's wrong because map is a transformation, not an action.

lines.map(line -> { /* processing */ })
     .map(line -> { /* Send to server */ });

I also tried take() , but it requires an int while processedLines.count() returns a long .

// Does not compile: take() expects an int, but count() returns a long
processedLines.take(processedLines.count()).forEach(pl -> { /* Send to server */ });

The data is huge (greater than 100 GB). What I want is that both the processing and the sending to the server happen on the worker nodes. The processing part in the map definitely takes place on the worker nodes. But how do I send the processed data from the worker nodes to the server, given that the foreach seems to be a sequential loop running in the driver (if I am correct)? Simply put: how do I execute the action on the worker nodes and not in the driver program?

Any help will be highly appreciated.

foreach is an action in Spark. It basically takes each element of the RDD and applies a function to that element.

foreach is performed on the executor (worker) nodes; it does not get applied on the driver node. Note that in Spark's local execution mode, both the driver and the executor can reside in the same JVM.

Check this foreach explanation for reference.
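As a quick sanity check (a minimal sketch; the output appears in each executor's stdout/logs, not on the driver's console), you can print which host processes each partition:

import java.net.InetAddress;

// Runs on the executors: each partition prints the hostname of the
// worker that processes it, visible in the executor logs, not the driver.
processedLines.foreachPartition(iterator -> {
    String host = InetAddress.getLocalHost().getHostName();
    int count = 0;
    while (iterator.hasNext()) {
        iterator.next();
        count++;
    }
    System.out.println("Processed " + count + " lines on " + host);
});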

Your approach looks OK: you map each element of the RDD and then apply foreach to each element. The reason I can think of for why it is taking so long is the size of the data you are dealing with (~100 GB).

One way to optimize this is to repartition the input data set. Ideally, each partition should be around 128 MB for better performance. There are many articles on best practices for repartitioning data; I would suggest you follow them, as it will give some performance benefit.
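A minimal sketch (the partition count is illustrative, derived from ~100 GB at roughly 128 MB per partition):

// ~100 GB / 128 MB ≈ 800 partitions (illustrative; tune for your data)
JavaRDD<String> lines = sc.textFile("data.txt", 800); // minPartitions hint
// or repartition an existing RDD (note: this triggers a full shuffle)
JavaRDD<String> repartitioned = lines.repartition(800);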

The second optimization you can think of is the memory you assign to each executor node. It plays a very important role in Spark tuning.
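For instance (a sketch with illustrative values; the config keys are standard Spark settings, but the numbers and app name are placeholders to tune for your cluster):

import org.apache.spark.SparkConf;

// Illustrative resource settings; adjust to your cluster's capacity.
SparkConf conf = new SparkConf()
        .setAppName("nlp-job")                 // placeholder app name
        .set("spark.executor.instances", "10") // one executor per node
        .set("spark.executor.cores", "4")      // cores per executor
        .set("spark.executor.memory", "8g");   // memory per executor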

The third optimization you can think of is to batch the network calls to the server. You are currently making one network call per element of the RDD. If your design allows batching, you can send more than one element in a single network call. This might help as well if the latency is mostly due to these network calls.
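A sketch of such batching with foreachPartition, assuming a hypothetical sendBatch helper that posts a list of results to your REST API in one call:

import java.util.ArrayList;
import java.util.List;

final int batchSize = 500; // illustrative; tune to what the server accepts

processedLines.foreachPartition(iterator -> {
    // Open one client/connection per partition and reuse it for all batches.
    List<String> batch = new ArrayList<>(batchSize);
    while (iterator.hasNext()) {
        batch.add(iterator.next());
        if (batch.size() == batchSize) {
            sendBatch(batch); // hypothetical helper: one POST per batch
            batch.clear();
        }
    }
    if (!batch.isEmpty()) {
        sendBatch(batch); // flush the final partial batch
    }
});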

I hope this helps.

Firstly, when your code runs on executors it is already in distributed mode. If you want to utilize all the CPU resources on the executors for more parallelism, you should go for the async options, preferably with a batch-mode operation to avoid creating excess client connection objects, as below.

You can replace your

processedLines.foreach(line -> {
    // Send to server
});

with either of the following solutions:

processedLines.foreachAsync(line -> {
    // Send to server
}).get();

// To iterate batch-wise I would go for this
processedLines.foreachPartitionAsync(lineIterator -> {
    // Create your output client connection here, once per partition
    while (lineIterator.hasNext()) {
        String line = lineIterator.next();
        // Send line (or accumulate it into a batch) to the server here
    }
}).get();

Both functions return a Future object (a JavaFutureAction); the call is non-blocking, and calling get() waits for completion. This lets the driver submit work asynchronously, which can add parallelism to your code.
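Note that get() blocks the driver until the job completes; the executor-side work is parallel across partitions either way, so the async variants mainly pay off when the driver submits other work, or several actions, before waiting on them.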
