
NullPointerException in Java Spark DStream when using a variable inside a DStream lambda closure in Spark cluster mode

I have defined a broadcast ArrayList (named "qList") as public static. The list is refilled with new values in the batch-submitted handler whenever a new job starts, and it is then used inside a DStream lambda closure. When the application runs on the Spark cluster, the job fails with a NullPointerException:

Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 17.0 failed 4 times, most recent failure: Lost task 1.3 in stage 17.0 (TID 40, 192.168.1.97, executor 0): java.lang.NullPointerException at QProcessing.lambda$3(QProcessing.java:345)...

My Code:

@Override
public void onBatchSubmitted(StreamingListenerBatchSubmitted arg0) {
    // Runs on the driver: refill the broadcast list at the start of each batch.
    QProcessing.qList.value().clear();
    for (int i = 0; i < 2; i++) {
        try {
            QProcessing.qList.value().add(i, QProcessing.bufferedReader.readLine());
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
...

private static JavaPairDStream<Long, List<String>> DistributeSerach(
        JavaPairDStream<Long, BPlusTree<Integer, String>> inputRDD, int role, int accessControlType,
        boolean topkAttach, int i) {
    return inputRDD.mapToPair(index -> {
        Instant startDistributedBPTSearch = Instant.now();
        // The reported NullPointerException occurs when qList is accessed inside this closure.
        String[] bounds = QProcessing.qList.value().get(i).split(",");
        List<String> searchResult = index._2.searchRange(
                Integer.parseInt(bounds[0]), BPlusTree.RangePolicy.INCLUSIVE,
                Integer.parseInt(bounds[1]), BPlusTree.RangePolicy.INCLUSIVE,
                role, accessControlType, topkAttach);
        Instant endDistributedBPTSearch = Instant.now();
        Duration timeElapsedDistributedBPTSearch = Duration.between(startDistributedBPTSearch,
                endDistributedBPTSearch);
        return new Tuple2<>(timeElapsedDistributedBPTSearch.toMillis(), searchResult);
    });
}

There is a difference in where instructions are executed with Spark. The definition of the RDDs/DStreams (only their instantiation, not their use) happens on the driver, while the transformations and actions on the RDD (your lambda) run on the executors.

Each of these parts runs on a different machine with a different JVM. If you modify a static property, the change only affects the local JVM, so adding elements to the property on the driver is not enough: the executors never see the change, and a static field referenced inside the closure may simply be null in the executor JVM.
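
To make that concrete, here is a minimal sketch of the question's method (the method name is mine; the rest reuses the types from the post) that copies the broadcast handle into a local variable before building the lambda. A local variable is captured and serialized together with the closure, whereas QProcessing.qList written inside the lambda is re-resolved as a static field on the executor JVM, where it is null.

private static JavaPairDStream<Long, List<String>> distributeSearchWithCapturedBroadcast(
        JavaPairDStream<Long, BPlusTree<Integer, String>> inputRDD,
        int role, int accessControlType, boolean topkAttach, int i) {

    // Local, effectively-final copy of the broadcast handle: it is serialized
    // with the closure and shipped to the executors.
    final Broadcast<List<String>> qListLocal = QProcessing.qList;

    return inputRDD.mapToPair(index -> {
        // Runs on an executor; qListLocal is non-null because it travelled with the closure.
        String[] bounds = qListLocal.value().get(i).split(",");
        List<String> searchResult = index._2.searchRange(
                Integer.parseInt(bounds[0]), BPlusTree.RangePolicy.INCLUSIVE,
                Integer.parseInt(bounds[1]), BPlusTree.RangePolicy.INCLUSIVE,
                role, accessControlType, topkAttach);
        // Elapsed-time measurement from the original method omitted for brevity.
        return new Tuple2<>(0L, searchResult);
    });
}

Even then, keep in mind that a broadcast is a read-only snapshot taken when broadcast() is called on the driver; clearing and refilling the wrapped list afterwards (as in onBatchSubmitted) is not propagated to the executors.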

I think the best solution would be not to use a lambda but to use a function object that carries the broadcast variable, something like this:

// You must define what your A, B and C types are
public class MapToPairFunction implements PairFunction<A, B, C> {

    private Broadcast<List<String>> broadcast;

    public void setBroadcast(Broadcast<List<String>> broadcast) {
        this.broadcast = broadcast;
    }

    @Override
    public Tuple2<B, C> call(final A parameter) {
        // Put the code from the lambda here, reading the list through broadcast.value()
    }
}

private static JavaPairDStream<Long, List<String>> DistributeSerach(
        JavaPairDStream<Long, BPlusTree<Integer, String>> inputRDD, int role, int accessControlType,
        boolean topkAttach, int i) {
    MapToPairFunction pairFunction = new MapToPairFunction();
    pairFunction.setBroadcast(QProcessing.qList);
    return inputRDD.mapToPair(pairFunction);
}
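
This works because the broadcast reference is an instance field of a serializable function object, so it is serialized together with the function when Spark ships it to the executors. If the query list really changes for every micro-batch (as the onBatchSubmitted handler suggests), a further option is to broadcast a fresh snapshot per batch from inside transformToPair, whose function body runs on the driver once per batch. The sketch below is an assumption on my part, not code from the original post; driverSideQueries stands for whatever driver-side list the handler refills.

private static JavaPairDStream<Long, List<String>> distributeSearchPerBatch(
        JavaPairDStream<Long, BPlusTree<Integer, String>> inputRDD,
        List<String> driverSideQueries,   // assumed: refreshed on the driver before each batch
        int role, int accessControlType, boolean topkAttach, int i) {

    return inputRDD.transformToPair(rdd -> {
        // Driver side, once per batch: snapshot the current queries and broadcast them.
        JavaSparkContext jsc = JavaSparkContext.fromSparkContext(rdd.context());
        Broadcast<List<String>> qBroadcast = jsc.broadcast(new ArrayList<>(driverSideQueries));

        // Executor side: the inner lambda captures qBroadcast, so it travels with the closure.
        return rdd.mapToPair(index -> {
            String[] bounds = qBroadcast.value().get(i).split(",");
            List<String> searchResult = index._2.searchRange(
                    Integer.parseInt(bounds[0]), BPlusTree.RangePolicy.INCLUSIVE,
                    Integer.parseInt(bounds[1]), BPlusTree.RangePolicy.INCLUSIVE,
                    role, accessControlType, topkAttach);
            return new Tuple2<>(0L, searchResult);
        });
    });
}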

I haven't worked with Spark for a long time, but I hope this helps you.
