
Apache Spark: How to structure code of a Spark Application (especially when using Broadcasts)

I have a general question about structuring the code of Java Spark applications. I want to separate the implementation of Spark transformations from the code that applies them to RDDs, so that the application's source code stays readable even when it uses many transformations, each consisting of many lines of code.

I'll give you a short example first. In this scenario the implementation of a flatMap transformation is provided as an anonymous inner class. This simple application reads an RDD of integers and multiplies each element by every value of an integer array that was broadcast to all worker nodes beforehand:

public static void main(String[] args) {

    SparkConf conf = new SparkConf().setMaster("local").setAppName("MyApp");
    JavaSparkContext sc = new JavaSparkContext(conf);

    JavaRDD<Integer> result = sc.parallelize(Arrays.asList(5, 8, 9));

    final Broadcast<int[]> factors = sc.broadcast(new int[] { 1, 2, 3 });

    result = result.flatMap(new FlatMapFunction<Integer, Integer>() {
        public Iterable<Integer> call(Integer t) throws Exception {
            int[] values = factors.value();
            LinkedList<Integer> result = new LinkedList<Integer>();
            for (int value : values) result.add(t * value);
            return result;
        }
    });

    System.out.println(result.collect());   // [5, 10, 15, 8, 16, 24, 9, 18, 27]

    sc.close();
}

In order to structure the code I have extracted the implementation of the Spark functions into a separate class. The class SparkFunctions provides the implementation of the flatMap transformation and has a setter method to receive a reference to the broadcast variable (in my real-world scenario there would be many operations in this class, all of which access the broadcast data).

I have found that a method representing a Spark transformation can be static as long as it does not access a Broadcast variable or an Accumulator. Why? A static method can only access static attributes, and a static reference to a Broadcast variable is always null on the worker nodes (probably because it is not serialized when Spark ships the class SparkFunctions to the workers).
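For illustration, here is a minimal sketch of the static variant I am referring to (the class and member names are made up for this example). As described above, the static factors reference ends up null on the worker nodes because it is not serialized together with the function:

public class StaticFunctions {

    // Set on the driver only; a static field is not serialized as part of the
    // function, so on a worker node this reference remains null.
    static Broadcast<int[]> factors;

    public static final FlatMapFunction<Integer, Integer> myStaticFunction =
            new FlatMapFunction<Integer, Integer>() {
        public Iterable<Integer> call(Integer t) throws Exception {
            int[] values = factors.value();   // NullPointerException when executed on a worker
            LinkedList<Integer> result = new LinkedList<Integer>();
            for (int value : values) result.add(t * value);
            return result;
        }
    };
}

The non-static class SparkFunctions described above looks like this: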

@SuppressWarnings("serial")
public class SparkFunctions implements Serializable {

    private Broadcast<int[]> factors;

    public SparkFunctions() {
    }

    public void setFactors(Broadcast<int[]> factors) {
        this.factors = factors;
    }

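    // flatMap implementation as an anonymous inner class; it implicitly captures
    // the enclosing SparkFunctions instance and thereby the factors reference
    // that was set via setFactors(...)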
    public final FlatMapFunction<Integer, Integer> myFunction = new FlatMapFunction<Integer, Integer>() {
        public Iterable<Integer> call(Integer t) throws Exception {
            int[] values = factors.value();
            LinkedList<Integer> result = new LinkedList<Integer>();
            for (int value : values) result.add(t * value);
            return result;
        }
    };

}

This is the second version of the application, using the class SparkFunctions:

public static void main(String[] args) {

    SparkConf conf = new SparkConf().setMaster("local").setAppName("MyApp");
    JavaSparkContext sc = new JavaSparkContext(conf);

    JavaRDD<Integer> result = sc.parallelize(Arrays.asList(5, 8, 9));

    final Broadcast<int[]> factors = sc.broadcast(new int[] { 1, 2, 3 });

    // 1) Initializing
    SparkFunctions functions = new SparkFunctions();

    // 2) Pass reference of broadcast variable
    functions.setFactors(factors);

    // 3) Implementation is now in the class SparkFunctions
    result = result.flatMap(functions.myFunction);

    System.out.println(result.collect());   // [5, 10, 15, 8, 16, 24, 9, 18, 27]

    sc.close();
}

Both versions of the application work (locally and in a cluster setup), but I am wondering whether they are equally efficient.

Question 1: As I understand it, Spark serializes the class SparkFunctions, including the Broadcast variable, and sends it to the worker nodes so that the nodes can use the function in their tasks. Is the data sent to the worker nodes twice, first via the broadcast using SparkContext and then again when the class SparkFunctions is serialized? Or is it even sent once per element (plus once for the broadcast)?

Question 2: Can you suggest how the source code might be structured differently?

Please don't suggest solutions that avoid the Broadcast; my real-world application is much more complex.

Similar questions I have found that were not really helpful:

Thanks in advance for your help!

This is regarding Question 1.

When a Spark job is submitted, it is divided into stages, and the stages into tasks. The tasks actually carry out the execution of the transformations and actions on the worker nodes. The driver's submitTask() serializes the functions and the metadata about the broadcast variable and sends them to all nodes.

Anatomy of how broadcast works.

The driver creates a local directory to store the data to be broadcast and launches an HttpServer with access to that directory. The data is actually written into the directory when the broadcast is created (val bdata = sc.broadcast(data)). At the same time, the data is also written into the driver's blockManager with a StorageLevel of memory + disk. The block manager allocates a blockId (of type BroadcastBlockId) for the data.

The real data is broadcast only when an executor deserializes the task it has received; along with the task it also gets the broadcast variable's metadata in the form of a Broadcast object. It then calls the readObject() method of that metadata object (the bdata variable). This method first checks the local block manager to see whether there is already a local copy. If not, the data is fetched from the driver. Once fetched, it is stored in the local block manager for subsequent use.
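A rough way to observe this from the application side (a sketch of my own, not taken from the Spark sources): print the identity of the array returned by value() inside the tasks. After the first fetch, tasks running in the same executor JVM see the same locally cached copy, so the broadcast data is not transferred once per element:

JavaRDD<String> probe = sc.parallelize(Arrays.asList(1, 2, 3, 4), 4)
        .map(new Function<Integer, String>() {
            public String call(Integer i) throws Exception {
                // value() resolves against the local block manager; only the first
                // access on an executor fetches the data from the driver
                int[] v = factors.value();
                return "element " + i + " -> array@" + Integer.toHexString(System.identityHashCode(v));
            }
        });

for (String line : probe.collect()) {
    System.out.println(line);
}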
