How to access a Java Spark Broadcast variable?

I am trying to broadcast a Dataset in Spark so that I can access it from within a map function. The first print statement returns the first line of the broadcast dataset, as expected. Unfortunately, the second print statement never returns a result; the execution simply hangs at that point. Any idea what I'm doing wrong?

    // Note: this broadcasts a handle to the RDD, not the RDD's data
    Broadcast<JavaRDD<Row>> broadcastedTrainingData = this.javaSparkContext.broadcast(trainingData.toJavaRDD());

    // Works: runs on the driver
    System.out.println("Data:" + broadcastedTrainingData.value().first());
    JavaRDD<Row> rowRDD = this.javaSparkContext.parallelize(stringAsList).map((Integer row) -> {
        // Hangs: first() is an RDD action invoked inside executor code
        System.out.println("Data (map):" + broadcastedTrainingData.value().first());
        return RowFactory.create(row);
    });
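A sketch of the workaround that is usually suggested for this situation, under the assumption that the hang comes from calling an RDD action (`first()`) inside executor code: collect the rows to the driver first and broadcast the plain `List`, so that `value()` on the executors returns ordinary local data and never triggers a nested Spark job. Class name, sample data, and local master are illustrative, not from the original post.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;

public class BroadcastListSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("broadcast-sketch")
                .master("local[2]")   // local mode just for this sketch
                .getOrCreate();
        JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

        // Stand-in for `trainingData` from the question (hypothetical content)
        Dataset<Row> trainingData = spark.range(5).toDF("feature");

        // Broadcast plain data, not an RDD/Dataset handle: collect to the
        // driver first, then ship the List to every executor.
        List<Row> localRows = trainingData.collectAsList();
        Broadcast<List<Row>> broadcastRows = jsc.broadcast(localRows);

        JavaRDD<Row> rowRDD = jsc.parallelize(Arrays.asList(2, 3, 4)).map((Integer row) -> {
            // Safe: value() is an ordinary List already present on this
            // executor; no nested Spark action is triggered here.
            System.out.println("Data (map):" + broadcastRows.value().get(0));
            return RowFactory.create(row);
        });

        System.out.println(rowRDD.count()); // forces the map to run
        spark.stop();
    }
}
```

The difference from the question's code is only *what* gets broadcast: a `List<Row>` is serializable data, while a `JavaRDD`/`Dataset` is a driver-side handle whose methods cannot be invoked on executors.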

The following pseudocode highlights what I want to achieve. My main goal is to broadcast the training dataset so that I can use it from within a map function.

    public Dataset<Row> getWSSE(Dataset<Row> trainingData, int clusterRange) {
        StructType structType = new StructType();
        structType = structType.add("ClusterAm", DataTypes.IntegerType, false);
        structType = structType.add("Cost", DataTypes.DoubleType, false);

        List<Integer> stringAsList = new ArrayList<>();
        for (int clusterAm = 2; clusterAm < clusterRange + 2; clusterAm++) {
            stringAsList.add(clusterAm);
        }

        Broadcast<Dataset<Row>> broadcastedTrainingData = this.javaSparkContext.broadcast(trainingData);

        System.out.println("Data:" + broadcastedTrainingData.value().first());
        JavaRDD<Row> rowRDD = this.javaSparkContext.parallelize(stringAsList).map((Integer row) -> RowFactory.create(row));

        StructType schema = DataTypes.createStructType(new StructField[]{DataTypes.createStructField("ClusterAm", DataTypes.IntegerType, false)});

        Dataset<Row> wsse = sqlContext.createDataFrame(rowRDD, schema).toDF();
        wsse.show();

        ExpressionEncoder<Row> encoder = RowEncoder.apply(structType);

        Dataset<Row> result = wsse.map(
                (MapFunction<Row, Row>) row -> RowFactory.create(
                        row.getAs("ClusterAm"),
                        new KMeans().setK(row.getAs("ClusterAm")).setSeed(1L)
                                .fit(broadcastedTrainingData.value())
                                .computeCost(broadcastedTrainingData.value())),
                encoder);

        result.show();
        broadcastedTrainingData.destroy();
        return wsse;
    }
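Aside from the broadcast issue, `KMeans.fit` launches its own distributed jobs, so it cannot run inside a `MapFunction` on an executor. A common alternative, sketched here under assumed names (the sample data, schema helper, and local master are illustrative): loop over the cluster counts on the driver and build the result Dataset from the collected costs. `computeCost` is used to match the pseudocode; newer Spark versions deprecate it in favor of `summary().trainingCost()`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.spark.ml.clustering.KMeans;
import org.apache.spark.ml.clustering.KMeansModel;
import org.apache.spark.ml.linalg.VectorUDT;
import org.apache.spark.ml.linalg.Vectors;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class DriverSideWsseSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("wsse-sketch").master("local[2]").getOrCreate();

        // Hypothetical training data with the "features" column KMeans expects
        List<Row> points = Arrays.asList(
                RowFactory.create(Vectors.dense(0.0, 0.0)),
                RowFactory.create(Vectors.dense(1.0, 1.0)),
                RowFactory.create(Vectors.dense(9.0, 8.0)),
                RowFactory.create(Vectors.dense(8.0, 9.0)));
        StructType featuresSchema = new StructType(new StructField[]{
                new StructField("features", new VectorUDT(), false, Metadata.empty())});
        Dataset<Row> trainingData = spark.createDataFrame(points, featuresSchema);

        // Loop on the driver: each fit() is itself a distributed job, so no
        // broadcast of trainingData is needed and nothing runs nested on
        // an executor.
        int clusterRange = 2;
        List<Row> costs = new ArrayList<>();
        for (int clusterAm = 2; clusterAm < clusterRange + 2; clusterAm++) {
            KMeansModel model = new KMeans().setK(clusterAm).setSeed(1L).fit(trainingData);
            costs.add(RowFactory.create(clusterAm, model.computeCost(trainingData)));
        }

        StructType resultSchema = new StructType()
                .add("ClusterAm", DataTypes.IntegerType, false)
                .add("Cost", DataTypes.DoubleType, false);
        Dataset<Row> result = spark.createDataFrame(costs, resultSchema);
        result.show();
        spark.stop();
    }
}
```

The trade-off: the cluster counts are no longer processed by a parallel `map`, but each `fit()` still uses the whole cluster internally, which is usually where the real work is anyway.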
    Dataset<Row> trainingData = ...; // your dataset

    // Creating the broadcast variable. No need to write ClassTag code by hand;
    // use akka.japi.Util, which is available.

    Broadcast<Dataset<Row>> broadcastedTrainingData = spark.sparkContext()
            .broadcast(trainingData, akka.japi.Util.classTag(Dataset.class));

    // Here is the catch: when you iterate over a Dataset, Spark actually
    // runs it in distributed mode. So if you try to access your object
    // directly (e.g. trainingData), it would be null, because you did not
    // explicitly ask Spark to send that outside variable to each machine
    // where the function runs in parallel. That is why you need a
    // Broadcast variable (the most common use of Broadcast).

    someSparkDataSet.foreach((row) -> {
        Dataset<Row> receivedBroadcast = broadcastedTrainingData.value();
        ...
        ...
    });
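As an aside to the `akka.japi.Util.classTag` trick above, two alternatives that avoid the Akka dependency (a sketch; class name, sample data, and local master are illustrative): `JavaSparkContext.broadcast` supplies the `ClassTag` for you, and plain Scala's `ClassTag$.MODULE$.apply` works directly against the Scala `SparkContext`. Note that either way this only broadcasts the Dataset *handle*; calling Dataset methods on it inside executor code will still fail.

```java
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import scala.reflect.ClassTag$;

public class ClassTagSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("classtag-sketch").master("local[2]").getOrCreate();
        Dataset<Row> trainingData = spark.range(3).toDF("x"); // hypothetical data

        // Option 1: JavaSparkContext supplies the ClassTag internally
        JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());
        Broadcast<Dataset<Row>> b1 = jsc.broadcast(trainingData);

        // Option 2: plain Scala ClassTag, no Akka dependency
        Broadcast<Dataset<Row>> b2 = spark.sparkContext().broadcast(
                trainingData,
                ClassTag$.MODULE$.<Dataset<Row>>apply(Dataset.class));

        // Both handles are usable on the driver
        System.out.println(b1.value().count() + " " + b2.value().count());
        spark.stop();
    }
}
```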
