
How to access Java Spark Broadcast variable?

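I am trying to broadcast a Dataset in Spark so that I can access it from inside a map function. The first print statement returns the first row of the broadcast dataset, as expected. Unfortunately, the second print statement returns nothing; execution simply hangs at that point. Any idea what I am doing wrong?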

    Broadcast<JavaRDD<Row>> broadcastedTrainingData = this.javaSparkContext.broadcast(trainingData.toJavaRDD());

    System.out.println("Data:" + broadcastedTrainingData.value().first());
    JavaRDD<Row> rowRDD = this.javaSparkContext.parallelize(stringAsList).map((Integer row) -> {
        System.out.println("Data (map):" + broadcastedTrainingData.value().first());
        return RowFactory.create(row);
    });

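The following pseudocode highlights what I want to achieve. My main goal is to broadcast the training dataset so that I can use it inside a map function.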

    public Dataset<Row> getWSSE(Dataset<Row> trainingData, int clusterRange) {
        StructType structType = new StructType();
        structType = structType.add("ClusterAm", DataTypes.IntegerType, false);
        structType = structType.add("Cost", DataTypes.DoubleType, false);

        List<Integer> stringAsList = new ArrayList<>();
        for (int clusterAm = 2; clusterAm < clusterRange + 2; clusterAm++) {
            stringAsList.add(clusterAm);
        }

        Broadcast<Dataset> broadcastedTrainingData = this.javaSparkContext.broadcast(trainingData);

        System.out.println("Data:" + broadcastedTrainingData.value().first());
        JavaRDD<Row> rowRDD = this.javaSparkContext.parallelize(stringAsList).map((Integer row) -> RowFactory.create(row));

        StructType schema = DataTypes.createStructType(new StructField[]{DataTypes.createStructField("ClusterAm", DataTypes.IntegerType, false)});

        Dataset wsse = sqlContext.createDataFrame(rowRDD, schema).toDF();
        wsse.show();

        ExpressionEncoder<Row> encoder = RowEncoder.apply(structType);

        Dataset result = wsse.map(
                (MapFunction<Row, Row>) row -> RowFactory.create(row.getAs("ClusterAm"), new KMeans().setK(row.getAs("ClusterAm")).setSeed(1L).fit(broadcastedTrainingData.value()).computeCost(broadcastedTrainingData.value())),
                encoder);

        result.show();
        broadcastedTrainingData.destroy();
        return wsse;
    }

        Dataset<Row> trainingData = ...<Your dataset>;

        // Create the broadcast variable. There is no need to write the ClassTag
        // code by hand: akka.japi.Util.classTag builds one, provided Akka is
        // on the classpath.

        Broadcast<Dataset<Row>> broadcastedTrainingData = spark.sparkContext()
              .broadcast(trainingData, akka.japi.Util.classTag(Dataset.class));

        // Here is the catch: when you iterate over a Dataset, Spark actually
        // runs the function in distributed mode. If you try to access an outside
        // object (e.g. trainingData) directly, it would be null, because you did
        // not explicitly ask Spark to send that outside variable to each machine
        // that runs the function in parallel.
        // So you need a Broadcast variable (the most common use of Broadcast).

        // The cast to org.apache.spark.api.java.function.ForeachFunction picks
        // the Java overload of foreach, which a bare lambda leaves ambiguous.
        someSparkDataSet.foreach((ForeachFunction<Row>) row -> {
            Dataset<Row> receivedBroadcast = broadcastedTrainingData.value();
            ...
            ...
        });
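
A related pattern that may be worth trying when the broadcast value is itself a distributed handle (a `JavaRDD` or `Dataset`): collect the rows to the driver first and broadcast the resulting plain list, since an RDD or Dataset handle is generally not usable from inside a task running on an executor. The sketch below assumes the training data is small enough to fit in driver and executor memory. The class name `BroadcastRowsSketch`, the method `describe`, and the parameter `clusterCounts` are hypothetical names chosen for illustration; they do not appear in the question.

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Hypothetical helper class, for illustration only.
public class BroadcastRowsSketch {

    // Broadcasts the collected rows (not the Dataset handle) and reads them
    // from inside a map function that runs on the executors.
    public static List<String> describe(SparkSession spark,
                                        Dataset<Row> trainingData,
                                        List<Integer> clusterCounts) {
        JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

        // Assumption: the data is small enough to be held in memory on the
        // driver and on every executor.
        List<Row> localRows = new ArrayList<>(trainingData.collectAsList());
        Broadcast<List<Row>> broadcastRows = jsc.broadcast(localRows);

        // Inside the lambda, value() returns an ordinary java.util.List,
        // so no Spark job is started from within a task.
        JavaRDD<String> described = jsc.parallelize(clusterCounts)
                .map(k -> "k=" + k + ", rows=" + broadcastRows.value().size());

        List<String> result = described.collect();

        // Release the broadcast once the job that uses it has finished.
        broadcastRows.destroy();
        return result;
    }
}
```

This only makes the rows readable inside the map function. Fitting a `spark.ml` `KMeans` model still takes a `Dataset`, so for the WSSE computation in the question the per-cluster-count loop would most likely have to run on the driver rather than inside `map`.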
