Broadcast Variables not showing inside Partitions in Apache Spark

Scenario and problem: I want to add two attributes to a JSON object based on lookup table values and then insert the JSON into MongoDB. I have a broadcast variable that holds the lookup table. However, I am not able to access it inside foreachPartition, as you can see in the code. It does not give me any error; it simply does not display anything. Because of this, I also can't insert the JSON into MongoDB. I can't find any explanation for this behaviour. Any explanation or workaround to make it work is much appreciated.

Here is my full code:

object ProcessMicroBatchStreams {
  val calculateDistance = udf {
    (lat: String, lon: String) =>
      GeoHash.getDistance(lat.toDouble, lon.toDouble)
  }
  val DB_NAME = "IRT"
  val COLLECTION_NAME = "sensordata"
  val records = Array[String]()

def main(args: Array[String]): Unit = {
  if (args.length < 0) {
    System.err.println("Usage: ProcessMicroBatchStreams <master> <input_directory>")
    System.exit(1)
  }
val conf = new SparkConf()
  .setMaster("local[*]")
  .setAppName(this.getClass.getCanonicalName)
  .set("spark.hadoop.validateOutputSpecs", "false")
/*.set("spark.executor.instances", "3")
.set("spark.executor.memory", "18g")
.set("spark.executor.cores", "9")
.set("spark.task.cpus", "1")
.set("spark.driver.memory", "10g")*/

val sc = new SparkContext(conf)
val ssc = new StreamingContext(sc, Seconds(60))
val sqc = new SQLContext(sc)
val gpsLookUpTable = MapInput.cacheMappingTables(sc, sqc).persist(StorageLevel.MEMORY_AND_DISK_SER_2)
val broadcastTable = sc.broadcast(gpsLookUpTable)


ssc.textFileStream("hdfs://localhost:9000/inputDirectory/")
  .foreachRDD { rdd =>
  //broadcastTable.value.show() // I can access broadcast value here
  if (!rdd.partitions.isEmpty) {
    val partitionedRDD = rdd.repartition(4)
    partitionedRDD.foreachPartition {
      partition =>
        println("Inside Partition")
        broadcastTable.value.show() // I cannot access broadcast value here
        partition.foreach {
          row =>
            val items = row.split("\n")
            items.foreach { item =>
              val mongoColl = MongoClient()(DB_NAME)(COLLECTION_NAME)
              val jsonObject = new JSONObject(item)
              val latitude = jsonObject.getDouble(Constants.LATITUDE)
              val longitude = jsonObject.getDouble(Constants.LONGITUDE)

              // The broadcast value is not being shown here
              // However, there is no error shown
              // I cannot insert the value into Mongo DB
              val selectedRow = broadcastTable.value
                .filter("geoCode LIKE '" + GeoHash.subString(latitude, longitude) + "%'")
                .withColumn("Distance", calculateDistance(col("Lat"), col("Lon")))
                .orderBy("Distance")
                .select(Constants.TRACK_KM, Constants.TRACK_NAME).take(1)
              if (selectedRow.length != 0) {
                jsonObject.put(Constants.TRACK_KM, selectedRow(0).get(0))
                jsonObject.put(Constants.TRACK_NAME, selectedRow(0).get(1))
              }
              else {
                jsonObject.put(Constants.TRACK_KM, "NULL")
                jsonObject.put(Constants.TRACK_NAME, "NULL")
              }
              val record = JSON.parse(jsonObject.toString()).asInstanceOf[DBObject]
              mongoColl.insert(record)
            }
        }
    }
  }
}
sys.addShutdownHook {
  ssc.stop(true, true)
}

ssc.start()
ssc.awaitTermination()
}
}

It looks like you're trying to broadcast an RDD. Try something like this:

val broadCastVal = gpsLookUpTable.collect()
val broadCastTable = sc.broadcast(broadCastVal)

You should be able to get the value you're expecting.
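
For illustration, here is a minimal sketch of that approach. As far as I understand it, an RDD or DataFrame is only a driver-side handle to distributed data, so it cannot be used inside a closure that runs on the executors, whereas a plain local collection can be broadcast and read there. The TrackPoint case class, its field names, and the sample rows are hypothetical stand-ins, not the real lookup table from the question, and the prefix match only approximates the `geoCode LIKE '...%'` filter.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical stand-in for one row of the lookup table (field names are assumptions).
final case class TrackPoint(geoCode: String, trackKm: Double, trackName: String)

object BroadcastLookupSketch {
  // Pure lookup over a local collection: first row whose geoCode starts with the prefix.
  def lookup(table: Array[TrackPoint], prefix: String): Option[TrackPoint] =
    table.find(_.geoCode.startsWith(prefix))

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("BroadcastLookupSketch")
    val sc = new SparkContext(conf)
    try {
      // Assumption: in the question this would come from gpsLookUpTable.collect(),
      // mapped to plain serializable values on the driver.
      val localTable = Array(
        TrackPoint("u33d", 12.5, "Track A"),
        TrackPoint("u33e", 40.0, "Track B")
      )
      // Broadcast the local Array, not the RDD/DataFrame itself.
      val broadcastTable = sc.broadcast(localTable)

      val prefixes = sc.parallelize(Seq("u33d", "u33e", "zzzz"), 2)
      val result = prefixes.mapPartitions { partition =>
        // .value is a plain Array here, so it is usable on the executors.
        val table = broadcastTable.value
        partition.map { p =>
          (p, lookup(table, p).map(_.trackName).getOrElse("NULL"))
        }
      }.collect()

      result.foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

The same `broadcastTable.value` access should work inside foreachPartition; mapPartitions is used here only so the results come back to the driver for printing. Note that collect() pulls the whole table onto the driver, so this only suits a lookup table small enough to fit in memory.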

I am not totally sure about this, but after running into it twice I am writing this answer. I could broadcast an RDD, but I was not able to access its value. If I create a List or a TreeMap, I am able to broadcast it and retrieve the value as well. I am not sure why, although I haven't found anywhere it is written that we can't broadcast an RDD.
