Transform DStream RDD using external data

We are developing a Spark Streaming ETL application that sources data from Kafka, applies the necessary transformations, and loads the data into MongoDB. The data received from Kafka is in JSON format. The transformations are applied to each element (a JSON string) of the RDD based on lookup data fetched from MongoDB. Since the lookup data changes, it has to be fetched for every batch interval. The lookup data is read from MongoDB using SQLContext.read. I was not able to use SQLContext.read inside DStream.transform, because the SQLContext is not serializable and therefore cannot be broadcast to the worker nodes. I am now trying DStream.foreachRDD, inside which I fetch the data from MongoDB and broadcast the lookup data to the workers. All the transformations on the RDD elements are performed inside the rdd.map closure, which uses the broadcast data, performs the transformations, and returns an RDD. The RDD is then converted to a DataFrame and written to MongoDB. Currently, this application is running very slowly.

PS: If I move the part of the code that fetches the lookup data out of DStream.foreachRDD, add a DStream.transform to apply the transformations, and have DStream.foreachRDD only insert the data into MongoDB, the performance is very good. But with this approach the lookup data is not updated for each batch interval (a sketch of this variant follows the pseudo code below).

I am seeking help in understanding whether this is a good approach, and guidance on improving its performance.

Following is the pseudo code:

package com.testing

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types.{StringType, StructField, StructType, TimestampType}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils
import org.json4s._
import org.json4s.JsonDSL._
import org.json4s.jackson.JsonMethods._

object Pseudo_Code {
  def main(args: Array[String]) {

    val sparkConf = new SparkConf().setAppName("Pseudo_Code")
      .setMaster("local[4]")

    val sc = new SparkContext(sparkConf)
    sc.setLogLevel("ERROR")

    val sqlContext = new SQLContext(sc)
    val ssc = new StreamingContext(sc, Seconds(1))

    val mongoIP = "127.0.0.1:27017"

    val DBConnectionURI = "mongodb://" + mongoIP + "/" + "DBName"

    val bootstrap_server_config = "127.0.0.100:9092"
    val zkQuorum = "127.0.0.101:2181"

    val group = "streaming"

    val TopicMap = Map("sampleTopic" -> 1)


    val KafkaDStream = KafkaUtils.createStream(ssc, zkQuorum,  group,  TopicMap).map(_._2)

    KafkaDStream.foreachRDD { rdd =>
      if (rdd.count() > 0) {

        // This lookup data has the information required to perform the transformation.
        // This information keeps changing, so the data should be fetched for every batch interval.
        val lookup1 = sqlContext.read.format("com.mongodb.spark.sql.DefaultSource")
          .option("spark.mongodb.input.uri", DBConnectionURI)
          .option("spark.mongodb.input.collection", "lookupCollection1")
          .load()

        val broadcastLkp1 = sc.broadcast(lookup1)

        val new_rdd = rdd.map { elem =>
          val json: JValue = parse(elem)

          // For each element in the RDD, some values should be looked up from the broadcast lookup data.
          // "value" is extracted from the JSON.
          val param1 = broadcastLkp1.value.filter(broadcastLkp1.value("key") === "value").select("param1").distinct()
          val param1ReplaceNull = if (param1.count() == 0) {
            "constant"
          } else {
            param1.head().getString(0)
          }
          // Create a new JSON with a different structure.
          val new_JSON = """"""

          compact(render(new_JSON))
        }

        val targetSchema = new StructType(Array(StructField("key1", StringType, true),
          StructField("key2", TimestampType, true)))
        val transformedDf = sqlContext.read.schema(targetSchema).json(new_rdd)


        transformedDf.write
          .option("spark.mongodb.output.uri", DBConnectionURI)
          .option("collection", "targetCollectionName")
          .mode("append").format("com.mongodb.spark.sql").save()
      }
    }

    // Run the streaming job
    ssc.start()
    ssc.awaitTermination()
  }
}
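
For reference, the variant described in the PS above would look roughly like the sketch below. Only the foreachRDD version was posted, so the exact structure here is an assumption; it reuses the same vals from the pseudo code (sqlContext, sc, KafkaDStream, DBConnectionURI, broadcastLkp1), reads the lookup data once on the driver (which is why it does not refresh per batch), applies the transformations in DStream.transform, and keeps only the MongoDB write inside DStream.foreachRDD.

// Sketch of the faster variant from the PS: the lookup data is read once on the driver,
// not once per batch interval, so it never refreshes.
val lookup1 = sqlContext.read.format("com.mongodb.spark.sql.DefaultSource")
  .option("spark.mongodb.input.uri", DBConnectionURI)
  .option("spark.mongodb.input.collection", "lookupCollection1")
  .load()
val broadcastLkp1 = sc.broadcast(lookup1)

// Per-element transformation moved into DStream.transform.
val transformedDStream = KafkaDStream.transform { rdd =>
  rdd.map { elem =>
    val json: JValue = parse(elem)
    // ... same per-element lookup against broadcastLkp1 and JSON rewrite as above ...
    compact(render(""""""))
  }
}

// foreachRDD only converts each batch to a DataFrame and writes it to MongoDB.
transformedDStream.foreachRDD { rdd =>
  if (rdd.count() > 0) {
    val targetSchema = new StructType(Array(StructField("key1", StringType, true),
      StructField("key2", TimestampType, true)))
    val transformedDf = sqlContext.read.schema(targetSchema).json(rdd)
    transformedDf.write
      .option("spark.mongodb.output.uri", DBConnectionURI)
      .option("collection", "targetCollectionName")
      .mode("append").format("com.mongodb.spark.sql").save()
  }
}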

Upon researching, the solution that worked was to cache the broadcast DataFrame after it is read by the workers. Following is the code change that I had to make to improve the performance:

val new_rdd = rdd.map { elem =>
  val json: JValue = parse(elem)

  // For each element in the RDD, some values should be looked up from the broadcast lookup data.
  // "value" is extracted from the JSON.
  val lkp_bd = broadcastLkp1.value
  lkp_bd.cache()
  val param1 = lkp_bd.filter(broadcastLkp1.value("key") === "value").select("param1").distinct()
  val param1ReplaceNull = if (param1.count() == 0) {
    "constant"
  } else {
    param1.head().getString(0)
  }
  // Create a new JSON with a different structure.
  val new_JSON = """"""

  compact(render(new_JSON))
}

On a side note, this approach has a problem when run on a cluster: a NullPointerException is thrown while accessing the broadcast DataFrame. I have created another thread for that issue: Spark Streaming - null pointer exception while accessing broadcast variable.
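
For completeness, one way to sidestep both the per-element DataFrame operations and the NullPointerException would be to collect the lookup data to the driver once per batch and broadcast a plain Scala Map instead of the DataFrame, so the executors never touch a DataFrame or the SQLContext. The following is only a sketch: it reuses the names from the pseudo code above and assumes the lookup collection is small enough to collect to the driver and that each key maps to a single param1 value.

KafkaDStream.foreachRDD { rdd =>
  if (rdd.count() > 0) {
    // Read the latest lookup data on the driver for this batch interval.
    val lookup1 = sqlContext.read.format("com.mongodb.spark.sql.DefaultSource")
      .option("spark.mongodb.input.uri", DBConnectionURI)
      .option("spark.mongodb.input.collection", "lookupCollection1")
      .load()

    // Collect it into a plain Map(key -> param1) and broadcast that instead of the DataFrame.
    val lookupMap = lookup1.select("key", "param1")
      .collect()
      .map(row => row.getString(0) -> row.getString(1))
      .toMap
    val broadcastLkpMap = sc.broadcast(lookupMap)

    val new_rdd = rdd.map { elem =>
      val json: JValue = parse(elem)
      // "value" stands for the key extracted from the JSON, as in the original code;
      // fall back to a constant when the key is not present in the lookup data.
      val param1ReplaceNull = broadcastLkpMap.value.getOrElse("value", "constant")
      // ... build and return the new JSON as before ...
      compact(render(""""""))
    }

    val targetSchema = new StructType(Array(StructField("key1", StringType, true),
      StructField("key2", TimestampType, true)))
    val transformedDf = sqlContext.read.schema(targetSchema).json(new_rdd)
    transformedDf.write
      .option("spark.mongodb.output.uri", DBConnectionURI)
      .option("collection", "targetCollectionName")
      .mode("append").format("com.mongodb.spark.sql").save()

    // Release the per-batch broadcast once the write has completed.
    broadcastLkpMap.unpersist()
  }
}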
