
Reuse Kafka producer in Spark Streaming

We have a Spark Streaming application (code below) that sources data from Kafka and applies some transformations (on each message) before inserting the data into MongoDB. A middleware application pushes the messages (in bulk) into Kafka and waits for an acknowledgement (for each message) from the Spark Streaming application. If the middleware does not receive the acknowledgement within a certain period of time (5 seconds) after sending a message into Kafka, it re-sends the message. The Spark Streaming application can receive around 50-100 messages (in one batch) and acknowledge all of them within 5 seconds. But when the middleware pushes over 100 messages, the delay in sending acknowledgements causes it to re-send messages. In our current implementation, we create the producer each time we want to send an acknowledgement, which takes 3-4 seconds.

package com.testing

import org.apache.spark.streaming._
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{ Seconds, StreamingContext }
import org.apache.spark.{ SparkConf, SparkContext }
import org.apache.spark.streaming.kafka._
import org.apache.spark.sql.{ SQLContext, Row, Column, DataFrame }
import java.util.HashMap
import org.apache.kafka.clients.producer.{ KafkaProducer, ProducerConfig, ProducerRecord }
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

import org.joda.time._
import org.joda.time.format._

import org.json4s._
import org.json4s.JsonDSL._
import org.json4s.jackson.JsonMethods._
import com.mongodb.util.JSON

import scala.io.Source._
import java.util.Properties
import java.util.Calendar

import scala.collection.immutable
import org.json4s.DefaultFormats


object Sample_Streaming {

  def main(args: Array[String]) {

    val sparkConf = new SparkConf().setAppName("Sample_Streaming")
      .setMaster("local[4]")

    val sc = new SparkContext(sparkConf)
    sc.setLogLevel("ERROR")

    val sqlContext = new SQLContext(sc)
    val ssc = new StreamingContext(sc, Seconds(1))

    val props = new HashMap[String, Object]()


    val bootstrap_server_config = "127.0.0.100:9092"
    val zkQuorum = "127.0.0.101:2181"



    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrap_server_config)
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")

    val TopicMap = Map("sampleTopic" -> 1)
    val KafkaDstream = KafkaUtils.createStream(ssc, zkQuorum, "group", TopicMap).map(_._2)

    val schemaDf = sqlContext.read.format("com.mongodb.spark.sql.DefaultSource")
      .option("spark.mongodb.input.uri", "connectionURI")
      .option("spark.mongodb.input.collection", "schemaCollectionName")
      .load()

    val outSchema = schemaDf.schema
    var outDf = sqlContext.createDataFrame(sc.emptyRDD[Row], outSchema)

    KafkaDstream.foreachRDD(rdd => rdd.collect().map { x =>
      {
        val jsonInput: JValue = parse(x)


        /*Do all the transformations using Json libraries*/

        val json4s_transformed = "transformed json"

        val rdd = sc.parallelize(compact(render(json4s_transformed)) :: Nil)
        val df = sqlContext.read.schema(outSchema).json(rdd)

        df.write.option("spark.mongodb.output.uri", "connectionURI")
                  .option("collection", "Collection")
                  .mode("append").format("com.mongodb.spark.sql").save()

        val producer = new KafkaProducer[String, String](props)
        val message = new ProducerRecord[String, String]("topic_name", null, "message_received")

        producer.send(message)
        producer.close()


      }

    }

    )

    // Run the streaming job
    ssc.start()
    ssc.awaitTermination()
  }

}

So we tried another approach: creating the producer outside of foreachRDD and reusing it for the entire batch interval (code below). This seems to have helped, as we no longer create the producer each time we want to send an acknowledgement. But for some reason, when we monitor the application on the Spark UI, the streaming application's memory consumption increases steadily, which was not the case before. We tried using the --num-executors 1 option in spark-submit to limit the number of executors started by YARN.

object Sample_Streaming {

  def main(args: Array[String]) {

    val sparkConf = new SparkConf().setAppName("Sample_Streaming")
      .setMaster("local[4]")

    val sc = new SparkContext(sparkConf)
    sc.setLogLevel("ERROR")

    val sqlContext = new SQLContext(sc)
    val ssc = new StreamingContext(sc, Seconds(1))

    val props = new HashMap[String, Object]()


    val bootstrap_server_config = "127.0.0.100:9092"
    val zkQuorum = "127.0.0.101:2181"



    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrap_server_config)
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")

    val TopicMap = Map("sampleTopic" -> 1)
    val KafkaDstream = KafkaUtils.createStream(ssc, zkQuorum, "group", TopicMap).map(_._2)

    val schemaDf = sqlContext.read.format("com.mongodb.spark.sql.DefaultSource")
      .option("spark.mongodb.input.uri", "connectionURI")
      .option("spark.mongodb.input.collection", "schemaCollectionName")
      .load()

    val outSchema = schemaDf.schema
    val producer = new KafkaProducer[String, String](props)
    KafkaDstream.foreachRDD(rdd => 
          {

            rdd.collect().map ( x =>
            {

              val jsonInput: JValue = parse(x)


              /*Do all the transformations using Json libraries*/

              val json4s_transformed = "transformed json"

              val rdd = sc.parallelize(compact(render(json4s_transformed)) :: Nil)
              val df = sqlContext.read.schema(outSchema).json(rdd)

              df.write.option("spark.mongodb.output.uri", "connectionURI")
                        .option("collection", "Collection")
                        .mode("append").format("com.mongodb.spark.sql").save()


              val message = new ProducerRecord[String, String]("topic_name", null, "message_received")

              producer.send(message)
              producer.close()


            }

            )
        }

    )

    // Run the streaming job
    ssc.start()
    ssc.awaitTermination()
  }

}

My questions are:

  1. How do I monitor the Spark application's memory consumption? Currently we check the application manually every 5 minutes until it exhausts the memory available in our cluster (2 nodes, 16 GB each).

  2. What are the industry best practices for using Spark Streaming with Kafka?

Kafka is a broker: it gives you delivery guarantees for the producer and the consumer. It's overkill to implement an 'over the top' acknowledgement mechanism between the producer and the consumer. Ensure that the producer behaves correctly and that the consumer can recover in case of failure, and end-to-end delivery will be ensured.
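To make that concrete, here is a minimal sketch of what "ensure that the producer behaves correctly" can look like for the acknowledgement producer. The acks and retries values and the error-logging callback are illustrative assumptions, not settings taken from the question; the bootstrap server, serializers and topic name are reused from the code above.

import java.util.Properties
import org.apache.kafka.clients.producer.{ Callback, KafkaProducer, ProducerConfig, ProducerRecord, RecordMetadata }

val ackProps = new Properties()
ackProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "127.0.0.100:9092")
ackProps.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")
ackProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")
ackProps.put(ProducerConfig.ACKS_CONFIG, "all")  // broker acknowledges only after all in-sync replicas have the record
ackProps.put(ProducerConfig.RETRIES_CONFIG, "3") // retry transient send failures instead of dropping the acknowledgement

// Created once and reused; send() is asynchronous and the callback surfaces broker-side errors.
val ackProducer = new KafkaProducer[String, String](ackProps)

def sendAck(value: String): Unit = {
  val record = new ProducerRecord[String, String]("topic_name", null, value)
  ackProducer.send(record, new Callback {
    override def onCompletion(metadata: RecordMetadata, exception: Exception): Unit =
      if (exception != null) exception.printStackTrace() // handle/log failed acknowledgements here
  })
}

sendAck("message_received")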

Regarding the job, it is no wonder that its performance is poor: the processing is done sequentially, element by element, all the way to the write to the external DB. This is plainly wrong and should be addressed before attempting to fix any memory consumption issues.

This process could be improved like this:

val producer = new KafkaProducer[String, String](props) // create the producer once, on the driver, and reuse it

val jsonDStream = kafkaDstream.transform { rdd => rdd.map { elem =>
    val json = parse(elem)
    compact(render(doAllTransformations(json))) // output should be a String-formatted JSON object
  }
}

jsonDStream.foreachRDD { rdd =>
  val df = sqlContext.read.schema(outSchema).json(rdd) // transform the complete collection, not element by element
  df.write.option("spark.mongodb.output.uri", "connectionURI") // write in bulk, not one by one
    .option("collection", "Collection")
    .mode("append").format("com.mongodb.spark.sql").save()
  val msg = new ProducerRecord[String, String]("topic_name", null, "message_received") // create the acknowledgement message
  producer.send(msg)
  producer.flush() // force the send. *DO NOT close* the producer, otherwise it will not be able to send any more messages
}

This process could be improved further if we replaced all the string-centric JSON transformations with case class instances.
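As a rough sketch of that idea (the AckEvent case class and its fields are hypothetical, invented only for illustration; kafkaDstream, sqlContext and the MongoDB options are the same as in the sketch above): parse each message straight into a typed instance with json4s and let Spark derive the schema from the case class, instead of rendering intermediate JSON strings.

import org.json4s._
import org.json4s.jackson.JsonMethods._ // same json4s imports as in the question

case class AckEvent(id: String, payload: String) // hypothetical message shape; replace with your real fields

val eventDStream = kafkaDstream.transform { rdd =>
  rdd.map { elem =>
    implicit val formats: Formats = DefaultFormats // json4s needs implicit formats for extract
    parse(elem).extract[AckEvent]                  // typed instances instead of JSON strings
  }
}

eventDStream.foreachRDD { rdd =>
  import sqlContext.implicits._
  val df = rdd.toDF() // schema is derived from the case class, no separate outSchema needed
  df.write.option("spark.mongodb.output.uri", "connectionURI")
    .option("collection", "Collection")
    .mode("append").format("com.mongodb.spark.sql").save()
  // the acknowledgement (producer.send plus producer.flush) stays the same as in the sketch above
}

The transformations themselves then become ordinary functions over AckEvent values rather than manipulations of JValue trees.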
