
Reuse Kafka producer in Spark Streaming

We have a Spark Streaming application (code below) that reads data from Kafka, applies some transformations to each message, and inserts the result into MongoDB. A middleware application pushes messages in bulk into Kafka and waits for an acknowledgement of each message from the Spark Streaming application. If the middleware does not receive the acknowledgement within a certain period (5 seconds) of sending a message to Kafka, it re-sends the message. The Spark Streaming application can receive around 50-100 messages in one batch and acknowledge all of them within 5 seconds, but if the middleware pushes more than 100 messages, the acknowledgements from Spark Streaming are delayed and the middleware starts re-sending messages. In our current implementation we create a producer every time we want to send an acknowledgement, which takes 3-4 seconds.

package com.testing

import org.apache.spark.streaming._
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{ Seconds, StreamingContext }
import org.apache.spark.{ SparkConf, SparkContext }
import org.apache.spark.streaming.kafka._
import org.apache.spark.sql.{ SQLContext, Row, Column, DataFrame }
import java.util.HashMap
import org.apache.kafka.clients.producer.{ KafkaProducer, ProducerConfig, ProducerRecord }
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

import org.joda.time._
import org.joda.time.format._

import org.json4s._
import org.json4s.JsonDSL._
import org.json4s.jackson.JsonMethods._
import com.mongodb.util.JSON

import scala.io.Source._
import java.util.Properties
import java.util.Calendar

import scala.collection.immutable
import org.json4s.DefaultFormats


object Sample_Streaming {

  def main(args: Array[String]) {

    val sparkConf = new SparkConf().setAppName("Sample_Streaming")
      .setMaster("local[4]")

    val sc = new SparkContext(sparkConf)
    sc.setLogLevel("ERROR")

    val sqlContext = new SQLContext(sc)
    val ssc = new StreamingContext(sc, Seconds(1))

    val props = new HashMap[String, Object]()


    val bootstrap_server_config = "127.0.0.100:9092"
    val zkQuorum = "127.0.0.101:2181"



    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrap_server_config)
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")

    val TopicMap = Map("sampleTopic" -> 1)
    val KafkaDstream = KafkaUtils.createStream(ssc, zkQuorum, "group", TopicMap).map(_._2)

    val schemaDf = sqlContext.read.format("com.mongodb.spark.sql.DefaultSource")
      .option("spark.mongodb.input.uri", "connectionURI")
      .option("spark.mongodb.input.collection", "schemaCollectionName")
      .load()

    val outSchema = schemaDf.schema
    var outDf = sqlContext.createDataFrame(sc.emptyRDD[Row], outSchema)

    KafkaDstream.foreachRDD(rdd => rdd.collect().map { x =>
      {
        val jsonInput: JValue = parse(x)


        /*Do all the transformations using Json libraries*/

        val json4s_transformed = "transformed json"

        val rdd = sc.parallelize(compact(render(json4s_transformed)) :: Nil)
        val df = sqlContext.read.schema(outSchema).json(rdd)

        df.write.option("spark.mongodb.output.uri", "connectionURI")
                  .option("collection", "Collection")
                  .mode("append").format("com.mongodb.spark.sql").save()

        val producer = new KafkaProducer[String, String](props)
        val message = new ProducerRecord[String, String]("topic_name", null, "message_received")

        producer.send(message)
        producer.close()


      }

    }

    )

    // Run the streaming job
    ssc.start()
    ssc.awaitTermination()
  }

}

So we tried another approach: creating the producer outside of foreachRDD and reusing it for the entire batch interval (code below). This seems to have helped, as we are no longer creating a producer every time we want to send an acknowledgement. But for some reason, when we monitor the application in the Spark UI, the streaming application's memory consumption increases steadily, which was not the case before. We tried using the --num-executors 1 option in spark-submit to limit the number of executors started by YARN.

object Sample_Streaming {

  def main(args: Array[String]) {

    val sparkConf = new SparkConf().setAppName("Sample_Streaming")
      .setMaster("local[4]")

    val sc = new SparkContext(sparkConf)
    sc.setLogLevel("ERROR")

    val sqlContext = new SQLContext(sc)
    val ssc = new StreamingContext(sc, Seconds(1))

    val props = new HashMap[String, Object]()


    val bootstrap_server_config = "127.0.0.100:9092"
    val zkQuorum = "127.0.0.101:2181"



    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrap_server_config)
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")

    val TopicMap = Map("sampleTopic" -> 1)
    val KafkaDstream = KafkaUtils.createStream(ssc, zkQuorum, "group", TopicMap).map(_._2)

    val schemaDf = sqlContext.read.format("com.mongodb.spark.sql.DefaultSource")
      .option("spark.mongodb.input.uri", "connectionURI")
      .option("spark.mongodb.input.collection", "schemaCollectionName")
      .load()

    val outSchema = schemaDf.schema
    val producer = new KafkaProducer[String, String](props)
    KafkaDstream.foreachRDD(rdd => 
          {

            rdd.collect().map ( x =>
            {

              val jsonInput: JValue = parse(x)


              /*Do all the transformations using Json libraries*/

              val json4s_transformed = "transformed json"

              val rdd = sc.parallelize(compact(render(json4s_transformed)) :: Nil)
              val df = sqlContext.read.schema(outSchema).json(rdd)

              df.write.option("spark.mongodb.output.uri", "connectionURI")
                        .option("collection", "Collection")
                        .mode("append").format("com.mongodb.spark.sql").save()


              val message = new ProducerRecord[String, String]("topic_name", null, "message_received")

              producer.send(message)
              producer.close()


            }

            )
        }

    )

    // Run the streaming job
    ssc.start()
    ssc.awaitTermination()
  }

}

My questions are:

  1. How do I monitor the Spark application's memory consumption? Currently we are manually checking the application every 5 minutes until it exhausts the memory available in our cluster (2 nodes, 16 GB each).

  2. What are the best practices followed in the industry when using Spark Streaming with Kafka?

Kafka is a broker: it gives you delivery guarantees for the producer and the consumer. Implementing an 'over the top' acknowledgement mechanism between the producer and the consumer is overkill. Ensure that the producer behaves correctly and that the consumer can recover from failures, and end-to-end delivery is ensured.
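For illustration, a minimal sketch of producer settings that lean on Kafka's own delivery guarantees rather than a custom acknowledgement round-trip could look like the following (the property values and the reliableProps / reliableProducer names are assumptions, not part of the original setup):

import java.util.Properties
import org.apache.kafka.clients.producer.{ KafkaProducer, ProducerConfig, ProducerRecord }

// Illustrative producer configuration relying on Kafka's delivery guarantees
val reliableProps = new Properties()
reliableProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "127.0.0.100:9092")
reliableProps.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")
reliableProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")
reliableProps.put(ProducerConfig.ACKS_CONFIG, "all")  // wait for all in-sync replicas to acknowledge
reliableProps.put(ProducerConfig.RETRIES_CONFIG, "3") // retry transient broker failures

val reliableProducer = new KafkaProducer[String, String](reliableProps)

// send() returns a java.util.concurrent.Future; blocking on it (or using a callback)
// surfaces delivery failures directly to the sender
reliableProducer.send(new ProducerRecord[String, String]("sampleTopic", "payload")).get()

With acks=all and a blocking get() (or a send callback), the middleware learns about delivery failures directly from the producer, so a separate acknowledgement message from Spark Streaming becomes unnecessary.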

Regarding the job, it is no wonder its performance is poor: the processing is done sequentially, element by element, all the way up to the write to the external DB. This is plain wrong and should be addressed before attempting to fix any memory consumption issue.

This process could be improved along these lines:

// Create the producer once and reuse it for the lifetime of the application
val producer = new KafkaProducer[String, String](props)

// Do the per-element JSON transformation inside the DStream, not on collected data
val jsonDStream = kafkaDstream.transform { rdd =>
  rdd.map { elem =>
    val json = parse(elem)
    compact(render(doAllTransformations(json))) // output is a String-formatted JSON document
  }
}

jsonDStream.foreachRDD { rdd =>
  val df = sqlContext.read.schema(outSchema).json(rdd) // transform the complete collection, not element by element
  df.write.option("spark.mongodb.output.uri", "connectionURI") // write in bulk, not one by one
    .option("collection", "Collection")
    .mode("append").format("com.mongodb.spark.sql").save()
  val msg = new ProducerRecord[String, String]("topic_name", null, "message_received")
  producer.send(msg)
  producer.flush() // force the send; do NOT close the producer, otherwise it cannot send any more messages
}

This process could be improved further by replacing all the string-centric JSON transformations with case class instances.
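As a rough sketch of that idea, assuming a hypothetical Measurement case class whose fields and transformation are placeholders for the real schema and logic:

import org.json4s._
import org.json4s.jackson.JsonMethods._
import org.apache.kafka.clients.producer.ProducerRecord

// Hypothetical message schema (define it at the top level); adjust the fields to the real JSON documents
case class Measurement(id: String, value: Double, timestamp: String)

val typedDStream = kafkaDstream.transform { rdd =>
  rdd.map { elem =>
    implicit val formats: Formats = DefaultFormats // needed by extract, created inside the closure
    val m = parse(elem).extract[Measurement]       // typed parsing instead of raw JValue manipulation
    m.copy(value = m.value * 2)                    // placeholder for the real transformations
  }
}

typedDStream.foreachRDD { rdd =>
  import sqlContext.implicits._
  val df = rdd.toDF()                              // schema is derived from the case class
  df.write.option("spark.mongodb.output.uri", "connectionURI")
    .option("collection", "Collection")
    .mode("append").format("com.mongodb.spark.sql").save()
  producer.send(new ProducerRecord[String, String]("topic_name", null, "message_received"))
  producer.flush()
}

Working with typed objects keeps the transformations compile-time checked and lets Spark derive the DataFrame schema from the case class instead of round-tripping through JSON strings.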
