
Scala spark kafka code - functional approach

I have the following code in Scala. I am using Spark SQL to pull data from Hadoop, perform a group by on the result, serialize it, and then write that message to Kafka.

I've written the code, but I want to write it in a functional way. Should I create a new class with a function 'getCategories' to get the categories from Hadoop? I am not sure how to approach this.

Here is the code

class ExtractProcessor {
  def process(): Unit = {

    implicit val formats = DefaultFormats

    val spark = SparkSession.builder().appName("test app").getOrCreate()

    try {
      val df = spark.sql("SELECT DISTINCT SUBCAT_CODE, SUBCAT_NAME, CAT_CODE, CAT_NAME " +
        "FROM CATEGORY_HIERARCHY " +
        "ORDER BY CAT_CODE, SUBCAT_CODE ")

      val result = df.collect().groupBy(row => (row(2), row(3)))
      val categories = result.map(cat =>
        category(cat._1._1.toString(), cat._1._2.toString(),
          cat._2.map(subcat =>
            subcategory(subcat(0).toString(), subcat(1).toString())).toList))

      val jsonMessage = write(categories)
      val kafkaKey = java.security.MessageDigest.getInstance("SHA-1")
        .digest(jsonMessage.getBytes("UTF-8"))
        .map("%02x".format(_)).mkString
      val key = write(kafkaKey)

      Logger.log.info(s"Json Message: ${jsonMessage}")
      Logger.log.info(s"Kafka Key: ${key}")

      KafkaUtil.apply.send(key, jsonMessage, "testTopic")
    }
  }
}

And here is the Kafka code:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

class KafkaUtil {
  def send(key: String, message: String, topicName: String): Unit = {
    val properties = new Properties()
    properties.put("bootstrap.servers", "localhost:9092")
    properties.put("client.id", "test publisher")
    properties.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    properties.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](properties)

    try {
      val record = new ProducerRecord[String, String](topicName, key, message)
      producer.send(record)
    }
    finally {
      producer.close()
      Logger.log.info("Kafka producer closed...")
    }
  }
}

object KafkaUtil {
  def apply: KafkaUtil = {
    new KafkaUtil
  }
}

Also, for writing unit tests, what should I be testing in the functional approach? In OOP we unit test the business logic, but in my Scala code there is hardly any business logic.

Any help is appreciated.

Thanks in advance, Suyog

Your code consists of 1) loading the data into a Spark DataFrame, 2) crunching the data, 3) creating a JSON message, and 4) sending the JSON message to Kafka.

Unit tests are good for testing pure functions. You can extract step 2) into a method with a signature like def getCategories(df: DataFrame): Seq[Category] and cover it with a test. In the test, the DataFrame is generated from a plain hard-coded in-memory sequence.
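A minimal sketch of that extraction, assuming hypothetical Category and Subcategory case classes that mirror the category/subcategory classes used in the question:

import org.apache.spark.sql.DataFrame

// Hypothetical domain classes mirroring the question's category/subcategory.
case class Subcategory(subcatCode: String, subcatName: String)
case class Category(catCode: String, catName: String, subcategories: List[Subcategory])

object CategoryExtractor {
  // Pure transformation: DataFrame in, domain objects out.
  // No SparkSession creation, no Kafka, so it is easy to unit test.
  def getCategories(df: DataFrame): Seq[Category] =
    df.collect()
      .groupBy(row => (row(2).toString, row(3).toString))
      .map { case ((catCode, catName), rows) =>
        Category(catCode, catName,
          rows.map(r => Subcategory(r(0).toString, r(1).toString)).toList)
      }
      .toSeq
}

// In a test the DataFrame comes from a hard-coded sequence, for example:
//   import spark.implicits._
//   val df = Seq(("S1", "Sub 1", "C1", "Cat 1"))
//     .toDF("SUBCAT_CODE", "SUBCAT_NAME", "CAT_CODE", "CAT_NAME")
//   assert(CategoryExtractor.getCategories(df) ==
//     Seq(Category("C1", "Cat 1", List(Subcategory("S1", "Sub 1")))))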

Step 3) can also be covered by a unit test if you feel it is error-prone.
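Step 3) can likewise live in its own pure function. A sketch, assuming json4s with the jackson backend (the question's write/DefaultFormats suggest json4s); the MessageBuilder name is made up for illustration:

import org.json4s.{DefaultFormats, Formats}
import org.json4s.jackson.Serialization.write  // use org.json4s.native.Serialization if you are on json4s-native

object MessageBuilder {
  implicit val formats: Formats = DefaultFormats

  // Pure function: domain objects in, (kafka key, json payload) out.
  // Can be asserted against a fixed expected JSON string and SHA-1 digest.
  def build(categories: Seq[Category]): (String, String) = {
    val json = write(categories)
    val key = java.security.MessageDigest.getInstance("SHA-1")
      .digest(json.getBytes("UTF-8"))
      .map("%02x".format(_))
      .mkString
    (key, json)
  }
}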

Steps 1) and 4) are best covered by an end-to-end test.
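Once steps 2) and 3) are extracted, process() becomes thin wiring around the side effects, which is exactly what the end-to-end test exercises. A sketch, reusing the hypothetical CategoryExtractor and MessageBuilder from above:

import org.apache.spark.sql.SparkSession

class ExtractProcessor {
  def process(): Unit = {
    val spark = SparkSession.builder().appName("test app").getOrCreate()
    try {
      // Step 1: load
      val df = spark.sql(
        "SELECT DISTINCT SUBCAT_CODE, SUBCAT_NAME, CAT_CODE, CAT_NAME " +
        "FROM CATEGORY_HIERARCHY ORDER BY CAT_CODE, SUBCAT_CODE")
      // Steps 2 and 3: pure, unit-tested functions
      val categories = CategoryExtractor.getCategories(df)
      val (key, jsonMessage) = MessageBuilder.build(categories)
      // Step 4: side effect
      KafkaUtil.apply.send(key, jsonMessage, "testTopic")
    } finally {
      spark.stop()
    }
  }
}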

By the way, val result = df.collect().groupBy(row => (row(2), row(3))) is inefficient: it pulls every row to the driver and only then groups. It is better to let Spark do the grouping on the executors and collect only the grouped result.
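A sketch of what executor-side grouping could look like with the DataFrame API, assuming the column names from the query above; only the already-grouped rows reach the driver:

import org.apache.spark.sql.functions.{collect_list, struct}

val grouped = df
  .groupBy("CAT_CODE", "CAT_NAME")
  .agg(collect_list(struct("SUBCAT_CODE", "SUBCAT_NAME")).as("subcategories"))
  .collect()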

Also, there is no need to initialize a new KafkaProducer for each single message; the producer can be created once and reused.
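A sketch of a reusable producer (the KafkaSink name and shape are illustrative, not part of the original code):

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object KafkaSink {
  // One producer per JVM, created lazily and reused for every message.
  private lazy val producer: KafkaProducer[String, String] = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("client.id", "test publisher")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    new KafkaProducer[String, String](props)
  }

  def send(key: String, message: String, topicName: String): Unit =
    producer.send(new ProducerRecord[String, String](topicName, key, message))

  // Flush and close once, e.g. from a shutdown hook, not after every send.
  def close(): Unit = producer.close()
}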
