
Write into Kafka topic using Spark and Scala

I am reading data from a Kafka topic and writing the received data back into another Kafka topic.

Below is my code:

    import org.apache.spark.sql.types._
    import org.apache.spark.sql.functions._
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
    import org.apache.spark.sql.ForeachWriter

    // Load data from Kafka
    val data = spark.readStream.format("kafka")
      .option("kafka.bootstrap.servers", "*******:9092")
      .option("subscribe", "PARAMTABLE")
      .option("startingOffsets", "latest")
      .load()

    // Extract the value from the JSON payload
    val schema = new StructType()
      .add("PARAM_INSTANCE_ID", IntegerType)
      .add("ENTITY_ID", IntegerType)
      .add("PARAM_NAME", StringType)
      .add("VALUE", StringType)
    val df1 = data.selectExpr("CAST(value AS STRING)")
    val dataDF = df1.select(from_json(col("value"), schema).as("data")).select("data.*")

    // Insert into another Kafka topic
    val topic = "SparkParamValues"
    val brokers = "********:9092"
    val writer = new KafkaSink(topic, brokers)
    val query = dataDF.writeStream
      .foreach(writer)
      .outputMode("update")
      .start()
      .awaitTermination()

I am getting the below error:

    <console>:47: error: not found: type KafkaSink
           val writer = new KafkaSink(topic, brokers)

I am very new to Spark. Can someone suggest how to resolve this, or verify whether the above code is correct? Thanks in advance.

In Spark Structured Streaming, you can write to a Kafka topic after reading from another topic either by using the existing DataStreamWriter for Kafka, or by creating your own sink by extending the ForeachWriter class.

Without using a custom sink:

You can use the code below to write a dataframe to Kafka, assuming df is the dataframe generated by reading from a Kafka topic. The dataframe should have at least one column named value. If you have multiple columns, you should merge them into a single column named value (see the sketch after the code block). If no key column is specified, the key will be null in the destination topic.

  df.select("key", "value")
    .writeStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
    .option("topic", "<topicName>")
    .option("checkpointLocation", "<checkpointDir>")  // the Kafka sink requires a checkpoint location
    .start()
    .awaitTermination()
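
For example, the dataDF built in the question has four columns. A minimal sketch of packing them into a single value column could look like this (the column names come from the question's schema; using PARAM_NAME as the key is only an illustrative choice):

  import org.apache.spark.sql.functions.{col, struct, to_json}

  // Pack every column of dataDF into one JSON string named "value";
  // PARAM_NAME is reused here as the (optional) record key.
  val kafkaReadyDF = dataDF.select(
    col("PARAM_NAME").cast("string").as("key"),
    to_json(struct(dataDF.columns.map(col): _*)).as("value")
  )

The resulting kafkaReadyDF can then be passed to the writeStream code shown above.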

Using a custom sink:

If you want to implement your own Kafka sink, you need to create a class that extends ForeachWriter. You need to override a few methods and pass an object of that class to the foreach() method.

   import org.apache.spark.sql.{ForeachWriter, Row}

   // Using an anonymous class to extend ForeachWriter
   df.writeStream.foreach(new ForeachWriter[Row] {
   // If you are writing a Dataset[String], use new ForeachWriter[String] instead

     def open(partitionId: Long, version: Long): Boolean = {
       // open the connection; return true if this partition should be processed
       true
     }

     def process(record: Row): Unit = {
       // write the row to the connection
     }

     def close(errorOrNull: Throwable): Unit = {
       // close the connection
     }
   }).start()

You can check this Databricks notebook for the implemented code (scroll down and check the code under the Kafka Sink heading); I think that is the page you are referring to. To solve the issue, you need to make sure that the KafkaSink class is available to your Spark code. You can keep the Spark code file and the class file in the same package, or, if you are running on spark-shell, paste the KafkaSink class before pasting the Spark code.
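
If you go the custom-sink route, a minimal sketch of such a KafkaSink class could look like the following. It assumes string keys and values and rows shaped like the dataDF from the question; the serialization in process() is only illustrative and should be adapted to your schema:

  import java.util.Properties
  import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
  import org.apache.spark.sql.{ForeachWriter, Row}

  class KafkaSink(topic: String, servers: String) extends ForeachWriter[Row] {

    private val kafkaProperties = new Properties()
    kafkaProperties.put("bootstrap.servers", servers)
    kafkaProperties.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    kafkaProperties.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    private var producer: KafkaProducer[String, String] = _

    def open(partitionId: Long, version: Long): Boolean = {
      // create one producer per partition/epoch
      producer = new KafkaProducer[String, String](kafkaProperties)
      true
    }

    def process(row: Row): Unit = {
      // serialize the whole row as a comma-separated string; adapt this to your schema
      producer.send(new ProducerRecord[String, String](topic, row.mkString(",")))
    }

    def close(errorOrNull: Throwable): Unit = {
      if (producer != null) producer.close()
    }
  }

Define (or paste) this class before the streaming code so that new KafkaSink(topic, brokers) in the question's snippet resolves.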

Read the Structured Streaming Kafka integration guide to explore more.
