
Spark: Reading Avro messages from Kafka using Spark Scala

I am trying to read Avro messages from Kafka in Spark 2.4.3 using the code below.

The schema is stored in a Confluent Schema Registry when the data gets published to Kafka. I have already tried some of the solutions discussed here (Integrating Spark Structured Streaming with the Confluent Schema Registry / Reading Avro messages from Kafka with Spark 2.0.2 (structured streaming)), but I could not make them work, or I could not find a proper way of doing this, especially when the schema is stored in a Schema Registry.

This is the code I am currently trying. At least I am able to get some output, but every record comes out as null values. The topic actually has data. Could someone please help me with this?

import io.confluent.kafka.schemaregistry.client.{CachedSchemaRegistryClient, SchemaRegistryClient}
import io.confluent.kafka.serializers.AbstractKafkaAvroDeserializer
import org.apache.avro.Schema
import org.apache.avro.generic.GenericRecord
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.avro.SchemaConverters

object ScalaSparkAvroConsumer {

    private val topic = "customer.v1"
    private val kafkaUrl = "localhost:9092"
    private val schemaRegistryUrl = "http://127.0.0.1:8081"

    private val schemaRegistryClient = new CachedSchemaRegistryClient(schemaRegistryUrl, 128)
    private val kafkaAvroDeserializer = new AvroDeserializer(schemaRegistryClient)

    // Fetch the latest value schema for the topic from the registry and
    // convert it to a Spark SQL schema for use with from_json.
    private val avroSchema = schemaRegistryClient.getLatestSchemaMetadata(topic + "-value").getSchema
    private val sparkSchema = SchemaConverters.toSqlType(new Schema.Parser().parse(avroSchema))

    def main(args: Array[String]): Unit = {
      val spark = getSparkSession()

      spark.sparkContext.setLogLevel("ERROR")

      // Register a UDF that runs the Confluent deserializer over the raw
      // Avro bytes and returns the record's string (JSON-like) representation.
      spark.udf.register("deserialize", (bytes: Array[Byte]) =>
        DeserializerWrapper.deserializer.deserialize(bytes)
      )

      val df = spark
        .readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", kafkaUrl)
        .option("subscribe", topic)
        .option("startingOffsets", "earliest")
        .load()

      val valueDataFrame = df.selectExpr("""deserialize(value) AS message""")

      import org.apache.spark.sql.functions._

      // Parse the string produced by the UDF with the schema fetched from the registry.
      val formattedDataFrame = valueDataFrame.select(
        from_json(col("message"), sparkSchema.dataType).alias("parsed_value"))
        .select("parsed_value.*")

      formattedDataFrame
        .writeStream
        .format("console")
        .option("truncate", false)
        .start()
        .awaitTermination()
    }

    // Helper assumed here so the snippet compiles: builds a simple local SparkSession.
    private def getSparkSession(): SparkSession =
      SparkSession.builder()
        .appName("ScalaSparkAvroConsumer")
        .master("local[*]")
        .getOrCreate()

    object DeserializerWrapper {
      val deserializer = kafkaAvroDeserializer
    }

    class AvroDeserializer extends AbstractKafkaAvroDeserializer {
      def this(client: SchemaRegistryClient) {
        this()
        this.schemaRegistry = client
      }

      override def deserialize(bytes: Array[Byte]): String = {
        val genericRecord = super.deserialize(bytes).asInstanceOf[GenericRecord]
        genericRecord.toString
      }
    }
}

I am getting the output below:

-------------------------------------------
Batch: 0
-------------------------------------------
+------+-------+
|header|control|
+------+-------+
|null  |null   |
|null  |null   |
|null  |null   |
|null  |null   |
+------+-------+
only showing top 20 rows        

Integrating Avro serialization, the Kafka Schema Registry, and Spark Streaming is much easier with from_confluent_avro(). You can find it here:

https://github.com/AbsaOSS/ABRiS
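
For reference, here is a minimal sketch of what that looks like, assuming the ABRiS 3.x API (from_confluent_avro and the SchemaManager config keys; ABRiS 4.x replaces these with from_avro plus an AbrisConfig builder), reusing the topic and registry URL from the question:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import za.co.absa.abris.avro.functions.from_confluent_avro
import za.co.absa.abris.avro.read.confluent.SchemaManager

object AbrisAvroConsumer {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("AbrisAvroConsumer")
      .master("local[*]")
      .getOrCreate()

    // Point ABRiS at the Schema Registry and tell it which subject/version
    // to use for the value schema (topic-name subject strategy, latest version).
    val registryConfig = Map(
      SchemaManager.PARAM_SCHEMA_REGISTRY_URL -> "http://127.0.0.1:8081",
      SchemaManager.PARAM_SCHEMA_REGISTRY_TOPIC -> "customer.v1",
      SchemaManager.PARAM_VALUE_SCHEMA_NAMING_STRATEGY ->
        SchemaManager.SchemaStorageNamingStrategies.TOPIC_NAME,
      SchemaManager.PARAM_VALUE_SCHEMA_ID -> "latest"
    )

    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "customer.v1")
      .option("startingOffsets", "earliest")
      .load()

    // from_confluent_avro strips the Confluent wire-format header (magic byte
    // plus 4-byte schema id) and decodes the Avro payload into a struct column,
    // so no hand-written deserializer UDF is needed.
    df.select(from_confluent_avro(col("value"), registryConfig).as("data"))
      .select("data.*")
      .writeStream
      .format("console")
      .option("truncate", false)
      .start()
      .awaitTermination()
  }
}

With this, the decoded fields show up as proper columns in the console sink instead of a single string that has to be re-parsed with from_json.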

