Scala: Error reading Kafka Avro messages from Spark Structured Streaming
Spark: Reading Avro messages from Kafka using Spark Scala
I am on Spark 2.4.3 and trying to read Avro messages from Kafka with the code below. The schema is stored in the Confluent Schema Registry when the data is published to Kafka. I have already tried some of the solutions discussed here (Integrating Spark Structured Streaming with the Confluent Schema Registry / Reading Avro messages from Kafka with Spark 2.0.2 (structured streaming)) but could not make them work, or I could not find a proper way of doing this, especially when the schema is stored in a Schema Registry.
Here is the current code I am trying. At least I am able to get some results, but all the records come out as null values. The topic actually has data. Could someone please help me with this?
import io.confluent.kafka.schemaregistry.client.{CachedSchemaRegistryClient, SchemaRegistryClient}
import io.confluent.kafka.serializers.AbstractKafkaAvroDeserializer
import org.apache.avro.Schema
import org.apache.avro.generic.GenericRecord
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.avro.SchemaConverters
import org.apache.spark.sql.functions._

object ScalaSparkAvroConsumer {

  private val topic = "customer.v1"
  private val kafkaUrl = "localhost:9092"
  private val schemaRegistryUrl = "http://127.0.0.1:8081"

  // Fetch the latest value schema for the topic from the Schema Registry
  // and convert it to a Spark SQL schema for use with from_json().
  private val schemaRegistryClient = new CachedSchemaRegistryClient(schemaRegistryUrl, 128)
  private val kafkaAvroDeserializer = new AvroDeserializer(schemaRegistryClient)
  private val avroSchema = schemaRegistryClient.getLatestSchemaMetadata(topic + "-value").getSchema
  private val sparkSchema = SchemaConverters.toSqlType(new Schema.Parser().parse(avroSchema))

  def main(args: Array[String]): Unit = {
    val spark = getSparkSession()
    spark.sparkContext.setLogLevel("ERROR")

    // UDF that decodes the Confluent wire format (magic byte + schema id + Avro payload)
    // into the GenericRecord's JSON string representation.
    spark.udf.register("deserialize", (bytes: Array[Byte]) =>
      DeserializerWrapper.deserializer.deserialize(bytes)
    )

    val df = spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", kafkaUrl)
      .option("subscribe", topic)
      .option("startingOffsets", "earliest")
      .load()

    val valueDataFrame = df.selectExpr("""deserialize(value) AS message""")

    // Parse the JSON string back into typed columns using the converted schema.
    val formattedDataFrame = valueDataFrame.select(
        from_json(col("message"), sparkSchema.dataType).alias("parsed_value"))
      .select("parsed_value.*")

    formattedDataFrame
      .writeStream
      .format("console")
      .option("truncate", false)
      .start()
      .awaitTermination()
  }

  // Helper that was assumed but not shown in the snippet; a minimal
  // local-mode session for testing.
  private def getSparkSession(): SparkSession =
    SparkSession.builder()
      .appName("ScalaSparkAvroConsumer")
      .master("local[*]")
      .getOrCreate()

  object DeserializerWrapper {
    val deserializer = kafkaAvroDeserializer
  }

  class AvroDeserializer extends AbstractKafkaAvroDeserializer {
    def this(client: SchemaRegistryClient) {
      this()
      this.schemaRegistry = client
    }

    override def deserialize(bytes: Array[Byte]): String = {
      val genericRecord = super.deserialize(bytes).asInstanceOf[GenericRecord]
      genericRecord.toString
    }
  }
}
Getting the output as below:
-------------------------------------------
Batch: 0
-------------------------------------------
+------+-------+
|header|control|
+------+-------+
|null |null |
|null |null |
|null |null |
|null |null |
+------+-------+
only showing top 20 rows
Integrating Avro serialization, the Kafka Schema Registry, and Spark Streaming with from_confluent_avro() will make your life easier. You can find it here:
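For reference, from_confluent_avro() comes from the ABRiS library (za.co.absa.abris). Below is a minimal sketch of how it could replace the UDF-plus-from_json round trip from the question, assuming the ABRiS 3.x API (the SchemaManager config keys and the from_confluent_avro function are from that version) and the same topic and registry URL as above:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import za.co.absa.abris.avro.functions.from_confluent_avro
import za.co.absa.abris.avro.read.confluent.SchemaManager

object AbrisConsumerSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("AbrisConsumerSketch")
      .master("local[*]")
      .getOrCreate()

    // Tell ABRiS where the registry is and how the schema subject is named
    // ("customer.v1-value" under the topic-name strategy), reading the
    // latest registered version of the value schema.
    val registryConfig = Map(
      SchemaManager.PARAM_SCHEMA_REGISTRY_URL -> "http://127.0.0.1:8081",
      SchemaManager.PARAM_SCHEMA_REGISTRY_TOPIC -> "customer.v1",
      SchemaManager.PARAM_VALUE_SCHEMA_NAMING_STRATEGY ->
        SchemaManager.SchemaStorageNamingStrategies.TOPIC_NAME,
      SchemaManager.PARAM_VALUE_SCHEMA_ID -> "latest"
    )

    val df = spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "customer.v1")
      .option("startingOffsets", "earliest")
      .load()

    // from_confluent_avro strips the Confluent wire-format header and decodes
    // the Avro payload directly into a struct column, with no string round trip.
    df.select(from_confluent_avro(col("value"), registryConfig).as("data"))
      .select("data.*")
      .writeStream
      .format("console")
      .option("truncate", "false")
      .start()
      .awaitTermination()
  }
}

Because ABRiS resolves the writer schema from the registry, the decoded struct should not silently turn into nulls the way a mismatched from_json parse can.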