
Spark structured streaming scala + confluent schema registry (json schema)

I have a Spark Structured Streaming Scala job which reads JSON messages from Kafka and writes the data to S3. I have a Confluent Schema Registry configured, and the schema is in JSON format with type=object. I am able to retrieve the schema from the registry, but I need to apply this schema to the DataFrame containing the records from Kafka.

import io.confluent.kafka.schemaregistry.client.rest.RestService

val restService = new RestService(schemaRegistryURL)
val valueRestResponseSchema = restService.getLatestVersion(schemaName) // return type is io.confluent.kafka.schemaregistry.client.rest.entities.Schema

Now I want to use valueRestResponseSchema in the code below. How do I convert valueRestResponseSchema to a StructType so it can be applied in from_json?

val values: DataFrame = df.selectExpr("CAST(value AS STRING) as data").select(from_json(col("data"), valueRestResponseSchema).as("data"))

Are there any JSON schema converters available to use? Something similar to the post below, but for JSON: Integrating Spark Structured Streaming with the Confluent Schema Registry

convert the valueRestResponseSchema to structtype

You can get the raw JSON schema from that object, but you'll need to manually convert it into a Spark StructType if you cannot find a JSON-schema SparkSQL library on your own, since Spark doesn't offer that out of the box like it does for Avro.
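As a minimal sketch of that manual conversion, assuming a flat type=object JSON Schema whose properties have already been parsed out with a JSON library (nested objects, arrays, and the parsing itself are omitted), the type mapping could look roughly like this:

```scala
// Map a JSON Schema primitive type to a Spark SQL type name.
// The choice of BIGINT for "integer" and DOUBLE for "number" is an assumption;
// adjust it to your data.
def jsonTypeToSparkType(jsonType: String): String = jsonType match {
  case "string"  => "STRING"
  case "integer" => "BIGINT"
  case "number"  => "DOUBLE"
  case "boolean" => "BOOLEAN"
  case other     => sys.error(s"unsupported JSON Schema type: $other")
}

// Build a DDL schema string ("name STRING, age BIGINT") from the parsed
// "properties" section of the JSON Schema.
def propertiesToDdl(props: Seq[(String, String)]): String =
  props.map { case (name, t) => s"$name ${jsonTypeToSparkType(t)}" }.mkString(", ")

val ddl = propertiesToDdl(Seq("id" -> "integer", "name" -> "string", "price" -> "number"))
// "id BIGINT, name STRING, price DOUBLE"
```

A DDL string like this can be handed to from_json directly (it has an overload taking a schema string plus an options map), or turned into a StructType with org.apache.spark.sql.types.DataType.fromDDL(ddl).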

The schema isn't required, by the way. You can use get_json_object with JSONPath expressions against a string.

However, you'll need to use the substring SparkSQL function to remove the first 5 bytes of the value before you can parse the raw JSON value.
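To make the 5-byte offset concrete: Confluent serializers prefix each record with one magic byte followed by a 4-byte big-endian schema ID, so the payload starts at offset 5. A small sketch of that framing in plain Scala (no Spark needed to follow the logic):

```scala
import java.nio.ByteBuffer

// Confluent wire format: [magic byte][4-byte schema id][payload]

// Read the schema id from bytes 1..4.
def schemaId(message: Array[Byte]): Int =
  ByteBuffer.wrap(message, 1, 4).getInt

// Drop the 5-byte header, leaving the raw JSON payload.
def stripConfluentHeader(message: Array[Byte]): Array[Byte] =
  message.drop(5)

// In SparkSQL the same trim is a 1-based substring on the value column, e.g.
//   df.selectExpr("substring(CAST(value AS STRING), 6) AS data")
```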

reads json messages from the kafka and writes the data to the S3

Or you can use the Confluent S3 Sink connector instead.
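For the connector route, a minimal configuration sketch might look like the following (topic, bucket, region, and flush size are placeholders; property names follow the Confluent S3 sink connector's configuration reference):

```json
{
  "name": "s3-sink",
  "config": {
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "topics": "my-topic",
    "s3.bucket.name": "my-bucket",
    "s3.region": "us-east-1",
    "storage.class": "io.confluent.connect.s3.storage.S3Storage",
    "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
    "flush.size": "1000"
  }
}
```

This skips Spark entirely for the Kafka-to-S3 path, at the cost of running a Kafka Connect cluster.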
