Spark structured streaming scala + confluent schema registry (json schema)
I have a Spark Structured Streaming Scala job that reads JSON messages from Kafka and writes the data to S3. I have Confluent Schema Registry configured, and the registered schema is in JSON Schema format with type=object. I am able to retrieve the schema from the registry, but I need to apply it to the DataFrame containing the records from Kafka.
import io.confluent.kafka.schemaregistry.client.rest.RestService

val restService = new RestService(schemaRegistryURL)
// Return type is io.confluent.kafka.schemaregistry.client.rest.entities.Schema
val valueRestResponseSchema = restService.getLatestVersion(schemaName)
Now I want to use valueRestResponseSchema in the code below. How do I convert valueRestResponseSchema to a StructType so it can be applied in from_json?

val values: DataFrame = df.selectExpr("CAST(value AS STRING) as data").select(from_json(col("data"), valueRestResponseSchema).as("data"))
Are there any JSON Schema converters available to use? I'm looking for something similar to the post below, but for JSON: Integrating Spark Structured Streaming with the Confluent Schema Registry
convert the valueRestResponseSchema to structtype
You can get the raw JSON Schema string from that object, but you'll need to convert it into a Spark StructType yourself if you cannot find a JSON Schema-to-SparkSQL library, since Spark doesn't offer such a converter the way it does for Avro.
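A minimal sketch of such a manual conversion, assuming a flat JSON Schema whose properties are all primitive types (the Jackson-based parsing and the type mapping below are illustrative, not a complete JSON Schema implementation; nested objects and arrays would need recursion):

```scala
import com.fasterxml.jackson.databind.ObjectMapper
import org.apache.spark.sql.types._
import scala.jdk.CollectionConverters._

// Map primitive JSON Schema property types to Spark types; extend as needed.
def jsonSchemaToStructType(jsonSchema: String): StructType = {
  val root = new ObjectMapper().readTree(jsonSchema)
  val fields = root.get("properties").fields().asScala.map { entry =>
    val sparkType = entry.getValue.get("type").asText() match {
      case "string"  => StringType
      case "integer" => LongType
      case "number"  => DoubleType
      case "boolean" => BooleanType
      case other     => throw new IllegalArgumentException(s"Unsupported type: $other")
    }
    StructField(entry.getKey, sparkType, nullable = true)
  }
  StructType(fields.toSeq)
}

// getSchema returns the registered schema definition as a string
val structType = jsonSchemaToStructType(valueRestResponseSchema.getSchema)
```

The resulting StructType can then be passed directly as the second argument of from_json.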
The schema isn't required, by the way. You can use get_json_object with JSONPath expressions against a string.
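For example, a schema-free extraction might look like the following (the field names $.id and $.name are hypothetical, standing in for whatever keys your JSON messages actually contain):

```scala
import org.apache.spark.sql.functions.{col, get_json_object}

// Extract individual fields by JSONPath without declaring a full schema.
val extracted = df
  .selectExpr("CAST(value AS STRING) AS data")
  .select(
    get_json_object(col("data"), "$.id").as("id"),
    get_json_object(col("data"), "$.name").as("name")
  )
```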
However, you'll need to use the substring SparkSQL function to remove the first 5 bytes of the value (the Schema Registry wire format's magic byte plus the 4-byte schema ID) before being able to parse the raw JSON value.
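A sketch of that stripping step, assuming a StructType named structType has already been built from the registry schema (Spark SQL's substring is 1-indexed, so position 6 skips the 5-byte header):

```scala
import org.apache.spark.sql.functions.{col, expr, from_json}

// Skip the 5-byte wire-format header (magic byte + 4-byte schema ID),
// cast the remainder to a string, then parse it as JSON.
val parsed = df
  .select(expr("substring(value, 6, length(value) - 5)").cast("string").as("data"))
  .select(from_json(col("data"), structType).as("data"))
```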
reads json messages from the kafka and writes the data to the S3
Or you can use the Confluent S3 Sink connector instead, which writes Kafka topics to S3 without a Spark job at all.
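For reference, a minimal S3 sink connector configuration might look like the following (the connector name, topic, bucket, region, and flush size are placeholders to adapt to your setup):

```json
{
  "name": "s3-sink",
  "config": {
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "topics": "my-topic",
    "s3.bucket.name": "my-bucket",
    "s3.region": "us-east-1",
    "storage.class": "io.confluent.connect.s3.storage.S3Storage",
    "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
    "flush.size": "1000",
    "tasks.max": "1"
  }
}
```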