
How to read data from a Kafka topic with different schemas (some optional objects) in Structured Streaming

I have data coming into a Kafka topic which has an optional object, and since it's optional I am missing those records when reading with a defined schema.

Example:

The schema I have:

import org.apache.spark.sql.types._

val schema = new StructType()
  .add("obj_1", new ArrayType(
    new StructType(Array(
      StructField("field1", StringType),
      StructField("field2", StringType),
      StructField("obj_2", new ArrayType(
        new StructType(Array(
          StructField("field3", StringType),
          StructField("field4", LongType),
          StructField("obj_3", new ArrayType(
            new StructType(Array(
              StructField("field5", StringType),
              StructField("field6", StringType)
            )),
            containsNull = true
          ))
        )),
        containsNull = true
      )),
      StructField("field7", StringType),
      StructField("field8", StringType)
    )),
    containsNull = true
  ))
   

When publishing data to this topic we sometimes do not send obj_3, based on some conditions.

So when reading the topic and mapping it to the above schema, we are missing the records that do not have obj_3 and only get the records where obj_3 is present.

How do we read the data which will sometimes not have obj_3?
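For illustration, the two shapes of the value payload would look like this (the field values here are made up; only the structure, which follows the schema above, matters):

    {"obj_1": [{"field1": "a", "field2": "b",
                "obj_2": [{"field3": "c", "field4": 1,
                           "obj_3": [{"field5": "d", "field6": "e"}]}],
                "field7": "f", "field8": "g"}]}

    {"obj_1": [{"field1": "a", "field2": "b",
                "obj_2": [{"field3": "c", "field4": 1}],
                "field7": "f", "field8": "g"}]}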

Sample code:

 import org.apache.spark.sql.functions._
 import spark.implicits._  // needed for .as[String]

 val df = spark.readStream
   .format("kafka")
   .option("kafka.bootstrap.servers", "bootstrap.servers")
   .option("subscribe", "topic.name")
   .option("startingOffsets", "offset.reset")
   .option("failOnDataLoss", "true")
   .load()

 val cast = df.selectExpr("CAST(value AS STRING)")
   .as[String]

val resultedDf = cast.select(from_json(col("value"), schema).as("newDF"))

 val finalDf = resultedDf.select(col("newDF.*"))
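(Not part of the question as asked, but for completeness: the streaming query only runs once a sink is attached. While testing, a console sink is one option:)

 val query = finalDf.writeStream
   .format("console")
   .option("truncate", "false")
   .start()

 query.awaitTermination()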

You could either

  • use a flag (e.g. called "obj3flag" within a JSON structure) in the Key of the Kafka message that tells your Structured Streaming job whether obj_3 exists in the Kafka value, and then choose one schema or the other to parse the JSON string. Something like:
import org.apache.spark.sql.functions._

val schemaWithObj3 = ...
val schemaWithOutObj3 = ...

val cast = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as[(String, String)]

val resultedDf = cast
  .withColumn("obj3_flag", get_json_object(col("key"), "$.obj3flag"))
  .withColumn("data", when(col("obj3_flag") === lit(1),
      from_json(col("value"), schemaWithObj3))
    .otherwise(from_json(col("value"), schemaWithOutObj3)))
  .select(col("data.*"))
  • do a string search on "obj_3" in the Kafka value (cast as string), and if the string is found apply one schema or the other to parse the JSON. The code will look very similar to the one for the other option; a sketch follows below.
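A minimal sketch of the second option, assuming the same two schema placeholders as above, that cast is the value-casted Dataset from the question's sample code, and that the string "obj_3" only ever appears in the payload as that field name (the substring check is a heuristic):

import org.apache.spark.sql.functions._

// schemaWithObj3 / schemaWithOutObj3 are the placeholders from the first option
val parsed = cast
  // heuristic: does the raw JSON string contain the field name "obj_3" at all?
  .withColumn("has_obj3", col("value").contains("\"obj_3\""))
  .withColumn("data", when(col("has_obj3"),
      from_json(col("value"), schemaWithObj3))
    .otherwise(from_json(col("value"), schemaWithOutObj3)))
  .select(col("data.*"))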

Please note that I have written the code on my mobile, so you may find some syntax issues. However, I hope the idea gets across.
