
How to read data from a Kafka topic with different schemas (some optional objects) in Structured Streaming

I have data coming into a Kafka topic which has an optional object, and since it's optional I am missing those records when reading with a defined schema.

Example:

The schema I have:

import org.apache.spark.sql.types._

val schema = new StructType()
  .add("obj_1", new ArrayType(
    new StructType(Array(
      StructField("field1", StringType),
      StructField("field2", StringType),
      StructField("obj_2", new ArrayType(
        new StructType(Array(
          StructField("field3", StringType),
          StructField("field4", LongType),
          StructField("obj_3", new ArrayType(
            new StructType(Array(
              StructField("field5", StringType),
              StructField("field6", StringType)
            )),
            containsNull = true
          ))
        )),
        containsNull = true
      )),
      StructField("field7", StringType),
      StructField("field8", StringType)
    )),
    containsNull = true
  ))
   

When publishing data to this topic we sometimes do not send obj_3, based on some conditions.

So when reading the topic and mapping it to the above schema, we are missing the records that do not have obj_3 and only get the records where obj_3 is present.

How do we read the data which will sometimes not have obj_3?
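For illustration, the two shapes of the value payload would look like this (the field values here are made up; only the structure, which follows the schema above, matters):

    {"obj_1": [{"field1": "a", "field2": "b",
                "obj_2": [{"field3": "c", "field4": 1,
                           "obj_3": [{"field5": "d", "field6": "e"}]}],
                "field7": "f", "field8": "g"}]}

    {"obj_1": [{"field1": "a", "field2": "b",
                "obj_2": [{"field3": "c", "field4": 1}],
                "field7": "f", "field8": "g"}]}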

Sample code:

 import org.apache.spark.sql.functions._
 import spark.implicits._  // needed for .as[String]

 val df = spark.readStream
   .format("kafka")
   .option("kafka.bootstrap.servers", "bootstrap.servers")
   .option("subscribe", "topic.name")
   .option("startingOffsets", "offset.reset")
   .option("failOnDataLoss", "true")
   .load()

 val cast = df.selectExpr("CAST(value AS STRING)")
   .as[String]

val resultedDf = cast.select(from_json(col("value"), schema).as("newDF"))

 val finalDf = resultedDf.select(col("newDF.*"))
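(Not part of the question as asked, but for completeness: the streaming query only runs once a sink is attached. While testing, a console sink is one option:)

 val query = finalDf.writeStream
   .format("console")
   .option("truncate", "false")
   .start()

 query.awaitTermination()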

You could either

  • use a flag (e.g. called "obj3flag" within a JSON structure) in the Key of the Kafka message that tells your Structured Streaming job whether obj_3 exists in the Kafka value, and then choose one schema or the other to parse the JSON string. Something like:
import org.apache.spark.sql.functions._

val schemaWithObj3 = ...
val schemaWithOutObj3 = ...

val cast = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as[(String, String)]

val resultedDf = cast
  .withColumn("obj3_flag", get_json_object(col("key"), "$.obj3flag"))
  .withColumn("data", when(col("obj3_flag") === lit(1),
      from_json(col("value"), schemaWithObj3))
    .otherwise(from_json(col("value"), schemaWithOutObj3)))
  .select(col("data.*"))
  • do a string search on "obj_3" in the Kafka value (cast as string), and if the string is found apply one schema or the other to parse the JSON. The code will look very similar to the one for the other option; a sketch follows below.
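A minimal sketch of the second option, assuming the same two schema placeholders as above, that cast is the value-casted Dataset from the question's sample code, and that the string "obj_3" only ever appears in the payload as that field name (the substring check is a heuristic):

import org.apache.spark.sql.functions._

// schemaWithObj3 / schemaWithOutObj3 are the placeholders from the first option
val parsed = cast
  // heuristic: does the raw JSON string contain the field name "obj_3" at all?
  .withColumn("has_obj3", col("value").contains("\"obj_3\""))
  .withColumn("data", when(col("has_obj3"),
      from_json(col("value"), schemaWithObj3))
    .otherwise(from_json(col("value"), schemaWithOutObj3)))
  .select(col("data.*"))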

Please note that I have written the code on my mobile, so you may find some syntax issues. However, I hope the idea gets across.
