How to read data from a Kafka topic with a varying schema (some optional objects) in Structured Streaming
I have data coming into a Kafka topic that contains an optional object. Because the object is optional, I lose the records that lack it when I read the topic with a fixed schema.
Example:
The schema I have:
import org.apache.spark.sql.types._

val schema = new StructType()
  .add("obj_1", new ArrayType(
    new StructType(
      Array(
        StructField("field1", StringType),
        StructField("field2", StringType),
        StructField("obj_2", new ArrayType(
          new StructType(
            Array(
              StructField("field3", StringType),
              StructField("field4", LongType),
              // obj_3 is the optional object that is sometimes not sent
              StructField("obj_3", new ArrayType(
                new StructType(
                  Array(
                    StructField("field5", StringType),
                    StructField("field6", StringType)
                  )
                ), containsNull = true
              ))
            )
          ), containsNull = true
        )),
        StructField("field7", StringType),
        StructField("field8", StringType)
      )
    ), containsNull = true
  ))
When publishing to this topic, we sometimes omit obj_3, depending on certain conditions.
So when I read the topic and map it to the schema above, the records without obj_3 are missing; only the records where obj_3 is present come through.
How can I read the records that sometimes lack obj_3?
Sample code:
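import org.apache.spark.sql.functions.{col, from_json}
import spark.implicits._

val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "bootstrap.servers")
  .option("subscribe", "topic.name")
  .option("startingOffsets", "offset.reset")
  .option("failOnDataLoss", "true")
  .load()

// Kafka delivers the value as bytes; cast it to a string before parsing
val cast = df.selectExpr("CAST(value AS STRING)")
  .as[String]

// Parse the JSON payload against the schema and flatten the resulting struct
val resultedDf = cast.select(from_json(col("value"), schema).as("newDF"))
val finalDf = resultedDf.select(col("newDF.*"))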
You could parse the value with two schemas and pick one based on a flag carried in the message key:
import org.apache.spark.sql.functions._
import spark.implicits._

val schemaWithObj3 = ...
val schemaWithOutObj3 = ...

val cast = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as[(String, String)]

val resultedDf = cast
  // read the flag from the JSON key; get_json_object returns it as a string
  .withColumn("obj_3_flag", get_json_object(col("key"), "$.obj3flag"))
  .withColumn("data",
    when(col("obj_3_flag") === lit(1), from_json(col("value"), schemaWithObj3))
      .otherwise(from_json(col("value"), schemaWithOutObj3)))
  .select(col("data.*"))
Please note that I wrote this code on my phone, so you may find some syntax issues. However, I hope the idea gets across.
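If the only difference between messages is that obj_3 is sometimes absent, a single schema may already be enough: from_json fills fields that are missing from the JSON with null instead of dropping the row. Below is a minimal sketch to check this locally. It assumes a local SparkSession and uses a cut-down, hypothetical schema; the names OptionalFieldDemo and parse and the sample payloads are mine, not from the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types._

object OptionalFieldDemo {
  // Hypothetical, simplified schema: obj_3 is a nullable array of structs.
  val schema: StructType = new StructType()
    .add("field3", StringType)
    .add("field4", LongType)
    .add("obj_3", ArrayType(
      new StructType()
        .add("field5", StringType)
        .add("field6", StringType),
      containsNull = true))

  // Parse the string column "value" as JSON and flatten the resulting struct.
  def parse(raw: DataFrame): DataFrame =
    raw
      .select(from_json(col("value"), schema).as("data"))
      .select(col("data.*"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("optional-field-demo")
      .getOrCreate()
    import spark.implicits._

    // Sample payloads standing in for Kafka values; the second omits obj_3.
    val raw = Seq(
      """{"field3":"a","field4":1,"obj_3":[{"field5":"x","field6":"y"}]}""",
      """{"field3":"b","field4":2}"""
    ).toDF("value")

    // Both rows should be kept; the second one should carry a null obj_3.
    parse(raw).show(false)

    spark.stop()
  }
}
```

If rows still go missing with a single schema, it may be worth looking at later steps in the pipeline. For example, explode on obj_3 skips rows where the array is null, while explode_outer keeps them.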