Aggregate data from a Kafka topic and write it to a new topic
val jsonSchema = StructType(Array(
StructField("event_type", StringType),
StructField("category", StringType),
StructField("item_id", StringType),
StructField("item_price", IntegerType),
StructField("uid", StringType),
StructField("timestamp", LongType)
))
I have a Kafka topic whose values are JSON matching the schema above. I need to aggregate the data into hourly buckets, starting from the earliest timestamp I have, and write the results to another Kafka topic. I know I need to use a window with update output mode, but I don't understand how to do it the right way.
I assume I should read the stream like this:
val newData = spark
.readStream
.format("kafka")
.options(kafkaParams)
.load
.select(from_json($"value".cast("string"), jsonSchema).alias("value"))
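For completeness, `kafkaParams` above is assumed to be a map of Kafka source options; a minimal sketch (the broker address and topic name below are placeholders, not from the original post):

```scala
// Hypothetical source options for the reader above;
// broker address and topic name are placeholders.
val kafkaParams = Map(
  "kafka.bootstrap.servers" -> "localhost:9092", // Kafka broker(s)
  "subscribe"               -> "events",         // source topic
  "startingOffsets"         -> "earliest"        // start from the earliest available offset
)
```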
But I don't really understand how to transform this into new JSON values like the following:
{"start_ts":1577865600,"end_ts":1577869200,"revenue":sum of item_price,"visitors":count of uids},
{"start_ts":1577869200,"end_ts":1577872800,"revenue":sum of item_price,"visitors":count of uids},
...
You can do it like this:
val parsedDf = newData
  .select('value.cast("string"))
  .withColumn("value", from_json(col("value"), jsonSchema)) // parse the JSON payload
  .select(col("value.*"))                                   // flatten the struct into top-level columns
  .withColumn("timestamp", from_unixtime($"timestamp" / 1000).cast(TimestampType)) // epoch millis -> timestamp

val uData = parsedDf
  .withWatermark("timestamp", "60 minutes")                 // tolerate up to 60 minutes of late data
  .na.fill("undefined")
  .withColumn("uid", when($"uid" === "undefined", 0).otherwise(1)) // 1 per row with a known uid
  .groupBy(window($"timestamp", "60 minutes"))              // hourly tumbling window
  .agg(sum("item_price").as("revenue"), sum("uid").as("visitors"))
  .withColumn("start_ts", unix_timestamp($"window.start"))
  .withColumn("end_ts", unix_timestamp($"window.end"))
  .withColumn("value", to_json(struct($"start_ts", $"end_ts", $"revenue", $"visitors"))) // Kafka message body
  .drop("window", "revenue", "visitors", "start_ts", "end_ts")
  .writeStream
  .outputMode("update")                                     // re-emit windows as their aggregates update
  .format("kafka")
  .trigger(Trigger.ProcessingTime("10 seconds"))
  .option("kafka.bootstrap.servers", s"$server")
  .option("checkpointLocation", s"$checkpoint")
  .option("topic", s"$kafka_topic")
  .start()
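Before wiring this to Kafka, you can sanity-check the same hourly aggregation on a static DataFrame. This is a sketch assuming an active `SparkSession` with its implicits in scope; the sample rows are made up:

```scala
import org.apache.spark.sql.functions._
import spark.implicits._ // assumes `spark` is your SparkSession

// Batch version of the hourly aggregation, on fabricated sample data
val sample = Seq(
  (1577865700L, "i1", 300, "u1"),
  (1577865900L, "i2", 200, "u2"),
  (1577869300L, "i3", 500, "u3")
).toDF("timestamp", "item_id", "item_price", "uid")
  .withColumn("timestamp", from_unixtime($"timestamp").cast("timestamp"))

sample
  .groupBy(window($"timestamp", "60 minutes"))
  .agg(sum("item_price").as("revenue"), count("uid").as("visitors"))
  .show(false)
```

Once the batch output looks right, the identical `groupBy`/`agg` chain can be dropped into the streaming query above.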