
Grouping sensor data over time with Spark Structured Streaming

We have sensors that start and run for a random duration multiple times a day. The data from the sensors is sent to a Kafka topic, consumed by the Spark Structured Streaming API, and stored in a Delta Lake table. Now we have to identify the sessions for each sensor and store them in a different Delta Lake table partitioned by device_id and sensor_id.

I tried Spark Structured Streaming with watermarking, but it didn't do much good.

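from pyspark.sql import functions as F

# Wrapped in parentheses so the chained calls can span several lines.
stream2 = (
    spark.readStream.format('delta')
    .load('<FIRST_DELTA_LAKE_TABLE>')
    .select('device_id', 'json', 'time')
    # Watermark on 'time', the only event-time column kept by the select above.
    .withWatermark('time', '10 minutes')
    .groupBy('device_id')
    .agg(F.min('time').alias('min_time'), F.max('time').alias('max_time'))
    .writeStream
    .format('delta')
)
stream2.start('<SESSIONS_TABLE>')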

The idea was to have a second table that identifies the sessions in the incoming data and saves the start time and end time for each session and device. The streaming job runs, but nothing gets written to the sessions Delta table.

Any help on this will be appreciated.

When you write a stream, it uses append mode by default (see the doc). In this mode, when you're using watermarks, data is output only after the watermark has been crossed, so there will be a delay of at least 10 minutes before you start to see data in the output.

But I think the primary problem is that you're running a "global" aggregation, without a defined window or anything similar. For session detection, people usually use flatMapGroupsWithState, as described in the following blog post.
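To make "session detection" more concrete, here is a minimal plain-Python sketch of the per-device logic that a stateful operator would have to apply: events are ordered by time, and a new session starts whenever the gap since the previous event exceeds a threshold. This is not Spark code. The function name `detect_sessions`, the 10-minute gap, and the `(device_id, time)` tuple layout are all assumptions made for illustration, not something taken from the question.

```python
from datetime import datetime, timedelta
from itertools import groupby
from operator import itemgetter

# Assumption: a silence longer than this ends a session.
SESSION_GAP = timedelta(minutes=10)


def detect_sessions(events, gap=SESSION_GAP):
    """Group (device_id, time) pairs into per-device sessions.

    Returns a list of (device_id, min_time, max_time) tuples, one per session.
    """
    sessions = []
    # Sort by device, then by time, so each device's events are contiguous.
    ordered = sorted(events, key=itemgetter(0, 1))
    for device_id, group in groupby(ordered, key=itemgetter(0)):
        start = end = None
        for _, ts in group:
            if start is None:
                # First event of this device opens its first session.
                start = end = ts
            elif ts - end > gap:
                # Gap exceeded: close the current session and open a new one.
                sessions.append((device_id, start, end))
                start = end = ts
            else:
                # Still inside the current session: extend its end time.
                end = ts
        sessions.append((device_id, start, end))
    return sessions


# Hypothetical sample data: two bursts for dev-1, one event for dev-2.
t0 = datetime(2024, 1, 1, 8, 0)
events = [
    ("dev-1", t0),
    ("dev-1", t0 + timedelta(minutes=3)),
    ("dev-1", t0 + timedelta(minutes=30)),
    ("dev-2", t0 + timedelta(minutes=1)),
]
sessions = detect_sessions(events)
```

With this sample, dev-1 ends up with two sessions because of the 27-minute silence between its second and third events, while a plain `groupBy('device_id')` with `min`/`max` would collapse everything into a single row per device. In a streaming job the same bookkeeping (current session start and last-seen time per key) would have to be kept as state between micro-batches, which is the role the answer assigns to flatMapGroupsWithState.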
