
Spark Structured Streaming watermark error

Following up on this question:

I have JSON streaming data in the same format as below:

| A   | B                                      |
|-----|----------------------------------------|
| ABC | [{C:1, D:1}, {C:2, D:4}]               |
| XYZ | [{C:3, D:6}, {C:9, D:11}, {C:5, D:12}] |

As I understand it, watermarking is required only when you perform a window operation on event time. Spark uses the watermark to handle late data, and for that purpose it has to keep the state of older aggregations.

The following link explains this very well, with an example: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#handling-late-data-and-watermarking
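To make the late-data handling described above concrete, here is a minimal plain-Python sketch. It is a toy model, not Spark code: it assumes the watermark is the largest event time seen in earlier batches minus a fixed delay, and it treats any event older than that watermark as late. The function name `split_late_events` and the list-of-batches layout are made up for illustration, and Spark's real bookkeeping is more involved.

```python
from datetime import datetime, timedelta


def split_late_events(batches, delay):
    """Yield (on_time, late) lists of event times for each batch.

    Toy model only: watermark = max event time of earlier batches - delay.
    """
    max_seen = None  # largest event time observed in earlier batches
    for batch in batches:
        watermark = None if max_seen is None else max_seen - delay
        on_time, late = [], []
        for ts in batch:
            if watermark is not None and ts < watermark:
                late.append(ts)      # older than the watermark: treated as late
            else:
                on_time.append(ts)   # still accepted
        yield on_time, late
        # Only after the batch is handled does the maximum (and thus the
        # watermark used for the next batch) move forward.
        for ts in batch:
            if max_seen is None or ts > max_seen:
                max_seen = ts
```

With a 10-minute delay, an event stamped 12:05 that arrives after a 12:20 event has already been seen falls behind the 12:10 watermark and ends up in the late list.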

I don't see any window operations in your transformation. If that is the case, I think you can try running the streaming query without a watermark.

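When you group in Spark Structured Streaming, the DataFrame must already have a watermark, and the grouping has to take it into account by including the watermarked window in your aggregation: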

    df.groupBy(col("dummy"), window(col("event_time"), "1 day"))
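As a rough picture of what grouping on both `dummy` and a one-day window means, the plain-Python sketch below (again a toy model, not Spark code) buckets rows into one-day tumbling windows aligned to the Unix epoch and counts them per `(dummy, window)` pair. The helper names `window_start`, `count_by_key_and_window` and `finalized` are hypothetical. The `finalized` helper assumes a group can no longer change once the watermark has reached the end of its window.

```python
from collections import Counter
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def window_start(ts, size):
    """Start of the tumbling window of length `size` that contains `ts`."""
    return ts - (ts - EPOCH) % size


def count_by_key_and_window(rows, size=timedelta(days=1)):
    """Count rows per (dummy, window_start, window_end) group."""
    counts = Counter()
    for dummy, event_time in rows:
        start = window_start(event_time, size)
        counts[(dummy, start, start + size)] += 1
    return dict(counts)


def finalized(counts, watermark):
    """Groups whose window end is at or before the watermark (toy rule)."""
    return {key: n for key, n in counts.items() if key[2] <= watermark}
```

Two rows with the same `dummy` value but event times on different days land in different groups, which is why the window is part of the grouping key.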
