Handle Too Late data in Spark Streaming
A watermark allows late-arriving data to be considered for inclusion in already-computed results for a period of time, using windows. Its premise is that Spark tracks a point in time before which it assumes no more late events should arrive; if any do arrive after that point, they are nonetheless discarded.
Is there a way to store the discarded data so it can be used later for reconciliation? Say that in my Structured Streaming job I set the watermark to 1 hour, compute a windowed aggregation every 10 minutes, and receive an event that arrives 20 minutes late. Is there a way to store the discarded data in a different location rather than simply dropping it?
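To make the watermark semantics in the question concrete, here is a minimal plain-Python sketch of the bookkeeping described above. It is conceptual, not Spark's actual implementation, and all names (`WatermarkTracker`, `accept`) are invented for illustration: the watermark trails the maximum event time seen so far by a fixed delay, and events older than the watermark are treated as too late.

```python
from datetime import datetime, timedelta

class WatermarkTracker:
    """Conceptual sketch of watermark bookkeeping (not Spark's internals)."""

    def __init__(self, delay: timedelta):
        self.delay = delay
        self.max_event_time = None  # latest event time observed so far

    def accept(self, event_time: datetime) -> bool:
        # Advance the tracked maximum event time.
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time
        # The watermark trails the max event time by the configured delay;
        # anything older than it is considered too late.
        watermark = self.max_event_time - self.delay
        return event_time >= watermark

# Mirror the question's setup: a 1-hour watermark delay.
tracker = WatermarkTracker(delay=timedelta(hours=1))
base = datetime(2024, 1, 1, 12, 0)

kept, discarded = [], []
for offset_min in [0, 10, 70, -20]:
    t = base + timedelta(minutes=offset_min)
    (kept if tracker.accept(t) else discarded).append(t)

# Once an event at 13:10 has been seen, the watermark sits at 12:10,
# so a subsequent event stamped 11:40 (90 min behind the max) is discarded;
# a 20-minute-late event would still fall inside the 1-hour delay.
```

In this sketch the discarded events are collected into a separate list purely to show which ones fall behind the watermark; Spark's built-in windowed aggregation does not expose such a hook.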
No, there is no built-in way to achieve this.