

Handle Too Late data in Spark Streaming

A watermark allows late-arriving data to be considered for inclusion in already-computed windowed results for a limited period of time. Its premise is that it tracks a point in time before which no more late events are assumed to arrive; if any events do arrive behind that point, they are nonetheless discarded.

Is there a way to store the discarded data so that it can be used for reconciliation later? Say in my Structured Streaming job I set the watermark to 1 hour. I am doing a window operation every 10 minutes and receive an event 20 minutes late. Is there a way I can store the discarded data at a different location rather than just discarding it?
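The watermark semantics described above can be sketched with a minimal, self-contained simulation. This is plain Python standing in for Spark's internal bookkeeping, not Spark's actual implementation; the 1-hour watermark and the 20-minutes-late event are taken from the question (note that a 20-minute-late event is still within a 1-hour watermark and would be kept, while an event later than the watermark is silently dropped):

```python
from datetime import datetime, timedelta

# Hypothetical sketch of watermark-based late-event filtering,
# mirroring Structured Streaming's semantics, not its code.
WATERMARK = timedelta(hours=1)

def filter_late_events(events):
    """Partition (event_time, payload) pairs into (kept, discarded).

    The watermark trails the maximum event time seen so far by WATERMARK;
    any event whose timestamp falls behind that threshold is discarded.
    """
    max_event_time = datetime.min
    kept, discarded = [], []
    for ts, payload in events:
        max_event_time = max(max_event_time, ts)
        watermark = max_event_time - WATERMARK
        if ts >= watermark:
            kept.append((ts, payload))
        else:
            discarded.append((ts, payload))
    return kept, discarded

base = datetime(2020, 1, 1, 12, 0)
events = [
    (base, "on time"),
    (base - timedelta(minutes=20), "20 min late: within the 1 h watermark, kept"),
    (base - timedelta(hours=2), "2 h late: behind the watermark, discarded"),
]
kept, discarded = filter_late_events(events)
```

In real Structured Streaming, the `discarded` side of this split is exactly what is not exposed: events behind the watermark are dropped inside the stateful operator with no side output.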

No, there is no way to achieve this.

