Late outputs missing for Flink's Session Window

In my pipeline setup I cannot see any side output for a Session Window. I'm using Flink 1.9.1.

Version 1. What I have is this:

messageStream
    .keyBy(tradeKeySelector)
    .window(ProcessingTimeSessionWindows.withDynamicGap(new TradeAggregationGapExtractor()))
    .sideOutputLateData(lateTradeMessages)
    .process(new CumulativeTransactionOperator())
    .name("Aggregate Transaction Builder");

TradeAggregationGapExtractor implements SessionWindowTimeGapExtractor and returns 5 seconds. lateTradeMessages is the OutputTag passed to sideOutputLateData.

Further, I have this:

messageStream.getSideOutput(lateTradeMessages)
    .keyBy(tradeKeySelector)
    .process(new KeyedProcessFunction<Long, EnrichedMessage, Transaction>() {
        @Override
        public void processElement(EnrichedMessage value, Context ctx, Collector<Transaction> out) throws Exception {
            System.out.println("Process Late messages For Aggregation");
            out.collect(new Transaction());
        }
    })
    .name("Process Late messages For Aggregation");

The problem is that I never see "Process Late messages For Aggregation" when I send messages with the same key that should miss the window time.

When the Session Window has passed and I "immediately" send a new message for the same key, it triggers a new Session Window instead of going to the late side output.

I'm not sure what I'm doing wrong here.

What I would like to achieve is to catch "late events" and try to reprocess them.

I would appreciate any help.


Version 2, after @Dominik Wosiński's comment:

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setRestartStrategy(RestartStrategies.fixedDelayRestart(1000, 1000));
env.setParallelism(1);
env.disableOperatorChaining();
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
env.getConfig().setAutoWatermarkInterval(1000);


DataStream<RawMessage> rawBusinessTransaction = env
    .addSource(new FlinkKafkaConsumer<>("business",
        new JSONKeyValueDeserializationSchema(false), properties))
    .map(new KafkaTransactionObjectMapOperator())
    .assignTimestampsAndWatermarks(new AssignerWithPeriodicWatermarks<RawMessage>() {

        @Nullable
        @Override
        public Watermark getCurrentWatermark() {
            // Watermark follows the wall clock so that it keeps advancing while the topic is idle
            return new Watermark(System.currentTimeMillis());
        }

        @Override
        public long extractTimestamp(RawMessage element, long previousElementTimestamp) {
            // Event time is the creation time carried by the message
            return element.messageCreationTime;
        }
    })
    .name("Kafka Transaction Raw Data Source.");

messageStream
    .keyBy(tradeKeySelector)
    .window(EventTimeSessionWindows.withDynamicGap(new TradeAggregationGapExtractor()))
    .sideOutputLateData(lateTradeMessages)
    .process(new CumulativeTransactionOperator())
    .name("Aggregate Transaction Builder");

Watermarks are progressing; I've checked this in Flink's metrics. The window operator is executing, but there are still no late outputs.

BTW, the Kafka topic can be idle, so I have to emit new watermarks periodically.


You are using ProcessingTime in your case, which means that the system time is used to measure the flow of time in the DataStream.

For each event, the timestamp assigned to it is the moment the data is received in your Flink pipeline. This means that events can never be out of order in processing time. Because of that, you will never have late elements for your windows.
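As a rough illustration of why that happens, here is a dependency-free sketch (not Flink code) that groups arrival times into processing-time sessions. The class and method names are hypothetical, and the 5 second gap is taken from the question. Because arrival times only move forward, a message that arrives after the gap simply opens a new session; there is nothing it could be late for.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Flink: models processing-time session grouping.
class ProcessingTimeSessionDemo {

    /**
     * Groups arrival times (milliseconds, in arrival order) into sessions.
     * A new session starts whenever the pause since the previous arrival
     * is at least gapMillis.
     */
    static List<List<Long>> sessionize(long[] arrivalTimes, long gapMillis) {
        List<List<Long>> sessions = new ArrayList<>();
        List<Long> current = null;
        long lastArrival = 0L;
        for (long t : arrivalTimes) {
            if (current == null || t - lastArrival >= gapMillis) {
                // The previous session has already closed: open a new one.
                current = new ArrayList<>();
                sessions.add(current);
            }
            current.add(t);
            lastArrival = t;
        }
        return sessions;
    }

    public static void main(String[] args) {
        // Arrivals at 0 ms, 2000 ms and 9000 ms with a 5000 ms gap.
        System.out.println(sessionize(new long[] {0L, 2000L, 9000L}, 5000L));
        // prints [[0, 2000], [9000]]
    }
}
```

The third message arrives well after the gap and ends up in its own session, which matches the behaviour described in the question: a new Session Window instead of a late side output.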

If you switch to EventTime, then for suitable input data you should be able to see the late elements being passed to the side output.

You should probably take a look at the documentation, where the various notions of time in Flink are explained.

The watermark approach looks very suspicious to me. Usually, you would emit the latest event timestamp at this point.
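One way to read that suggestion is the sketch below, which derives the watermark from the largest event timestamp seen so far minus a fixed out-of-orderness bound. It is plain Java without Flink dependencies; `EventTimeWatermarkTracker` and its methods are hypothetical names, and the 1 second bound is an assumption. Flink ships a similar helper, `BoundedOutOfOrdernessTimestampExtractor`, which may be a better starting point in a real job.

```java
// Hypothetical helper, not part of Flink: tracks a watermark based on event time.
class EventTimeWatermarkTracker {
    private final long maxOutOfOrdernessMillis;
    private long maxSeenTimestamp;

    EventTimeWatermarkTracker(long maxOutOfOrdernessMillis) {
        this.maxOutOfOrdernessMillis = maxOutOfOrdernessMillis;
        // Start here so the first watermark is Long.MIN_VALUE instead of underflowing.
        this.maxSeenTimestamp = Long.MIN_VALUE + maxOutOfOrdernessMillis;
    }

    /** Records the event's timestamp and returns it unchanged. */
    long onEvent(long eventTimestamp) {
        maxSeenTimestamp = Math.max(maxSeenTimestamp, eventTimestamp);
        return eventTimestamp;
    }

    /** Watermark = largest timestamp seen so far minus the out-of-orderness bound. */
    long currentWatermark() {
        return maxSeenTimestamp - maxOutOfOrdernessMillis;
    }

    public static void main(String[] args) {
        EventTimeWatermarkTracker tracker = new EventTimeWatermarkTracker(1000L); // assumed bound
        tracker.onEvent(5000L);
        System.out.println(tracker.currentWatermark()); // prints 4000
        tracker.onEvent(3000L); // an older event does not move the watermark back
        System.out.println(tracker.currentWatermark()); // prints 4000
        tracker.onEvent(9000L);
        System.out.println(tracker.currentWatermark()); // prints 8000
    }
}
```

Note that a watermark derived purely from events stops advancing while the topic is idle, which is the concern raised in the question, so an idle source would still need separate handling.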

Here is some background information to make this easier to understand.

A late event is one that arrives after the watermark has already advanced past the event's timestamp. Consider the following example:

event1 @time 1
event2 @time 2
watermark1 @time 3
event3 @time 1 <-- late event
event4 @time 4
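The sequence above can be replayed with a small simulation. This is a simplified sketch, not Flink code: it flags an event as late when its timestamp is at or below the most recent watermark, whereas Flink's window operator makes the decision per window and also takes allowed lateness into account. `LateEventDemo` and its members are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical simulation, not part of Flink.
class LateEventDemo {

    /** A stream item: either an event or a watermark, both carrying a timestamp. */
    static final class Item {
        final String name;
        final long timestamp;
        final boolean isWatermark;

        private Item(String name, long timestamp, boolean isWatermark) {
            this.name = name;
            this.timestamp = timestamp;
            this.isWatermark = isWatermark;
        }

        static Item event(String name, long timestamp) {
            return new Item(name, timestamp, false);
        }

        static Item watermark(long timestamp) {
            return new Item("watermark", timestamp, true);
        }
    }

    /** Returns the names of events whose timestamp is at or below the current watermark. */
    static List<String> findLateEvents(List<Item> stream) {
        long currentWatermark = Long.MIN_VALUE;
        List<String> late = new ArrayList<>();
        for (Item item : stream) {
            if (item.isWatermark) {
                // Watermarks never move backwards.
                currentWatermark = Math.max(currentWatermark, item.timestamp);
            } else if (item.timestamp <= currentWatermark) {
                late.add(item.name);
            }
        }
        return late;
    }

    public static void main(String[] args) {
        List<Item> stream = Arrays.asList(
            Item.event("event1", 1),
            Item.event("event2", 2),
            Item.watermark(3),
            Item.event("event3", 1),
            Item.event("event4", 4));
        System.out.println(findLateEvents(stream)); // prints [event3]
    }
}
```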

Your watermark approach would render pretty much all past events late (with a bit of tolerance because of the 1 s watermark interval). It would also make reprocessing and catch-up impossible.
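To make this concrete, the sketch below applies the lateness rule used by Flink's window operator, as far as I can tell (the window's last timestamp plus the allowed lateness is at or below the watermark), to a session window that contains a single event. It ignores merging with existing sessions, uses a fixed number in place of `System.currentTimeMillis()`, and all names are hypothetical.

```java
// Hypothetical helper, not part of Flink: window-level lateness for a one-event session.
class SessionWindowLateness {

    /**
     * A session window created for a single event covers
     * [eventTimestamp, eventTimestamp + gapMillis), so its last timestamp
     * is eventTimestamp + gapMillis - 1. The window counts as late once
     * that timestamp plus the allowed lateness is at or below the watermark.
     */
    static boolean isLate(long eventTimestamp, long gapMillis,
                          long allowedLatenessMillis, long watermark) {
        long windowMaxTimestamp = eventTimestamp + gapMillis - 1;
        return windowMaxTimestamp + allowedLatenessMillis <= watermark;
    }

    public static void main(String[] args) {
        long now = 1_000_000L; // stands in for the wall-clock watermark
        long gap = 5_000L;     // the 5 second gap from the question

        // Event created 1 s ago: its session window is still open.
        System.out.println(isLate(now - 1_000L, gap, 0L, now));  // prints false

        // Event created 10 s ago (for example during a replay): already late.
        System.out.println(isLate(now - 10_000L, gap, 0L, now)); // prints true
    }
}
```

Under these assumptions, any event whose creation time is more than one gap behind the wall clock would count as late, which is why replaying historical data with such a watermark is problematic.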

However, you are actually not seeing any late events, which is even more surprising to me. Can you double-check your watermark approach, describe your use case, and provide example data? Often the implementation is not ideal for the actual use case, and the problem should be solved in a different way.
