
How to drain the window after a Flink join using coGroup()?

I'd like to join data coming in from two Kafka topics ("left" and "right").

Matching records are to be joined using an ID, but if a "left" or a "right" record is missing, the other one should be passed downstream after a certain timeout. Therefore I have chosen to use the coGroup function.

This works, but there is one problem: if there is no message at all, there is always at least one record that stays in an internal buffer for good. It gets pushed out when new messages arrive; otherwise it is stuck.

The expected behaviour is that all records should be pushed out after the configured idle timeout has been reached.

Some information that might be relevant:

  • Flink 1.14.4
  • The Flink parallelism is set to 8, as is the number of partitions in both Kafka topics.
  • Flink checkpointing is enabled.
  • Event-time processing is to be used.
  • Lombok is used, so val is like final var.

Some code snippets:

Relevant join settings

public static final int AUTO_WATERMARK_INTERVAL_MS = 500;

public static final Duration SOURCE_MAX_OUT_OF_ORDERNESS = Duration.ofMillis(4000);
public static final Duration SOURCE_IDLE_TIMEOUT = Duration.ofMillis(1000);

public static final Duration TRANSFORMATION_MAX_OUT_OF_ORDERNESS = Duration.ofMillis(5000);
public static final Duration TRANSFORMATION_IDLE_TIMEOUT = Duration.ofMillis(1000);

public static final Time JOIN_WINDOW_SIZE = Time.milliseconds(1500);
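
For orientation, here is a sketch of how these constants would be applied to the environment, inferred from the bullet list above (the checkpoint interval is an assumption; the rest mirrors the stated configuration):

val env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(8);              // matches the partition count of both topics
env.enableCheckpointing(10_000L);   // checkpointing is enabled; the interval is assumed
// Controls how often periodic watermark generators emit (see below).
env.getConfig().setAutoWatermarkInterval(AUTO_WATERMARK_INTERVAL_MS);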

Create KafkaSource

private static KafkaSource<JoinRecord> createKafkaSource(Config config, String topic) {
    val properties = KafkaConfigUtils.createConsumerConfig(config);

    val deserializationSchema = new KafkaRecordDeserializationSchema<JoinRecord>() {
        @Override
        public void deserialize(ConsumerRecord<byte[], byte[]> record, Collector<JoinRecord> out) {
            val m = JsonUtils.deserialize(record.value(), JoinRecord.class);

            val copy = m.toBuilder()
                    .partition(record.partition())
                    .build();

            out.collect(copy);
        }

        @Override
        public TypeInformation<JoinRecord> getProducedType() {
            return TypeInformation.of(JoinRecord.class);
        }
    };

    return KafkaSource.<JoinRecord>builder()
            .setProperties(properties)
            .setBootstrapServers(config.kafkaBootstrapServers)
            .setTopics(topic)
            .setGroupId(config.kafkaInputGroupIdPrefix + "-" + String.join("_", topic))
            .setDeserializer(deserializationSchema)
            .setStartingOffsets(OffsetsInitializer.latest())
            .build();
}
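
The createLeftKafkaSource used below is presumably a thin wrapper over this helper. A sketch of the assumed wiring (createRightKafkaSource is our assumption, mirroring the left side):

private static KafkaSource<JoinRecord> createLeftKafkaSource(Config config) {
    return createKafkaSource(config, "left");   // assumed: binds the "left" topic
}

private static KafkaSource<JoinRecord> createRightKafkaSource(Config config) {
    return createKafkaSource(config, "right");  // assumed: binds the "right" topic
}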

Create DataStreamSource

Then the DataStreamSource is built on top of the KafkaSource:

  • Configure "max out of orderness"
  • Configure "idleness"
  • Extract the timestamp from each record, to be used for event-time processing
private static DataStreamSource<JoinRecord> createLeftSource(Config config,
                                                             StreamExecutionEnvironment env) {
    val leftKafkaSource = createLeftKafkaSource(config);

    val leftWms = WatermarkStrategy
            .<JoinRecord>forBoundedOutOfOrderness(SOURCE_MAX_OUT_OF_ORDERNESS)
            .withIdleness(SOURCE_IDLE_TIMEOUT)
            .withTimestampAssigner((joinRecord, __) -> joinRecord.timestamp.toEpochSecond() * 1000L);

    return env.fromSource(leftKafkaSource, leftWms, "left-kafka-source");
}

Use keyBy

The keyed sources are created on top of the DataStreamSource instances like this:

  • Again configure "out of orderness" and "idleness"
  • Again extract the timestamp

    val leftWms = WatermarkStrategy
            .<JoinRecord>forBoundedOutOfOrderness(TRANSFORMATION_MAX_OUT_OF_ORDERNESS)
            .withIdleness(TRANSFORMATION_IDLE_TIMEOUT)
            .withTimestampAssigner((joinRecord, __) -> {
                if (VERBOSE_JOIN)
                    log.info("Left : " + joinRecord);
                return joinRecord.timestamp.toEpochSecond() * 1000L;
            });

    val leftKeyedSource = leftSource
            .keyBy(jr -> jr.id)
            .assignTimestampsAndWatermarks(leftWms)
            .name("left-keyed-source");

Join using coGroup

The join then combines the left and the right keyed sources:

    val joinedStream = leftKeyedSource
            .coGroup(rightKeyedSource)
            .where(left -> left.id)
            .equalTo(right -> right.id)
            .window(TumblingEventTimeWindows.of(JOIN_WINDOW_SIZE))
            .apply(new CoGroupFunction<JoinRecord, JoinRecord, JoinRecord>() {
                       @Override
                       public void coGroup(Iterable<JoinRecord> leftRecords, 
                                           Iterable<JoinRecord> rightRecords,
                                           Collector<JoinRecord> out) {
                           // Transform
                           val result = ...;

                           out.collect(result);
                       }
                   });
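
The transform itself is elided above. A minimal sketch of one plausible body, given the goal stated earlier (mergeRecords is an assumed helper, not from the original code):

    // Hypothetical sketch: emit a merged record for a matched pair, and
    // forward an unmatched record so it is not lost.
    val left = leftRecords.iterator();
    val right = rightRecords.iterator();
    if (left.hasNext() && right.hasNext()) {
        out.collect(mergeRecords(left.next(), right.next()));  // assumed helper
    } else if (left.hasNext()) {
        out.collect(left.next());    // the "right" record never arrived
    } else if (right.hasNext()) {
        out.collect(right.next());   // the "left" record never arrived
    }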

Write the stream to the console

The resulting joinedStream is written to the console:

    val consoleSink = new PrintSinkFunction<JoinRecord>();
    joinedStream.addSink(consoleSink);
My questions:

  • How can I configure this join operation so that all records are pushed downstream after the configured idle timeout?
  • If it can't be done this way: is there another option?

This is the expected behavior. withIdleness doesn't try to handle the case where all streams are idle. It only helps in cases where there are still events flowing from at least one source partition/shard/split.

To get the behavior you desire (in the context of a continuous streaming job), you'll have to implement a custom watermark strategy that advances the watermark based on a processing-time timer. One way to do this with the WatermarkGenerator API is sketched below.
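
A minimal sketch of such a generator (the class and field names are ours; it assumes event timestamps roughly track wall-clock time). onPeriodicEmit fires at the configured auto-watermark interval; once no event has been seen for longer than the idle timeout, the watermark is advanced from the processing-time clock so that open windows can still fire:

import org.apache.flink.api.common.eventtime.Watermark;
import org.apache.flink.api.common.eventtime.WatermarkGenerator;
import org.apache.flink.api.common.eventtime.WatermarkOutput;

public class ProcessingTimeAdvancingWatermarks implements WatermarkGenerator<JoinRecord> {

    private final long maxOutOfOrdernessMs;
    private final long idleTimeoutMs;

    private long maxTimestampSeen = Long.MIN_VALUE;
    private long lastEventWallClock = System.currentTimeMillis();

    public ProcessingTimeAdvancingWatermarks(long maxOutOfOrdernessMs, long idleTimeoutMs) {
        this.maxOutOfOrdernessMs = maxOutOfOrdernessMs;
        this.idleTimeoutMs = idleTimeoutMs;
    }

    @Override
    public void onEvent(JoinRecord event, long eventTimestamp, WatermarkOutput output) {
        maxTimestampSeen = Math.max(maxTimestampSeen, eventTimestamp);
        lastEventWallClock = System.currentTimeMillis();
    }

    @Override
    public void onPeriodicEmit(WatermarkOutput output) {
        long now = System.currentTimeMillis();
        if (now - lastEventWallClock > idleTimeoutMs) {
            // Stream is idle: advance the watermark from the wall clock so
            // that open event-time windows can fire. Caveat: records arriving
            // after this watermark will be treated as late.
            output.emitWatermark(new Watermark(now - maxOutOfOrdernessMs - 1));
        } else if (maxTimestampSeen != Long.MIN_VALUE) {
            // Normal bounded-out-of-orderness behavior.
            output.emitWatermark(new Watermark(maxTimestampSeen - maxOutOfOrdernessMs - 1));
        }
    }
}

It would be wired in where the question builds leftWms, for example:

val leftWms = WatermarkStrategy
        .<JoinRecord>forGenerator(ctx -> new ProcessingTimeAdvancingWatermarks(
                TRANSFORMATION_MAX_OUT_OF_ORDERNESS.toMillis(),
                TRANSFORMATION_IDLE_TIMEOUT.toMillis()))
        .withTimestampAssigner((joinRecord, __) -> joinRecord.timestamp.toEpochSecond() * 1000L);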

On the other hand, if the job is complete and you just want to drain the final results before shutting it down, you can use the --drain option when you stop the job. Or if you use bounded sources this will happen automatically.
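
For reference, stopping with a drain via the standard Flink CLI (the job id is a placeholder); --drain sends MAX_WATERMARK through the pipeline before the job stops, which fires all remaining event-time windows:

./bin/flink stop --drain <jobId>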
