How to drain the window after a Flink join using coGroup()?
I'd like to join data coming in from two Kafka topics ("left" and "right").
Matching records are to be joined using an ID, but if a "left" or a "right" record is missing, the other one should be passed downstream after a certain timeout. Therefore I have chosen to use the coGroup function.
This works, but there is one problem: if no further messages arrive, at least one record always stays in an internal buffer. It gets pushed out when new messages arrive. Otherwise it is stuck.
The expected behaviour is that all records are pushed out once the configured idle timeout has been reached.
Some information which might be relevant:
val is like final var
Some code snippets:
public static final int AUTO_WATERMARK_INTERVAL_MS = 500;
public static final Duration SOURCE_MAX_OUT_OF_ORDERNESS = Duration.ofMillis(4000);
public static final Duration SOURCE_IDLE_TIMEOUT = Duration.ofMillis(1000);
public static final Duration TRANSFORMATION_MAX_OUT_OF_ORDERNESS = Duration.ofMillis(5000);
public static final Duration TRANSFORMATION_IDLE_TIMEOUT = Duration.ofMillis(1000);
public static final Time JOIN_WINDOW_SIZE = Time.milliseconds(1500);
KafkaSource
private static KafkaSource<JoinRecord> createKafkaSource(Config config, String topic) {
    val properties = KafkaConfigUtils.createConsumerConfig(config);
    val deserializationSchema = new KafkaRecordDeserializationSchema<JoinRecord>() {
        @Override
        public void deserialize(ConsumerRecord<byte[], byte[]> record, Collector<JoinRecord> out) {
            val m = JsonUtils.deserialize(record.value(), JoinRecord.class);
            // Copy the record and remember which Kafka partition it came from
            val copy = m.toBuilder()
                .partition(record.partition())
                .build();
            out.collect(copy);
        }
        @Override
        public TypeInformation<JoinRecord> getProducedType() {
            return TypeInformation.of(JoinRecord.class);
        }
    };
    return KafkaSource.<JoinRecord>builder()
        .setProperties(properties)
        .setBootstrapServers(config.kafkaBootstrapServers)
        .setTopics(topic)
        .setGroupId(config.kafkaInputGroupIdPrefix + "-" + String.join("_", topic))
        .setDeserializer(deserializationSchema)
        .setStartingOffsets(OffsetsInitializer.latest())
        .build();
}
DataStreamSource
Then the DataStreamSource is built on top of the KafkaSource:
private static DataStreamSource<JoinRecord> createLeftSource(Config config,
                                                             StreamExecutionEnvironment env) {
    val leftKafkaSource = createLeftKafkaSource(config);
    val leftWms = WatermarkStrategy
        .<JoinRecord>forBoundedOutOfOrderness(SOURCE_MAX_OUT_OF_ORDERNESS)
        .withIdleness(SOURCE_IDLE_TIMEOUT)
        .withTimestampAssigner((joinRecord, __) -> joinRecord.timestamp.toEpochSecond() * 1000L);
    return env.fromSource(leftKafkaSource, leftWms, "left-kafka-source");
}
keyBy
The keyed sources are created on top of the DataStreamSource instances like this:
Again configure "out of orderness" and "idleness"
Again extract timestamp
val leftWms = WatermarkStrategy
    .<JoinRecord>forBoundedOutOfOrderness(TRANSFORMATION_MAX_OUT_OF_ORDERNESS)
    .withIdleness(TRANSFORMATION_IDLE_TIMEOUT)
    .withTimestampAssigner((joinRecord, __) -> {
        if (VERBOSE_JOIN) log.info("Left : " + joinRecord);
        return joinRecord.timestamp.toEpochSecond() * 1000L;
    });
val leftKeyedSource = leftSource
    .keyBy(jr -> jr.id)
    .assignTimestampsAndWatermarks(leftWms)
    .name("left-keyed-source");
coGroup
The join then combines the left and the right keyed sources:
val joinedStream = leftKeyedSource
    .coGroup(rightKeyedSource)
    .where(left -> left.id)
    .equalTo(right -> right.id)
    .window(TumblingEventTimeWindows.of(JOIN_WINDOW_SIZE))
    .apply(new CoGroupFunction<JoinRecord, JoinRecord, JoinRecord>() {
        @Override
        public void coGroup(Iterable<JoinRecord> leftRecords,
                            Iterable<JoinRecord> rightRecords,
                            Collector<JoinRecord> out) {
            // Transform
            val result = ...;
            out.collect(result);
        }
    });
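The transform step is elided in the snippet above. Purely as an illustration of the behaviour described at the top (join matching records, otherwise pass the lone side through), a dependency-free sketch could look like this. The Rec class, its fields, and joinWindow are hypothetical stand-ins, not the actual JoinRecord or the author's transform.

```java
import java.util.ArrayList;
import java.util.List;

final class OuterJoinSketch {
    // Hypothetical stand-in for JoinRecord: an id plus one payload per side.
    static final class Rec {
        final String id;
        final String left;
        final String right;
        Rec(String id, String left, String right) {
            this.id = id;
            this.left = left;
            this.right = right;
        }
    }

    // Models what a coGroup body could do for one key and one window.
    static List<Rec> joinWindow(Iterable<Rec> lefts, Iterable<Rec> rights) {
        List<Rec> out = new ArrayList<>();
        boolean hasRight = rights.iterator().hasNext();
        boolean hasLeft = false;
        for (Rec l : lefts) {
            hasLeft = true;
            if (!hasRight) {
                out.add(l); // no partner in this window: pass the left record through
            } else {
                for (Rec r : rights) {
                    out.add(new Rec(l.id, l.left, r.right)); // matched pair
                }
            }
        }
        if (!hasLeft) {
            for (Rec r : rights) {
                out.add(r); // no partner in this window: pass the right record through
            }
        }
        return out;
    }
}
```

Inside a real CoGroupFunction the same branches would call out.collect(...) instead of filling a list.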
The resulting joinedStream is written to the console:
val consoleSink = new PrintSinkFunction<JoinRecord>();
joinedStream.addSink(consoleSink);
This is the expected behavior. withIdleness doesn't try to handle the case where all streams are idle. It only helps in cases where there are still events flowing from at least one source partition/shard/split.
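To see why the last record waits for newer input, it helps to work through the numbers. The sketch below is plain Java with no Flink dependency. It assumes the 1500 ms tumbling window and the 5000 ms out-of-orderness bound from the question, a bounded-out-of-orderness watermark of "highest timestamp seen minus the bound minus 1 ms", and an event-time window that fires once the watermark reaches its last millisecond. The class and method names are made up for illustration.

```java
final class WindowFiringSketch {
    // Start of the tumbling window (offset 0) containing a non-negative timestamp.
    static long windowStart(long timestampMs, long sizeMs) {
        return timestampMs - (timestampMs % sizeMs);
    }

    // Watermark produced by a bounded-out-of-orderness strategy after seeing maxTimestampMs.
    static long boundedOutOfOrdernessWatermark(long maxTimestampMs, long boundMs) {
        return maxTimestampMs - boundMs - 1;
    }

    // The window holding recordTimestampMs fires once the watermark reaches its last millisecond.
    static boolean windowFires(long recordTimestampMs, long sizeMs, long watermarkMs) {
        long windowMaxTimestamp = windowStart(recordTimestampMs, sizeMs) + sizeMs - 1;
        return watermarkMs >= windowMaxTimestamp;
    }

    public static void main(String[] args) {
        long size = 1500;  // JOIN_WINDOW_SIZE
        long bound = 5000; // TRANSFORMATION_MAX_OUT_OF_ORDERNESS
        long last = 10_000; // timestamp of the last record before the input goes quiet

        long wm = boundedOutOfOrdernessWatermark(last, bound);
        System.out.println(wm);                          // 4999
        System.out.println(windowFires(last, size, wm)); // false: the window [9000, 10500) stays open

        // Only a later event with timestamp >= 15500 moves the watermark far enough.
        long wmLater = boundedOutOfOrdernessWatermark(15_500, bound);
        System.out.println(windowFires(last, size, wmLater)); // true
    }
}
```

With no later event the watermark stays at 4999, so the window containing the last record never closes.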
To get the behavior you desire (in the context of a continuous streaming job), you'll have to implement a custom watermark strategy that advances the watermark based on a processing time timer. Here's an implementation that uses the legacy watermark API.
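The core idea can also be sketched without Flink on the classpath. The class below is a hypothetical model, not a Flink API: once no event has arrived for longer than an idle timeout, it extrapolates event time from the wall clock so the watermark keeps moving. In a real job the two methods would roughly correspond to the onEvent and onPeriodicEmit callbacks of a WatermarkGenerator, with nowMs taken from the system clock. The class name, fields, and the extrapolation rule are assumptions made for illustration.

```java
final class IdleAdvancingWatermark {
    private final long maxOutOfOrdernessMs;
    private final long idleTimeoutMs;
    private long maxTimestamp = Long.MIN_VALUE; // highest event timestamp seen so far
    private long lastEventWallClock;            // processing time of the most recent event
    private boolean seenEvent = false;
    private long emitted = Long.MIN_VALUE;      // last watermark handed out (never decreases)

    IdleAdvancingWatermark(long maxOutOfOrdernessMs, long idleTimeoutMs) {
        this.maxOutOfOrdernessMs = maxOutOfOrdernessMs;
        this.idleTimeoutMs = idleTimeoutMs;
    }

    // Called for every event; nowMs is the current processing time.
    void onEvent(long eventTimestampMs, long nowMs) {
        maxTimestamp = Math.max(maxTimestamp, eventTimestampMs);
        lastEventWallClock = nowMs;
        seenEvent = true;
    }

    // Called periodically; returns the watermark to emit at processing time nowMs.
    long onPeriodicEmit(long nowMs) {
        if (!seenEvent) {
            return emitted; // nothing to base a watermark on yet
        }
        long idleFor = nowMs - lastEventWallClock;
        long eventTime = maxTimestamp;
        if (idleFor > idleTimeoutMs) {
            // Idle: let event time follow the wall clock beyond the timeout.
            eventTime = maxTimestamp + (idleFor - idleTimeoutMs);
        }
        emitted = Math.max(emitted, eventTime - maxOutOfOrdernessMs - 1);
        return emitted;
    }

    public static void main(String[] args) {
        IdleAdvancingWatermark wm = new IdleAdvancingWatermark(5000, 1000);
        wm.onEvent(10_000, 0);
        System.out.println(wm.onPeriodicEmit(500));   // 4999: not idle yet
        System.out.println(wm.onPeriodicEmit(6_500)); // 10499: enough to close the window [9000, 10500)
    }
}
```

The trade-off of this design is that records arriving after such an idle period with older timestamps may be treated as late, because the watermark has already moved past them.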
On the other hand, if the job is complete and you just want to drain the final results before shutting it down, you can use the --drain option when you stop the job. Or if you use bounded sources this will happen automatically.