
Merging multiple identical Kafka Streams topics

I have 2 Kafka topics streaming the exact same content from different sources so I can have high availability in case one of the sources fails. I'm attempting to merge the 2 topics into 1 output topic using Kafka Streams 0.10.1.0 such that I don't miss any messages on failures and there are no duplicates when all sources are up.

When using the leftJoin method of KStream, one of the topics can go down with no problem (the secondary topic), but when the primary topic goes down, nothing is sent to the output topic. This seems to be because, according to the Kafka Streams developer guide,

KStream-KStream leftJoin is always driven by records arriving from the primary stream

so if there are no records coming from the primary stream, it will not use the records from the secondary stream even if they exist. Once the primary stream comes back online, output resumes normally.

I've also tried using outerJoin (which adds duplicate records) followed by a conversion to a KTable and groupByKey to get rid of duplicates,

KStream mergedStream = stream1.outerJoin(stream2,
    (streamVal1, streamVal2) -> (streamVal1 == null) ? streamVal2 : streamVal1,
    JoinWindows.of(2000L));

mergedStream.groupByKey()
            .reduce((value1, value2) -> value1, TimeWindows.of(2000L), stateStore)
            .toStream((key, value) -> value)
            .to(outputStream);

but I still get duplicates once in a while. I'm also using commit.interval.ms=200 to get the KTable to send to the output stream often enough.
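For reference, this is how the commit interval is applied: it goes into the streams configuration passed to KafkaStreams at startup. A minimal sketch, with the config key written as a plain string (the same string that the StreamsConfig constant resolves to) so the snippet stands alone:

```java
import java.util.Properties;

class StreamsProps {
    static Properties build() {
        Properties props = new Properties();
        // flush the KTable cache and forward records downstream every 200 ms
        props.put("commit.interval.ms", "200");
        return props;
    }
}
```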

What would be the best way to approach this merge to get exactly-once output from multiple identical input topics?

Using any kind of join will not solve your problem, as you will always end up with either missing results (inner join, in case one stream stalls) or "duplicates" with null (left join or outer join, in case both streams are online). See https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Streams+Join+Semantics for details on join semantics in Kafka Streams.
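The duplicate behaviour follows directly from outer-join semantics: each arriving record fires the ValueJoiner, so an identical record delivered on both topics within the join window produces two downstream results. A plain-Java simulation of that firing pattern (not Kafka Streams code, just an illustration using the ValueJoiner from the question):

```java
import java.util.ArrayList;
import java.util.List;

class OuterJoinDemo {
    // the ValueJoiner from the question
    static String join(String v1, String v2) {
        return (v1 == null) ? v2 : v1;
    }

    static List<String> simulate() {
        List<String> output = new ArrayList<>();
        // primary record arrives first: no match yet, outer join emits join(v1, null)
        output.add(join("payload", null));
        // identical secondary record arrives within the window: emits join(v1, v2)
        output.add(join("payload", "payload"));
        return output; // two identical results reach the downstream topic
    }
}
```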

Thus, I would recommend using the Processor API, which you can mix and match with the DSL via KStream process(), transform(), or transformValues(). See How to filter keys and value with a Processor using Kafka Stream DSL for more details.

You can also add a custom store to your processor (How to add a custom StateStore to the Kafka Streams DSL processor?) to make the duplicate filtering fault-tolerant.
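A minimal sketch of the per-record duplicate-filtering logic such a processor would run. The class and method names here are illustrative (not Kafka Streams API), and a plain HashMap stands in for the persistent StateStore that would make this fault-tolerant in a real topology:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the filtering a Transformer would apply to each record.
// In a real topology the Map would be a KeyValueStore attached to the
// processor so that the seen-keys set survives restarts.
class DedupFilter<K, V> {
    private final Map<K, V> seen = new HashMap<>();

    // Return the value for first-seen keys; return null (i.e. drop the
    // record) for keys already emitted from the other source topic.
    V apply(K key, V value) {
        if (seen.containsKey(key)) {
            return null;
        }
        seen.put(key, value);
        return value;
    }
}
```

In practice you would also expire old entries (e.g. with a windowed store or punctuate()) so the store does not grow without bound.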
