
Streaming messages from one Kafka Cluster to another

I'm currently trying to, easily, stream messages from a Topic on one Kafka cluster to another one (Remote -> Local Cluster).
The idea is to use Kafka-Streams right away, so that we don't need to replicate the actual messages on the local cluster but only get the "results" of the Kafka-Streams processing into our Kafka-Topics.

So let's say the WordCount demo runs on a Kafka instance on another PC than my own. I also have a Kafka instance running on my local machine.
Now I want to let the WordCount demo run on the Topic ("remote") containing the sentences whose words should be counted.
The counting, however, should be written to a Topic on my local system instead of a "remote" Topic.

Is something like this doable with the Kafka-Streams API? E.g.:

// Pseudocode: passing a per-cluster config per topic, as below, is the hypothetical
// API I am asking about, not something Kafka Streams actually offers.
import scala.collection.JavaConverters._

val builder: KStreamBuilder = new KStreamBuilder(remoteStreamConfig, localStreamConfig)
val textLines: KStream[String, String] =
  builder.stream("remote-input-topic", remoteStreamConfig)
val wordCounts: KTable[String, Long] = textLines
  .flatMapValues(textLine => textLine.toLowerCase.split("\\W+").toIterable.asJava)
  .groupBy((_, word) => word)
  .count("word-counts")

wordCounts.to(stringSerde, longSerde, "local-output-topic", localStreamConfig)

val streams: KafkaStreams = new KafkaStreams(builder)
streams.start()

Thank you very much
- Tim

Kafka Streams is built for a single cluster only.
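To illustrate: a Streams application is configured through one StreamsConfig with one bootstrap.servers list, so there is no way to point the input topic at one cluster and the output topic at another. A minimal sketch of the (old, KStreamBuilder-based) setup, where the application id and broker address are made-up values:

import java.util.Properties
import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}
import org.apache.kafka.streams.kstream.KStreamBuilder

val props = new Properties()
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount-app")          // assumed app id
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "remote-broker:9092")  // exactly one cluster

val builder = new KStreamBuilder() // build the topology against this builder
// ... all input and output topics of that topology live in this one cluster ...
val streams = new KafkaStreams(builder, new StreamsConfig(props))
streams.start()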

A workaround is to use a foreach() (or similar) and instantiate your own KafkaProducer that writes to the target cluster. Note that your own producer must use synchronous writes! Otherwise, you might lose data in case of failure. Thus, it's not a very performant solution.
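A rough sketch of that workaround, building on the WordCount topology from the question (the broker address, topic name, and serializer choices are assumptions):

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.kafka.common.serialization.{LongSerializer, StringSerializer}

// Producer pointing at the target (local) cluster; the address is an assumption.
val targetProps = new Properties()
targetProps.put("bootstrap.servers", "localhost:9092")
targetProps.put("key.serializer", classOf[StringSerializer].getName)
targetProps.put("value.serializer", classOf[LongSerializer].getName)
val targetProducer = new KafkaProducer[String, java.lang.Long](targetProps)

// Instead of wordCounts.to(...): forward every result record yourself.
wordCounts.toStream.foreach { (word: String, count: java.lang.Long) =>
  // Block on the future (synchronous write); otherwise a failure after the
  // Streams app has committed its offsets would mean silently lost data.
  targetProducer.send(new ProducerRecord("local-output-topic", word, count)).get()
}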

It's better to just write the result to the source cluster and replicate the data to the target cluster. Note that you can most likely use a much shorter retention period for the output topic in the source cluster, as the actual data is stored with a longer retention time in the target cluster anyway. This lets you limit the required storage on the source cluster.
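As a sketch of the short-retention idea (topic name, partition and replica counts, retention value, and broker address are all assumptions), the output topic in the source cluster could be created with a small retention.ms via the AdminClient:

import java.util.{Collections, Properties}
import scala.collection.JavaConverters._
import org.apache.kafka.clients.admin.{AdminClient, NewTopic}
import org.apache.kafka.common.config.TopicConfig

val adminProps = new Properties()
adminProps.put("bootstrap.servers", "remote-broker:9092") // source cluster, assumed address
val admin = AdminClient.create(adminProps)

// Keep the Streams output for only one hour in the source cluster; the replicated
// copy in the target cluster holds it with the long retention you actually need.
val outputTopic = new NewTopic("wordcount-output", 3, 1.toShort)
  .configs(Map(TopicConfig.RETENTION_MS_CONFIG -> (60L * 60 * 1000).toString).asJava)

admin.createTopics(Collections.singleton(outputTopic)).all().get()
admin.close()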

Edit (reply to the comment below from @quickinsights)

What if your Kafka Streams service is down for a longer period than the retention?

That seems to be an orthogonal concern that can be raised for any design. In general, retention time should be set based on your maximum expected downtime to avoid data loss. Note though, that because the application reads from and writes to the source cluster, and the source cluster's output topic may be configured with a small retention time, nothing bad happens if the application goes down: the input topic will simply not be processed and no new output data is produced. You only need to worry about the case in which your replication pipeline into the target cluster goes down; you should set the retention time of the output topic in the source cluster accordingly to make sure you don't lose any data.

It also doubles your writes back to Kafka.

Yes. It also increases the storage footprint on disk. It's a tradeoff (as always) between application resilience and runtime performance vs. cluster load. Your choice. I would personally recommend going with the more resilient option, as pointed out above. It's easier to scale out your Kafka cluster than to handle all the resilience edge cases in your application code.

That seems super inefficient

That's a personal judgment call. It's a tradeoff and there is no objective right or wrong.
