
Kafka: joining events from multiple topics

We have been using Kafka for over a year now and want to move on to a deeper integration. But there is one concept I struggle with. I will try to explain what we want to achieve and the solution we came up with. From my perspective, it's not really an elegant way, which is why I'm questioning whether I got it right.

The Problem

We have one stream of complex, structured events (nested structure). A consumer takes those events, rips them apart, and puts the pieces into separate topics. Behind each topic are many other services that enrich the events flowing through those topics in a stream-processing fashion. At the end we have a number of topics, each carrying partially enriched events, and we want to merge them all back together into one complete event at the end of the entire process. But this is easier said than done.
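To make the setup concrete, here is a minimal sketch (not the poster's actual code) of the splitting step: one nested event is broken into per-topic fragments that all carry a shared correlation key, so they can be matched up again after enrichment. The topic naming scheme and field names are assumptions for illustration only.

```python
def split_event(event: dict) -> dict:
    """Map each nested section of an event to its own (hypothetical) topic.

    Every fragment carries the parent event's id so downstream services
    can correlate the pieces again after enrichment.
    """
    event_id = event["id"]
    fragments = {}
    for section, payload in event.items():
        if section == "id":
            continue
        # hypothetical naming scheme: one topic per nested section
        fragments[f"events.{section}"] = {"event_id": event_id, "data": payload}
    return fragments

# Example: one nested event becomes one message per sub-structure.
nested = {"id": "e-1", "customer": {"name": "Ada"}, "order": {"total": 42}}
parts = split_event(nested)
# parts maps "events.customer" and "events.order" to fragments
# that both carry event_id "e-1"
```

In a real deployment each fragment would be produced to its topic with `event_id` as the message key, so that all pieces of one event land on the same partition of the final join stage.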

The Solution

At the end we have one service that consumes all the topics, buffers the partial events until they have all flowed in, puts them back together, and publishes the result to a new topic. The challenge is to make sure we only produce complete events in the final topic. This works but has some pitfalls:

  • the buffer cannot be internal; it has to be something external through which multiple consumers can share information
  • we can theoretically run into timing issues and create dead entries
  • we can have consistency issues
  • and so on
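The "buffer until complete" idea can be sketched as follows. For brevity the buffer is a plain in-process dict here; as the first pitfall above notes, in practice it would have to live in an external store (a compacted topic, a state store, or a database) shared by all consumer instances. The set of expected parts and all names are illustrative assumptions.

```python
EXPECTED_PARTS = {"customer", "order"}  # assumption: the set of partial topics

buffer = {}  # event_id -> {part_name: payload}; external store in reality

def on_partial(event_id: str, part: str, payload):
    """Store one partial event; return the assembled event once all
    expected parts have arrived, otherwise None."""
    parts = buffer.setdefault(event_id, {})
    parts[part] = payload
    if set(parts) == EXPECTED_PARTS:
        # all pieces present: assemble and clear the buffer entry
        return {"id": event_id, **buffer.pop(event_id)}
    return None  # still waiting; nothing is produced to the final topic

assert on_partial("e-1", "customer", {"name": "Ada"}) is None  # incomplete
done = on_partial("e-1", "order", {"total": 42})
# done now holds the reassembled event for "e-1"
```

Note that this sketch has exactly the weaknesses listed above: an entry whose last part never arrives stays in the buffer forever unless some timeout/eviction policy is added.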

My Question

Even though it works, I don't think this is a very elegant approach. Are we on the right track, or did we misunderstand something in the concepts and handling of Kafka events and stream processing? Are there better ways to do it? Does anybody have experience with this and can share some learnings, or ways to integrate it in a stable way?

Thanks. Any comments are very much appreciated.

We have gone through a very similar use case and architecture. I understand you are splitting the initial nested message into multiple topics to increase parallelism and therefore throughput.

In our experience, this leads to a highly complex architecture, because joining streams (as you have already described) can be very hard to operate. The main issues we had were:

  • If one of the enrichment jobs is failing, how long should all the other messages wait? It could take a few hours or a few days to fix the bug.
  • If those enrichment jobs depend on an external system and it is not reachable for some time, how long should you wait until it is available again?

In our experience, the approach you are describing brings in a lot of complexity and (sometimes uncontrollable) dependencies.

In the end we kept all the data together and significantly increased the number of partitions of the topic to increase throughput. That way each message is consistent in itself, and if there is a problem with any enrichment, the entire message is impacted and not just part of it. To reduce the complexity of any single job, we buffered interim data in Kafka topics; in your case this could mean running the various enrichment jobs in sequence rather than in parallel, with a topic between each of them. That way each job stays reasonably small, and you can make use of the replay functionality that comes with Kafka.
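The sequential alternative described above can be sketched like this: each enrichment stage operates on the complete message and hands it on, with a Kafka topic between stages serving as a replayable buffer. The stage functions, fields, and topic names in the comments are hypothetical, not the answerer's actual jobs.

```python
def enrich_customer(event: dict) -> dict:
    # stage 1: e.g. consume "orders.raw", produce to "orders.customer"
    return {**event, "customer_segment": "retail"}

def enrich_pricing(event: dict) -> dict:
    # stage 2: e.g. consume "orders.customer", produce to "orders.enriched"
    return {**event, "discount": 0.1}

PIPELINE = [enrich_customer, enrich_pricing]

def run_pipeline(event: dict) -> dict:
    # Each stage sees the whole message, so no join is ever needed.
    # If a stage fails, the flow simply stops at its input topic and
    # can be replayed from there once the stage is fixed.
    for stage in PIPELINE:
        event = stage(event)
    return event

result = run_pipeline({"id": "e-1", "total": 42})
# result carries the original fields plus both enrichments
```

The trade-off is latency (stages run one after another rather than in parallel), but each message stays internally consistent and there is no join state to operate.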

Operating stream joins is one of the most sophisticated things to do, and I recommend avoiding it as much as possible, unless short-term inconsistencies are acceptable and you are not required to process every message but may instead drop one or the other.
