
Kafka dynamic indirection with pub/sub like messaging

I am new to Kafka, but I already have quite a challenging problem to solve.

Before describing the problem: my application deals with spatial indexing and geographic data coordination, so I genuinely need the type of re-routing described below, for good reasons.

I need to achieve the following event flow:

  • There are n instances of my application, and a large, variable number of data objects globally.
  • Each instance knows only a subset of the global objects, but multiple instances may know about the same data object.
  • However, when a global object changes (including objects unknown to an instance at that point), the change must be propagated to every instance that knows this data object. In that sense, "instances" are subscribers to certain data objects.
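The subscription relationship described by these bullets can be sketched as a small in-memory model. This is purely illustrative (the class and names are invented, not part of any Kafka API); it only pins down what "instances are subscribers to certain data objects" has to mean for the routing below:

```python
from collections import defaultdict


class SubscriptionRegistry:
    """Maps each object id to the set of instance ids that know it."""

    def __init__(self):
        self._subs = defaultdict(set)

    def register(self, object_id, instance_id):
        # An instance declares that it knows (subscribes to) this object.
        self._subs[object_id].add(instance_id)

    def subscribers(self, object_id):
        # All instances that must receive a change to this object.
        return self._subs.get(object_id, set())


registry = SubscriptionRegistry()
registry.register("obj-1", "instance-A")
registry.register("obj-1", "instance-B")  # two instances share obj-1
registry.register("obj-2", "instance-A")

# A change to obj-1 must reach both A and B:
print(sorted(registry.subscribers("obj-1")))  # ['instance-A', 'instance-B']
```

In a real deployment this map would live in Kafka itself (the compacted topic discussed in Question 1) rather than in process memory.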

Question 1: Is it wise to use Kafka with log compaction enabled to maintain a strongly consistent list of subscribers to an object? E.g.:

  • A topic named changeevents where all instances may publish changed data as required for this edge case.
  • A topic named pubsub containing a map from object id to subscriber topics:

ObjectId: [subscriberId1Topic, subscriberId2Topic]
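For context on what compaction buys here, a minimal sketch of log-compaction semantics for this hypothetical pubsub topic: Kafka retains (at least) the latest record per key, so replaying the topic rebuilds the current objectId-to-subscribers mapping. The topic and object names are invented:

```python
def compact(records):
    """records: iterable of (key, value) in log order; value None = tombstone.

    Models what a consumer materializes after reading a compacted topic:
    later records overwrite earlier ones, tombstones delete the key.
    """
    state = {}
    for key, value in records:
        if value is None:
            state.pop(key, None)   # tombstone removes the subscription entry
        else:
            state[key] = value     # last write per key wins
    return state


log = [
    ("obj-1", ["subscriberId1Topic"]),
    ("obj-1", ["subscriberId1Topic", "subscriberId2Topic"]),  # list updated
    ("obj-2", ["subscriberId3Topic"]),
    ("obj-2", None),  # all subscribers of obj-2 went away
]
print(compact(log))  # {'obj-1': ['subscriberId1Topic', 'subscriberId2Topic']}
```

Note that compaction gives eventual, not strong, consistency: a consumer may briefly see a stale subscriber list until it has caught up with the log.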

Question 2: What options does Kafka give me to make this re-routing work in the most scalable, low-latency way? Is it possible to create dynamic routing in place, e.g. have a stream of change events and let Kafka place each change event onto all subscriber topics?
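The fan-out step this question asks about can be sketched as a pure function, independent of any broker: a router reads one change event, looks up the subscribers in the materialized pubsub mapping, and emits the event once per subscriber topic. All names here are assumptions for illustration:

```python
def route(change_event, subscriptions):
    """Return the (topic, event) pairs a router would produce for one event."""
    object_id = change_event["objectId"]
    return [(topic, change_event)
            for topic in subscriptions.get(object_id, [])]


subscriptions = {"obj-1": ["instanceA-topic", "instanceB-topic"]}
event = {"objectId": "obj-1", "payload": "new-geometry"}

for topic, msg in route(event, subscriptions):
    # In a real deployment this would be producer.send(topic, msg);
    # with Kafka Streams, a TopicNameExtractor plays the same role.
    print(topic, msg["payload"])
```

The cost noted later in the question is visible here: one input event becomes N output records, one per subscriber topic, so messages are duplicated by design.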

Question 3: This all seems a bit complicated. My scenario is fairly unique, yet I hope I'm missing something that would make it less complicated?

It would be a valid question to ask at this point why I chose Kafka for what looks like a publish/subscribe problem. First, the data flow between the backend instances does not, in the common case, require this type of re-routing; the problem above affects less than 1% of the total data to be processed. Second, I am also investigating Apache Pulsar, which seems to have better support for publish/subscribe scenarios. Where I struggle is that my application is deployed by customers, and Pulsar has a far lower adoption/acceptance rate.

I researched routing options in Kafka, and the closest match to this problem I could find is the dynamic routing pattern described here: https://www.confluent.io/blog/putting-events-in-their-place-with-dynamic-routing/

As I see it, I would need an additional data source to maintain the pub/sub list, plus custom processors that place messages onto the related subscriber topics, all at the cost of duplicating messages.

Sure, a compacted topic seems like a reasonable approach if you want eventually consistent, unique data. You'll need a GlobalKTable, as the linked post describes, to query against that data, though.

"Topics" themselves do not subscribe. You'd need a consumer that reads all of changeevents, then filters/branches into the downstream "client topics". This will likely be the largest bottleneck; the main way to scale it would be many partitions, with record keys that map somehow to geographic regions or some other unique identifier.
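The keying suggestion above can be illustrated with a sketch: derive the record key from a coarse geographic grid cell so that co-located objects hash to the same partition. The grid scheme is an assumption for illustration, not anything Kafka provides; Kafka's default partitioner really does hash the key (with murmur2) modulo the partition count:

```python
def region_key(lat, lon, cell_deg=10):
    """Bucket coordinates into a coarse grid cell, used as the record key."""
    return f"{int(lat // cell_deg)}:{int(lon // cell_deg)}"


def partition_for(key, num_partitions):
    # Stand-in for Kafka's default partitioner (which uses murmur2 on the
    # key bytes); hash() here is only for illustration.
    return hash(key) % num_partitions


k1 = region_key(52.5, 13.4)   # somewhere near Berlin
k2 = region_key(53.9, 10.7)   # a nearby city, same 10-degree cell
print(k1, k2, k1 == k2)       # 5:1 5:1 True
```

Events keyed this way preserve per-region ordering and let you scale the router consumer group up to the partition count.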

It's unclear what your output data looks like. If you're publishing notifications or displaying some kind of map, you'll need another system listening only to "local" geo-fenced events. If you need to read those events multiple times, you'd probably need more than Kafka, such as Elasticsearch geospatial queries, or another system that supports geo-point data (I recall there being a GIS extension for Postgres). E.g., use Kafka Connect to write into that system, and anything that needs those events just queries the database.
