
What is the point of using Kafka in this example, and why not use a DB straightaway?

Asked 2022-08-10 21:22:22 | Tags: apache-kafka, architecture

Here is an example of how Kafka should be run for a social network site, but it is hard for me to understand the point of Kafka here. We would not want to store posts and likes in Kafka, as they will be deleted after some time. So Kafka should be an intermediate store between the View and the DB. But why would we need it? Wouldn't it be better to use the DB straightaway?

I guess that we could use Kafka as some kind of cache, so that data accumulates in Kafka and we can then insert it into the DB in one big batch query. But I am pretty sure that is not the reason Kafka is used here.

1 Answer

What's not shown in the diagram are the processes querying the database (RocksDB, in this case). Without Kafka Streams, you'd need to write some external service to run GROUP BY / SUM on the database.
With Kafka Streams Interactive Queries, that logic can be moved closer to the actual event source and is performed in near real time, rather than as a polling batch. In a streaming framework, you could also send out individual event hooks (websockets, for example) to dynamically update "likes per post", "shares per post", "trends", etc., without needing the user to refresh the page, or having the page make AJAX calls with large API responses to fetch those details for every rendered item.
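
To make the contrast with a polling GROUP BY concrete, here is a minimal in-memory sketch in Python. It does not use Kafka or Kafka Streams at all: `LikeCounter`, `on_event`, `likes_for`, and the `(post_id, user_id)` event shape are hypothetical names chosen for illustration. It only shows the idea that the aggregate is updated as each event arrives, can be looked up directly, and can push changes to subscribers (a plain callback standing in for a websocket hook).

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

# Hypothetical event shape: (post_id, user_id) for one "like".
LikeEvent = Tuple[str, str]


class LikeCounter:
    """Toy stand-in for a streaming aggregation with a queryable local state."""

    def __init__(self) -> None:
        self._counts: Dict[str, int] = defaultdict(int)
        self._subscribers: List[Callable[[str, int], None]] = []

    def subscribe(self, callback: Callable[[str, int], None]) -> None:
        # A real system might push over a websocket; here it is a plain callback.
        self._subscribers.append(callback)

    def on_event(self, event: LikeEvent) -> None:
        post_id, _user_id = event
        self._counts[post_id] += 1  # incremental update, no re-scan of history
        for notify in self._subscribers:
            notify(post_id, self._counts[post_id])

    def likes_for(self, post_id: str) -> int:
        # Point lookup against local state, similar in spirit to an interactive query.
        return self._counts.get(post_id, 0)


def batch_group_by(events: List[LikeEvent]) -> Dict[str, int]:
    """The polling alternative: re-aggregate the full history on every run."""
    totals: Dict[str, int] = defaultdict(int)
    for post_id, _user_id in events:
        totals[post_id] += 1
    return dict(totals)


events = [("post-1", "alice"), ("post-2", "bob"), ("post-1", "carol")]
counter = LikeCounter()
pushed: List[Tuple[str, int]] = []
counter.subscribe(lambda post_id, count: pushed.append((post_id, count)))
for e in events:
    counter.on_event(e)
```

Both approaches end with the same totals; the difference is that the streaming version has an up-to-date answer after every event, while the batch version has to be re-run (and re-read everything) to catch up.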

More specifically, each Kafka Streams instance serves a specific query, rather than the API hitting one database for all queries. Therefore, load is better distributed and the system is more fault tolerant.

It's worth pointing out that Apache Pinot, loaded from Kafka, is better suited to such real-time analytical queries than Kafka Streams.

Also, as you pointed out, Kafka or any message queue would act as a buffer ahead of any database (not a cache, although Redis could be added as a cache, just like the search service mentioned below). And there's nothing preventing you from adding another database that is fed through a Kafka Connect sink. For instance, a popular design is to write data to an RDBMS as well as to Elasticsearch for text-based search indexing. The producer code only cares about one Kafka topic, not every downstream system where the data is needed.
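
The buffering and fan-out point can be sketched with a toy in-memory log in Python. This is not the Kafka or Kafka Connect API: `Topic`, `BatchingSink`, `poll`, and the batch sizes are hypothetical, and the "sinks" just record what they would have written. The sketch assumes each consumer keeps its own read offset, so the producer appends once and each downstream system reads in batches at its own pace.

```python
from typing import List


class Topic:
    """Toy append-only log; a stand-in for one Kafka topic, not the Kafka API."""

    def __init__(self) -> None:
        self._log: List[dict] = []

    def append(self, record: dict) -> None:
        self._log.append(record)

    def read_from(self, offset: int, max_records: int) -> List[dict]:
        return self._log[offset:offset + max_records]


class BatchingSink:
    """Hypothetical consumer that tracks its own offset and writes in batches."""

    def __init__(self, name: str, topic: Topic, batch_size: int) -> None:
        self.name = name
        self.topic = topic
        self.batch_size = batch_size
        self.offset = 0
        self.batches: List[List[dict]] = []  # each inner list models one bulk write

    def poll(self) -> int:
        records = self.topic.read_from(self.offset, self.batch_size)
        if records:
            self.batches.append(records)
            self.offset += len(records)
        return len(records)


posts = Topic()
rdbms = BatchingSink("rdbms", posts, batch_size=3)
search = BatchingSink("search-index", posts, batch_size=2)

# The producer only knows about the topic, not about either sink.
for i in range(5):
    posts.append({"post_id": i, "text": f"post {i}"})

rdbms.poll()           # one bulk write covering records 0 to 2
search.poll()          # records 0 and 1, independent of the RDBMS sink
while search.poll():   # this consumer keeps polling until it has caught up
    pass
```

Adding a third downstream system here means adding another `BatchingSink`; the producer loop does not change, which is the decoupling described above.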