Join table with Kafka Streams / KSQL?

I'm importing a DB which contains some link tables representing both many-to-many and one-to-many relationships.

Let's focus for now on the one-to-many relationship. For example, a BioAssay can have many documents, but a document can only have one BioAssay.

Hence I have a BioAssay table [BioAssay, ..., ..., ...] and a link table [Document, BioAssay].

Ultimately I need to join those 2 into a full BioAssay with all its documents, e.g. [BioAssayxyz, ...., "Document1:Document2:Document3"].

I wonder if anyone here could give me a sense of what needs to happen with Kafka Streams?

1 - So far, based on my understanding of Kafka Streams, it seems that I need a stream for each link table in order to perform the aggregation (a sketch of what I mean follows question b below). A KTable would not be usable because records are updated per key. The result of the aggregation could be in a KTable, however.

2 - Then comes the problem of joining on foreign keys. It seems the only way to do that is through a GlobalKTable: link-table-topic -> link-table-stream -> link-table-GlobalKTable. This could result in a lot of disk space usage, as my tables are very large. This is a super large DB with a lot of tables, and the requirement of building several logical views on the data is part of the core of the project and can't be avoided.

a) Am I understanding it right here?

b) Is this the only way to tackle that?
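
To make point 1 concrete, here is a minimal sketch of the aggregation I have in mind, assuming String serdes everywhere and made-up topic names (`document-bioassay-link` carries the link-table records, keyed by document id):

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;

public class LinkTableAggregation {

    public static KTable<String, String> documentsPerBioAssay(StreamsBuilder builder) {
        // Link-table records: key = document id, value = bioAssay id (assumed layout).
        KStream<String, String> links =
            builder.stream("document-bioassay-link",
                           Consumed.with(Serdes.String(), Serdes.String()));

        return links
            // Re-key by the foreign key so all documents of a BioAssay group together.
            .map((documentId, bioAssayId) -> KeyValue.pair(bioAssayId, documentId))
            .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
            // Fold the document ids into a "Doc1:Doc2:Doc3" string per BioAssay.
            .aggregate(
                () -> "",
                (bioAssayId, documentId, agg) ->
                    agg.isEmpty() ? documentId : agg + ":" + documentId,
                Materialized.with(Serdes.String(), Serdes.String()));
    }
}
```

Note that this naive concatenation only handles inserts; deleted or re-assigned link rows would need tombstone handling that isn't shown here.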

EDIT1

Sounds like the only thing that exists is KStream-to-GlobalKTable, so it seems like I need to turn things upside down a little. My original DB BioAssay table needs to be turned into a stream, while my link document table needs to be turned into a stream first for aggregation, and then into a GlobalKTable for joining (a sketch of that layout follows below).

Either way, unless my streams only have one partition, this can be very expensive.
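
For concreteness, a minimal sketch of that layout, again with assumed topic names and String serdes (`documents-per-bioassay` stands for a topic fed by an aggregation like the one sketched above):

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.GlobalKTable;
import org.apache.kafka.streams.kstream.KStream;

public class BioAssayEnrichment {

    public static KStream<String, String> enrich(StreamsBuilder builder) {
        // BioAssay table as a stream: key = bioAssay id, value = row payload.
        KStream<String, String> bioAssays =
            builder.stream("bioassay-table",
                           Consumed.with(Serdes.String(), Serdes.String()));

        // Aggregated link table as a GlobalKTable: key = bioAssay id,
        // value = "Document1:Document2:Document3". A GlobalKTable is fully
        // replicated on every instance's disk, which is where the space
        // concern comes from, but it removes any co-partitioning requirement.
        GlobalKTable<String, String> documents =
            builder.globalTable("documents-per-bioassay",
                                Consumed.with(Serdes.String(), Serdes.String()));

        return bioAssays.join(
            documents,
            (bioAssayId, row) -> bioAssayId,                  // key selector into the table
            (row, docList) -> row + ", \"" + docList + "\""); // value joiner
    }
}
```

Note that this join is only triggered by arrivals on the stream side, a point the answer below elaborates on.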

I happened to work on a similar use case with Kafka Streams a few months ago, and I'm happy to share my learnings.

Using KStream-to-KTable as you suggest would kinda work, although with some caveats that might not be acceptable to you.

First, recall that a stream-to-table join is only updated by Kafka Streams when a new event is received on the stream side, not on the KTable side.

Second, assuming you're using CDC to import the DB, my understanding is that you have no guarantee on the order in which updates land in Kafka. That means that even if you enjoy transaction isolation on the DB side, such that an update or insert on the Document and BioAssay tables appears "all at once", on the Kafka side you'd receive one and then the other, in arbitrary order.

The two points above hopefully make clear why the join result on the Kafka Streams side might not reflect the DB content as you'd expect.

The solution I took was to go "under the hood" and join my streams manually using the Processor API. This allowed me to achieve table-to-table join semantics, updated whenever either side is updated. I described the core idea in this blog post:

https://svend.kelesia.com/one-to-many-kafka-streams-ktable-join.html
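
For illustration only, here is a minimal sketch of that idea (not the blog post's actual code; store names, the composite-key layout and String types are all assumptions): both sides of the join are kept in local state stores, and the link-side processor below rebuilds and re-emits the joined record whenever a link row arrives. A symmetric processor would do the same for BioAssay-side updates, which is what makes the result insensitive to arrival order.

```java
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.ProcessorContext;
import org.apache.kafka.streams.processor.api.Record;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.KeyValueStore;

// Handles the link-table side: key = "bioAssayId|documentId", value = document id.
public class LinkSideJoinProcessor implements Processor<String, String, String, String> {

    private ProcessorContext<String, String> context;
    private KeyValueStore<String, String> bioAssayStore; // bioAssayId -> row payload
    private KeyValueStore<String, String> linkStore;     // "bioAssayId|documentId" -> document id

    @Override
    public void init(ProcessorContext<String, String> context) {
        this.context = context;
        this.bioAssayStore = context.getStateStore("bioassay-store");
        this.linkStore = context.getStateStore("link-store");
    }

    @Override
    public void process(Record<String, String> record) {
        linkStore.put(record.key(), record.value());

        String bioAssayId = record.key().split("\\|")[0];
        String row = bioAssayStore.get(bioAssayId);
        if (row == null) {
            // Other side not seen yet: the BioAssay-side processor will emit
            // the join once that record arrives.
            return;
        }

        // Range scan over the composite keys to collect all documents of this
        // BioAssay (assumes ASCII ids, so byte order matches string order).
        StringBuilder docs = new StringBuilder();
        try (KeyValueIterator<String, String> it =
                 linkStore.range(bioAssayId + "|", bioAssayId + "|\uffff")) {
            while (it.hasNext()) {
                KeyValue<String, String> entry = it.next();
                if (docs.length() > 0) docs.append(':');
                docs.append(entry.value);
            }
        }
        context.forward(record.withKey(bioAssayId)
                              .withValue(row + ", \"" + docs + "\""));
    }
}
```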

Using that technique, I was able to correctly import both one-to-many and many-to-many relationships from the DB.

If your tables share the same key (i.e. the foreign key), you can use that to your advantage and stream all tables into the same topic (which can be scaled out with multiple partitions).
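
A minimal sketch of that tip, with assumed topic names: both CDC streams are re-keyed by the shared foreign key (the BioAssay id) and written to one shared topic, so related records always land in the same partition and can be processed together without a GlobalKTable.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;

public class CoPartitionByForeignKey {

    public static void build(StreamsBuilder builder) {
        // BioAssay rows are already keyed by BioAssay id.
        builder.stream("bioassay-table", Consumed.with(Serdes.String(), Serdes.String()))
               .to("bioassay-joined-input", Produced.with(Serdes.String(), Serdes.String()));

        // Link rows (key = document id, value = bioAssay id) are re-keyed
        // by the foreign key before being written to the shared topic.
        builder.stream("document-bioassay-link", Consumed.with(Serdes.String(), Serdes.String()))
               .map((docId, bioAssayId) -> KeyValue.pair(bioAssayId, docId))
               .to("bioassay-joined-input", Produced.with(Serdes.String(), Serdes.String()));
    }
}
```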
