Joining tables with Kafka Streams / KSQL?

I'm importing a DB that contains some link tables representing both many-to-many and one-to-many relationships.

Let's focus for now on the one-to-many relationship. E.g., a BioAssay can have many Documents, but a Document can only have one BioAssay.

Hence I have a BioAssay table [BioAssay, ..., ..., ...] and a link table [Document, BioAssay].

Ultimately I need to join those two into a full BioAssay with all its Documents, e.g. [BioAssayxyz, ...., "Document1:Document2:Document3"].

I wonder if anyone here could give me a sense of what needs to happen with Kafka Streams?

1 - So far, based on my understanding of Kafka Streams, it seems that I need a KStream for each link table in order to perform the aggregation. A KTable would not be usable for the link table itself, because records are updated per key, so each new [Document, BioAssay] link would overwrite the previous one instead of accumulating. The result of the aggregation could be in a KTable, however (see the sketch after these questions).

2 - Then comes the problem of joining on foreign keys. It seems the only way to do that is through a GlobalKTable: link-table-topic -> link-table-stream -> link-table-GlobalKTable. This could result in a lot of disk space usage, since a GlobalKTable is replicated in full to every application instance and my tables are very large. This is a super large DB with a lot of tables, and the requirement of building several logical views on the data is part of the core of the project and can't be avoided.

a) Am I understanding this right?

b) Is this the only way to tackle it?
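
For concreteness, here is a rough sketch of the aggregation I have in mind for point 1 (Java DSL; the topic names, the String serdes, and the assumption that the link-table topic is keyed by Document id with the BioAssay id as value are all placeholders):

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;

public class LinkTableAggregation {

    // Collapses the [Document, BioAssay] link table into one record per
    // BioAssay id, e.g. "Document1:Document2:Document3".
    public static Topology build() {
        StreamsBuilder builder = new StreamsBuilder();

        // Placeholder topic: key = Document id, value = BioAssay id (the foreign key)
        KStream<String, String> links = builder.stream(
                "link-table-topic", Consumed.with(Serdes.String(), Serdes.String()));

        KTable<String, String> docsPerBioAssay = links
                // re-key by the foreign key so all Documents of one BioAssay
                // are aggregated together
                .map((documentId, bioAssayId) -> KeyValue.pair(bioAssayId, documentId))
                .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
                // naive concatenation: does not handle deletes or duplicates
                .aggregate(
                        () -> "",
                        (bioAssayId, documentId, agg) ->
                                agg.isEmpty() ? documentId : agg + ":" + documentId,
                        Materialized.with(Serdes.String(), Serdes.String()));

        // write the aggregate out so it can be re-read as a GlobalKTable later
        docsPerBioAssay.toStream()
                .to("bioassay-documents-topic", Produced.with(Serdes.String(), Serdes.String()));

        return builder.build();
    }
}
```

The map() re-keying forces a repartition, so the aggregation itself can scale across partitions; it's the GlobalKTable step afterwards that worries me.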

EDIT1

Sounds like the only thing that exists is a KStream-to-GlobalKTable join, so it seems I need to turn things upside down a little. My original DB BioAssay table needs to be turned into a stream, while my Document link table needs to be turned into a stream first for aggregation, and then into a GlobalKTable for joining (sketched below).

Either way, unless my streams only have one partition, this can be very expensive.
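
Concretely, the upside-down wiring I have in mind would look something like this (placeholder topic names again; it assumes the aggregation above has written its result to a bioassay-documents-topic, keyed by BioAssay id):

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.GlobalKTable;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;

public class BioAssayEnrichment {

    // Joins the BioAssay change stream against the pre-aggregated
    // document lists held in a GlobalKTable.
    public static Topology build() {
        StreamsBuilder builder = new StreamsBuilder();

        // BioAssay rows, keyed by BioAssay id (placeholder topic name)
        KStream<String, String> bioAssays = builder.stream(
                "bioassay-topic", Consumed.with(Serdes.String(), Serdes.String()));

        // Output of the aggregation step, keyed by BioAssay id. A GlobalKTable
        // is replicated in full to every application instance, which is where
        // the disk-space cost comes from.
        GlobalKTable<String, String> docsPerBioAssay = builder.globalTable(
                "bioassay-documents-topic",
                Consumed.with(Serdes.String(), Serdes.String()));

        bioAssays
                .join(docsPerBioAssay,
                        // map each stream record to its lookup key in the global table
                        (bioAssayId, bioAssay) -> bioAssayId,
                        // merge the two sides into the final enriched record
                        (bioAssay, documents) -> bioAssay + "|" + documents)
                .to("bioassay-with-documents-topic",
                        Produced.with(Serdes.String(), Serdes.String()));

        return builder.build();
    }
}
```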

I happened to work on a similar use case with Kafka Streams a few months ago, and I'm happy to share my learnings.

Using a KStream-to-KTable join as you suggest would kind of work, although with some caveats that might not be acceptable to you.

First, recall that a stream-to-table join is only updated by Kafka Streams when a new event is received on the stream side, not on the KTable side.

Second, assuming you're using CDC to import the DB, my understanding is that you have no guarantee on the order in which updates land on Kafka. That means that even if transaction isolation on the DB side makes an update or insert on the Document and BioAssay tables appear "all at once", on the Kafka side you'd receive one and then the other, in arbitrary order.

The two points above hopefully make it clear why the join result on the Kafka Streams side might not reflect the DB content as you'd expect.

The solution I took was to go "under the hood" and join my streams manually using the Processor API. This allowed me to achieve table-to-table join semantics, updated whenever either side is updated. I described the core idea in this blog post:

https://svend.kelesia.com/one-to-many-kafka-streams-ktable-join.html

Using that technique, I was able to correctly import both one-to-many and many-to-many relationships from the DB.
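
To give a feel for the shape of it, here is a heavily simplified sketch of one side of such a manual join, not the actual recipe from the post: store names are placeholders, and the mirror-image processor for the Document side plus the state-store registration on the topology are omitted.

```java
import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.ProcessorContext;
import org.apache.kafka.streams.processor.api.Record;
import org.apache.kafka.streams.state.KeyValueStore;

// One side of a hand-rolled table-to-table join, with both inputs keyed by
// BioAssay id. It persists its own updates and, on each one, looks up the
// other side's store and forwards the joined record. A mirror-image processor
// does the same for the Document side, so the join output is refreshed
// whenever *either* side changes.
public class BioAssaySideJoiner implements Processor<String, String, String, String> {

    private ProcessorContext<String, String> context;
    private KeyValueStore<String, String> bioAssayStore;  // this side's state
    private KeyValueStore<String, String> documentStore;  // other side's state

    @Override
    public void init(ProcessorContext<String, String> context) {
        this.context = context;
        // placeholder store names; both stores must be created on the
        // topology and attached to both processors
        this.bioAssayStore = context.getStateStore("bioassay-store");
        this.documentStore = context.getStateStore("documents-store");
    }

    @Override
    public void process(Record<String, String> record) {
        // remember the latest BioAssay row for this key
        bioAssayStore.put(record.key(), record.value());

        // if the Document side has already arrived, (re-)emit the joined
        // record; otherwise stay silent until the other side triggers it
        String documents = documentStore.get(record.key());
        if (documents != null) {
            context.forward(record.withValue(record.value() + "|" + documents));
        }
    }
}
```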

If your tables share the same key (i.e., the foreign key), you can use that to your advantage and stream all the tables into the same topic (which can be scaled with several partitions).
