
In KSQL Stream <- Table left join, partial events are not joined properly

I'm trying to enrich some event data with KSQL (5.2.3) and Kafka (2.12-2.3.0) by left joining a stream with a table.

However, part of the join result doesn't contain the enriched data I expected.

I think I found the cause: the left join is processed before the table has loaded the related earlier event.

To make the problem clear, here are a simplified KSQL query and the event data.

Events:

TimeStamp | EventType  | EventData
1         | Create     | ID:1, Name:"HELLO"
2         | Access     | ID:1, TID:2
3         | Write      | ID:1, TID:2
100       | Access     | ID:1, TID:3
110       | Write      | ID:1, TID:3

Stream&Table:

CREATE STREAM SUBJECT_CREATE WITH (TIMESTAMP='TimeStamp') AS SELECT TimeStamp, ID, Name FROM EVENT_STREAM WHERE EventType='Create' PARTITION BY ID;
CREATE TABLE SUBJECT_CREATE_TABLE (*) WITH (KAFKA_TOPIC='SUBJECT_CREATE', KEY='ID') ;

CREATE STREAM SUBJECT_ACCESS WITH (TIMESTAMP='TimeStamp') AS SELECT TimeStamp, ID, TID FROM EVENT_STREAM WHERE EventType='Access' PARTITION BY ID;
CREATE STREAM SUBJECT_CR_AC_JOIN WITH(TIMESTAMP='TimeStamp') AS SELECT N.TimeStamp AS TimeStamp, N.ID AS ID, N.TID AS TID, P.Name AS Name FROM SUBJECT_ACCESS N LEFT JOIN SUBJECT_CREATE_TABLE P ON N.ID = P.ID PARTITION BY ID;

Result of SUBJECT_CR_AC_JOIN Stream:

TimeStamp | ID | TID | Name
2         | 1  |  2  | null   ==> Expected "HELLO"
100       | 1  |  3  | "HELLO"

The second row contains 'Name', but the first doesn't.

Is it possible to keep the stream and the table in sync in KSQL?

Thank you.

ksqlDB will attempt to process your data ordered by its ROWTIME. So if your stream data has an earlier timestamp than the table data, then it is correctly not being joined to the table data. After all, the table data did not exist at the time the stream events happened.

This is by design.
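To make the ordering rule concrete, here is a minimal Python sketch that simulates a timestamp-ordered stream-table left join. It is not ksqlDB code: `Record` and `left_join_by_event_time` are hypothetical names, and the tie-break (table updates first when timestamps are equal) is an assumption of this sketch rather than documented ksqlDB behaviour.

```python
from typing import NamedTuple


class Record(NamedTuple):
    source: str      # "table" or "stream"
    timestamp: int   # event time (ROWTIME in ksqlDB terms)
    key: int
    value: dict


def left_join_by_event_time(table_records, stream_records):
    """Process both inputs in timestamp order and left-join every stream
    record against the table state as it exists at that point in time."""
    table_state = {}
    output = []
    # Assumption of this sketch: on equal timestamps, table updates go first.
    merged = sorted(
        table_records + stream_records,
        key=lambda r: (r.timestamp, r.source != "table"),
    )
    for rec in merged:
        if rec.source == "table":
            table_state[rec.key] = rec.value      # upsert the table row
        else:
            match = table_state.get(rec.key)      # look up current table state
            name = match["Name"] if match else None
            output.append((rec.timestamp, rec.key, rec.value["TID"], name))
    return output


# The two 'Access' events from the question.
stream = [
    Record("stream", 2, 1, {"TID": 2}),
    Record("stream", 100, 1, {"TID": 3}),
]

# Table row timestamped before both stream events: both rows are enriched.
on_time = left_join_by_event_time([Record("table", 1, 1, {"Name": "HELLO"})], stream)

# Table row effectively timestamped between the two stream events:
# the first stream event finds no match.
late = left_join_by_event_time([Record("table", 50, 1, {"Name": "HELLO"})], stream)
```

In this sketch, `on_time` enriches both rows, while `late` reproduces the result from the question: the first row has no name and the second has "HELLO".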

You can work around this if you can produce the table data to Kafka well before the stream data. After max.task.idle.ms, ksqlDB will start processing the table data if there is no stream data, meaning the table will be populated. You can then send in your stream data.

Alternatively, you can ensure you produce your stream data with later timestamps than the table data. This would be the most correct solution.

You can also use WITH(TIMESTAMP='something') to extract ROWTIME from the payload of your Kafka message if the timestamp used to produce the message is wrong.
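As a rough illustration of why the timestamp source matters, the following Python sketch orders the same two records first by the timestamp the producer attached and then by a field in the payload. The `event_time` helper, the record layout, and the `kafka_timestamp` values are all hypothetical; only the `TimeStamp` field name comes from the question.

```python
def event_time(record, timestamp_field=None):
    """Return the payload field when one is configured, otherwise the
    timestamp attached to the record when it was produced."""
    if timestamp_field is not None:
        return record["value"][timestamp_field]
    return record["kafka_timestamp"]


# Hypothetical records: the 'Create' event was produced after the 'Access' event,
# although its payload timestamp is earlier.
records = [
    {"kafka_timestamp": 1700000000500, "value": {"TimeStamp": 1, "EventType": "Create"}},
    {"kafka_timestamp": 1700000000100, "value": {"TimeStamp": 2, "EventType": "Access"}},
]

# Ordered by producer timestamp: 'Access' comes before 'Create'.
by_producer = [r["value"]["EventType"] for r in sorted(records, key=event_time)]

# Ordered by the payload field: 'Create' comes before 'Access'.
by_payload = [
    r["value"]["EventType"]
    for r in sorted(records, key=lambda r: event_time(r, "TimeStamp"))
]
```

With the payload field as the time source, the 'Create' row is ordered ahead of the 'Access' row, which is the order the join needs.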
