Kafka stream join
I have two Kafka topics: recommendations and clicks. The first topic holds recommendations objects keyed by a unique ID (called recommendationsId). Each product has a URL which the user can click.
The clicks topic receives the messages generated by clicks on the product URLs recommended to the user. It has been set up so that these click messages are also keyed by the recommendationId.
Note that:
- The relationship between recommendations and clicks is one-to-many. A recommendation may lead to multiple clicks, but a click is always associated with a single recommendation.
- Each click object has a corresponding recommendations object.
- A click object has a timestamp later than that of its recommendations object.
- The gap between a recommendation and the corresponding click(s) can be anywhere from a few seconds to a few days (say, 7 days at most).
My goal is to join these two topics using a Kafka Streams join. What I am not clear about is whether I should use a KStream x KStream join or a KStream x KTable join.
I implemented the KStream x KTable join by joining the clicks stream with the recommendations table. However, I do not see any joined click-recommendation pair if the recommendation was generated before the joiner was started and the click arrives after the joiner started.
Am I using the right join? Should I be using a KStream x KStream join? If so, in order to join a click with a recommendation made at most 7 days earlier, should I set the window size to 7 days? Do I also need to set the "retention" period in this case?
My code to perform the KStream x KTable join is as follows. Note that I have defined the classes Recommendations and Click and their corresponding serdes. The click messages are just plain Strings (URLs). Each URL String is joined with a Recommendations object to create a Click object, which is emitted to the jointTopic.
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KStreamBuilder;
import org.apache.kafka.streams.kstream.KTable;

// Recommendations, Click and JoinSerdes are my own classes (not shown here).
public static void main(String[] args) {
    if (args.length != 4) {
        throw new RuntimeException("Expected 4 params: bootstraplist clickTopic recsTopic jointTopic");
    }
    final String bootstrapList = args[0];
    final String clicksTopic = args[1];
    final String recsTopic = args[2];
    final String jointTopic = args[3];

    Properties config = new Properties();
    config.put(StreamsConfig.APPLICATION_ID_CONFIG, "my_joiner_id");
    config.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapList);
    config.put(StreamsConfig.KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
    config.put(StreamsConfig.VALUE_SERDE_CLASS_CONFIG, JoinSerdes.CLICK_SERDE.getClass().getName());

    KStreamBuilder builder = new KStreamBuilder();
    // load clicks as KStream (key: recommendation id, value: clicked URL)
    KStream<String, String> clicksStream = builder.stream(Serdes.String(), Serdes.String(), clicksTopic);
    // load recommendations as KTable (latest Recommendations per id)
    KTable<String, Recommendations> recsTable = builder.table(Serdes.String(), JoinSerdes.RECS_SERDE, recsTopic);
    // join the two; with a left join, recs is null when the table has no entry for the key
    KStream<String, Click> join = clicksStream.leftJoin(recsTable, (click, recs) -> new Click(click, recs));
    // emit the join to the jointTopic
    join.to(Serdes.String(), JoinSerdes.CLICK_SERDE, jointTopic);

    // let the action begin
    KafkaStreams streams = new KafkaStreams(builder, config);
    streams.start();
}
This works fine as long as both the recommendations and the clicks are generated after the joiner (the above program) has started. If, however, a click arrives for which the recommendation was generated before the joiner was started, I don't see any join happening. How do I fix this?
If the solution is to use a KStream x KStream join, then please help me understand what window size and what retention period I should select.
Your overall observation is correct. Conceptually, you can get the correct result both ways.
If you use a stream-table join, you have two disadvantages (though this might be revisited and improved in a future release of Kafka):
- The first is the one you already described: a click that is processed before its recommendation is available in the table gets no match. One way to handle this is that, if the join result is null, you write the click event back to the input topic (i.e., you get retry logic). Of course, consecutive clicks for a single recommendation might then get out of order, and you might need to account for this in your application code.
- The second disadvantage of using a KTable is that it grows unbounded over time, as you keep adding more and more unique recommendations to it. Thus, you need to implement some "expiration logic" by sending tombstone records of the form <recommendationsId, null> to the recommendation topic to delete old recommendations you no longer care about.
If you use a stream-stream join, and a click can happen 7 days after a recommendation, your window size must be 7 days; otherwise, the click would not join with the recommendation.
For the stream-stream join, the retention time answer is a little different. It must be at least 7 days, as the window size is 7 days. Otherwise, you would delete records of your "running window". You can also set the retention period longer, to be able to process "late data". Assume a user clicks at the end of the window timeframe (5 minutes before the 7-day time span of the recommendation ends), but the click is only reported to your application 1 hour later. If your retention period is 7 days, the same as your window size, this late-arriving record cannot be processed anymore (as the recommendation would already have been deleted). If you set a larger retention period of, e.g., 8 days, you can still process late records. Which retention time to use depends on your application's semantic needs.
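The late-click example above can be checked with a little arithmetic. The following is only a simplified model in plain Java, not Kafka Streams code: the class JoinWindowCheck and its methods are hypothetical names, timestamps are milliseconds counted from the recommendation, and retention is approximated as "the recommendation is dropped once more than the retention period has passed when the click is processed".

```java
import java.time.Duration;

public class JoinWindowCheck {
    /** True if the click's event time lies within the join window after the recommendation. */
    public static boolean withinWindow(long recTs, long clickTs, Duration window) {
        long gap = clickTs - recTs; // clicks always come after their recommendation
        return gap >= 0 && gap <= window.toMillis();
    }

    /** True if the recommendation is still retained when the click is actually processed. */
    public static boolean stillRetained(long recTs, long processedAt, Duration retention) {
        return processedAt - recTs <= retention.toMillis();
    }

    /** A click joins only if it is inside the window AND the recommendation is still retained. */
    public static boolean joins(long recTs, long clickTs, long processedAt,
                                Duration window, Duration retention) {
        return withinWindow(recTs, clickTs, window) && stillRetained(recTs, processedAt, retention);
    }

    public static void main(String[] args) {
        long recTs = 0L;
        // The user clicks 5 minutes before the 7-day span ends ...
        long clickTs = Duration.ofDays(7).minusMinutes(5).toMillis();
        // ... but the click reaches the application 1 hour later.
        long processedAt = clickTs + Duration.ofHours(1).toMillis();
        Duration window = Duration.ofDays(7);

        System.out.println(joins(recTs, clickTs, processedAt, window, Duration.ofDays(7))); // false
        System.out.println(joins(recTs, clickTs, processedAt, window, Duration.ofDays(8))); // true
    }
}
```

In Kafka Streams releases that provide JoinWindows.of(...).until(...), the window argument of this model roughly corresponds to of(...) and the retention argument to until(...); check this against the documentation of the version you run.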
Summary: From an implementation point of view, using a stream-stream join is simpler than using a stream-table join. However, the stream-table join is expected to need less memory/disk, and the savings could be large depending on your click stream data rate.
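If you do stay with the stream-table join, the two workarounds described earlier (retrying a click whose join result is null, and expiring old recommendations with tombstones) can be modelled without Kafka roughly as follows. This is a sketch under assumptions: TableJoinModel is a hypothetical class, a HashMap stands in for the KTable state, and an in-memory queue stands in for "writing the click back to the input topic".

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TableJoinModel {
    // Latest recommendation per recommendationsId; stands in for the KTable state.
    private final Map<String, String> table = new HashMap<>();
    // Clicks whose join result was null; stands in for re-publishing to the input topic.
    private final Deque<String[]> retryQueue = new ArrayDeque<>();
    private final List<String> joined = new ArrayList<>();

    /** Table update: a null value is a tombstone and deletes the key. */
    public void onRecommendation(String id, String recs) {
        if (recs == null) {
            table.remove(id);
        } else {
            table.put(id, recs);
        }
    }

    /** Stream record: join against the table, or queue the click for a later retry. */
    public void onClick(String id, String url) {
        String recs = table.get(id);
        if (recs == null) {
            retryQueue.add(new String[] {id, url});
        } else {
            joined.add(url + " -> " + recs);
        }
    }

    /** Re-process each click queued so far exactly once; misses are queued again. */
    public void retryPending() {
        int n = retryQueue.size();
        for (int i = 0; i < n; i++) {
            String[] click = retryQueue.poll();
            onClick(click[0], click[1]);
        }
    }

    public List<String> joined() { return joined; }
    public int pending() { return retryQueue.size(); }
    public int tableSize() { return table.size(); }

    public static void main(String[] args) {
        TableJoinModel m = new TableJoinModel();
        m.onClick("r1", "http://example.com/p1"); // processed before its recommendation
        m.onRecommendation("r1", "recs-1");
        m.retryPending();                          // the retried click now joins
        m.onRecommendation("r1", null);            // tombstone expires the old entry
        System.out.println(m.joined() + " pending=" + m.pending() + " table=" + m.tableSize());
    }
}
```

The model also shows why ordering can suffer: a retried click is joined after any later click for the same recommendation that found its match immediately.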