
Kafka stream join

I have two Kafka topics: recommendations and clicks. The first topic has recommendations objects keyed by a unique ID (called recommendationsId). Each recommended product has a URL that the user can click.

The clicks topic receives the messages generated by clicks on the product URLs recommended to the user. It is set up so that these click messages are also keyed by the recommendationId.

Note that:

  1. The relationship between recommendations and clicks is one-to-many. A recommendation may lead to multiple clicks, but a click is always associated with a single recommendation.

  2. Each click object has a corresponding recommendations object.

  3. A click object has a later timestamp than its recommendations object.

  4. The gap between a recommendation and the corresponding click(s) can be anything from a few seconds to a few days (say, 7 days at the most).

My goal is to join these two topics using a Kafka Streams join. What I am not clear about is whether I should use a KStream x KStream join or a KStream x KTable join.

I implemented the KStream x KTable join by joining the clicks stream with the recommendations table. However, I do not see any joined click-recommendation pair if the recommendation was generated before the joiner was started and the click arrives after the joiner was started.

Am I using the right join? Should I be using a KStream x KStream join? If so, in order to be able to join a click with a recommendation made at most 7 days in the past, should I set the window size to 7 days? Do I also need to set the "retention" period in this case?

My code to perform the KStream x KTable join is as follows. Note that I have defined the classes Recommendations and Click and their corresponding serdes. The click messages are plain Strings (URLs). This URL String is joined with the Recommendations object to create a Click object, which is emitted to the jointTopic.

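import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KStreamBuilder;
import org.apache.kafka.streams.kstream.KTable;

// Recommendations, Click, and JoinSerdes are the user-defined classes mentioned above.
// The enclosing class name is a placeholder; the original post showed only main().
public class Joiner {
  public static void main(String[] args) {
    if (args.length != 4) {
      throw new RuntimeException("Expected 4 params: bootstraplist clickTopic recsTopic jointTopic");
    }

    final String bootstrapList = args[0];
    final String clicksTopic = args[1];
    final String recsTopic = args[2];
    final String jointTopic = args[3];

    Properties config = new Properties();
    config.put(StreamsConfig.APPLICATION_ID_CONFIG, "my_joiner_id");
    config.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapList);
    config.put(StreamsConfig.KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
    config.put(StreamsConfig.VALUE_SERDE_CLASS_CONFIG, JoinSerdes.CLICK_SERDE.getClass().getName());

    KStreamBuilder builder = new KStreamBuilder();

    // load clicks as a KStream (key: recommendation id, value: clicked URL)
    KStream<String, String> clicksStream = builder.stream(Serdes.String(), Serdes.String(), clicksTopic);

    // load recommendations as a KTable (key: recommendation id)
    KTable<String, Recommendations> recsTable = builder.table(Serdes.String(), JoinSerdes.RECS_SERDE, recsTopic);

    // left-join each click with the recommendation that has the same key;
    // recs is null when the table has no entry for that key yet
    KStream<String, Click> join = clicksStream.leftJoin(recsTable, (click, recs) -> new Click(click, recs));

    // emit the join result to the jointTopic
    join.to(Serdes.String(), JoinSerdes.CLICK_SERDE, jointTopic);

    // let the action begin
    KafkaStreams streams = new KafkaStreams(builder, config);
    streams.start();
  }
}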

This works fine as long as both the recommendations and the clicks are generated after the joiner (the above program) is started. If, however, a click arrives for which the recommendation was generated before the joiner was started, I don't see any join happening. How do I fix this?

If the solution is to use a KStream x KStream join, then please help me understand what window size and what retention period I should select.

Your overall observation is correct. Conceptually, you can get the correct result both ways. If you use a stream-table join, you have two disadvantages (this might be revisited and improved in a future release of Kafka, though):

  • You mentioned already that if a click gets processed before the corresponding recommendation, the (inner) join will fail. However, as you know that there will be a recommendation, you could use a left join instead of an inner join, check the join result, and write the click event back to the input topic if the recommendation was null (i.e., you get a retry logic). Of course, consecutive clicks for a single recommendation might get out of order, and you might need to account for this in your application code.
  • A second disadvantage of the KTable is that it will grow without bound over time, as you will add more and more unique recommendations to it. Thus, you will need to implement some "expiration logic" by sending tombstone records of the form <recommendationsId, null> to the recommendations topic to delete old recommendations you no longer care about.
  • The advantage of this approach is that you will need less memory/disk space in total, compared to a stream-stream join, because you only need to buffer all recommendations in your application (but no clicks).
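The retry and expiration ideas from the bullets above can be modeled in a few lines. The following is only a plain-Java sketch and not Kafka Streams code: the class `LeftJoinRetrySketch`, its method names, and the in-memory map and queue (standing in for the KTable and for writing clicks back to the input topic) are all assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

/** Plain-Java model of a stream-table left join with retry; hypothetical, not Kafka Streams code. */
final class LeftJoinRetrySketch {
    // Stand-in for the KTable: latest recommendation per recommendation id.
    private final Map<String, String> recsTable = new HashMap<>();
    // Stand-in for writing unmatched clicks back to the clicks topic.
    private final Queue<String[]> retryQueue = new ArrayDeque<>();
    // Stand-in for the joint output topic.
    private final List<String> joined = new ArrayList<>();

    /** Applies a table record; a null value acts as a tombstone and deletes the key. */
    void onRecommendation(String recId, String recommendation) {
        if (recommendation == null) {
            recsTable.remove(recId);
        } else {
            recsTable.put(recId, recommendation);
        }
    }

    /** Left join: emit on a match, otherwise queue the click for a later retry. */
    void onClick(String recId, String url) {
        String rec = recsTable.get(recId);
        if (rec == null) {
            retryQueue.add(new String[] {recId, url});
        } else {
            joined.add(url + " -> " + rec);
        }
    }

    /** Re-processes each click that is queued right now, once; still-unmatched clicks are re-queued. */
    void retryPending() {
        int pending = retryQueue.size();
        for (int i = 0; i < pending; i++) {
            String[] click = retryQueue.poll();
            onClick(click[0], click[1]);
        }
    }

    List<String> joined() {
        return joined;
    }

    int pendingRetries() {
        return retryQueue.size();
    }

    public static void main(String[] args) {
        LeftJoinRetrySketch sketch = new LeftJoinRetrySketch();
        sketch.onClick("rec-1", "http://example.com/p1");      // click is processed first
        sketch.onRecommendation("rec-1", "recommendation 1");  // recommendation shows up later
        sketch.retryPending();                                  // the retried click now finds a match
        System.out.println(sketch.joined());
    }
}
```

In a real application the retry queue would be the clicks topic itself, so retried clicks may be interleaved with newer clicks for the same recommendation, which is the ordering caveat mentioned above.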

If you use a stream-stream join, and a click can happen up to 7 days after a recommendation, your window size must be 7 days. Otherwise, the click would not join with the recommendation.

  • The disadvantage of this approach is that you will need much more memory/disk, as you will buffer all clicks and all recommendations of the last 7 days in your application.
  • The advantage is that the order of processing (i.e., recommendation vs. click) no longer matters (i.e., you don't need to implement the retry strategy described above).
  • Furthermore, old recommendations will expire automatically, and thus you don't need to implement special "expiration logic".
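To make the window-size requirement concrete, here is a small plain-Java sketch of the joinability test that a windowed join applies to two record timestamps. It is a simplified model, not Kafka Streams code: the class and method names are hypothetical, and the symmetric, inclusive comparison is an assumption about how the window is defined. In the Kafka Streams DSL the window itself would be described with `JoinWindows`.

```java
import java.time.Duration;

/** Simplified model of a windowed stream-stream join condition; hypothetical names. */
final class JoinWindowSketch {

    /**
     * Two records with the same key are joinable when their timestamps differ
     * by at most the window size (assumed symmetric and inclusive here).
     */
    static boolean withinWindow(long recTsMs, long clickTsMs, Duration window) {
        return Math.abs(clickTsMs - recTsMs) <= window.toMillis();
    }

    public static void main(String[] args) {
        Duration sevenDays = Duration.ofDays(7);
        long recTs = 0L;

        long clickAfterThreeSeconds = Duration.ofSeconds(3).toMillis();
        long clickAfterSixDays = Duration.ofDays(6).toMillis();
        long clickAfterEightDays = Duration.ofDays(8).toMillis();

        System.out.println(withinWindow(recTs, clickAfterThreeSeconds, sevenDays));
        System.out.println(withinWindow(recTs, clickAfterSixDays, sevenDays));
        System.out.println(withinWindow(recTs, clickAfterEightDays, sevenDays));
    }
}
```

With a window shorter than the largest expected gap, for example 1 day, the six-day click in this model would no longer match its recommendation.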

For a stream-stream join, the retention time answer is a little different. It must be at least 7 days, as the window size is 7 days. Otherwise, you would delete records of your "running window". You can also set the retention period longer, to be able to process "late data". Assume a user clicks at the end of the window time frame (5 minutes before the 7-day time span of the recommendation ends), but the click is only reported to your application 1 hour later. If your retention period is 7 days, the same as your window size, this late-arriving record cannot be processed anymore (as the recommendation would have been deleted already). If you set a larger retention period of, e.g., 8 days, you can still process late records. Which retention time you want to use depends on your application and its semantic needs.
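The late-click scenario can be checked with simple arithmetic. The sketch below is plain Java under a deliberately simplified assumption: a buffered recommendation is dropped as soon as the current stream time exceeds its timestamp plus the retention period. Real state-store cleanup may be coarser than this, and all names here are hypothetical.

```java
import java.time.Duration;

/** Simplified model of window size versus retention for a late-arriving click; hypothetical names. */
final class RetentionSketch {

    /** Assumption: a record stays buffered until stream time passes its timestamp plus retention. */
    static boolean stillBuffered(long recordTsMs, long streamTimeMs, Duration retention) {
        return streamTimeMs - recordTsMs <= retention.toMillis();
    }

    /** A late click joins only if it is inside the window AND the recommendation is still buffered. */
    static boolean lateClickJoins(long recTsMs, long clickTsMs, long arrivalStreamTimeMs,
                                  Duration window, Duration retention) {
        boolean inWindow = Math.abs(clickTsMs - recTsMs) <= window.toMillis();
        return inWindow && stillBuffered(recTsMs, arrivalStreamTimeMs, retention);
    }

    public static void main(String[] args) {
        Duration window = Duration.ofDays(7);
        long recTs = 0L;
        // The user clicks 5 minutes before the 7-day span ends...
        long clickTs = window.minusMinutes(5).toMillis();
        // ...but the click reaches the application 1 hour later.
        long arrival = clickTs + Duration.ofHours(1).toMillis();

        System.out.println(lateClickJoins(recTs, clickTs, arrival, window, Duration.ofDays(7)));
        System.out.println(lateClickJoins(recTs, clickTs, arrival, window, Duration.ofDays(8)));
    }
}
```

In this model the click arrives at 7 days plus 55 minutes of stream time, so a 7-day retention has already dropped the recommendation while an 8-day retention still holds it.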

Summary: From an implementation point of view, using a stream-stream join is simpler than using a stream-table join. However, a stream-table join is expected to save memory/disk, and the savings could be large depending on your click stream data rate.

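Notice: The technical posts on this site are licensed under CC BY-SA 4.0. If you repost them, please cite this site's URL or the original source. For any questions, contact yoyou2525@163.com.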

 