
Kafka Connect S3 Sink Connector Partitioning large topics by id field

We've been working on adding Kafka Connect to our data platform for the last few weeks and think it would be a useful way of extracting data from Kafka into an S3 data lake. We've played around with the FieldPartitioner and the TimeBasedPartitioner and seen some pretty decent results.

We also need to partition by user id, but having tried the FieldPartitioner on a user id field, the connector is extremely slow, especially compared to partitioning by date. I understand that partitioning by an id will create a lot of output partitions and thus won't be as fast, which is fine, but it needs to be able to keep up with the producers.

So far we've tried increasing memory and heap, but we don't usually see any memory issues unless we bump flush.size to a large number. We've also tried small flush sizes, and very small and very large rotate.schedule.interval.ms configurations. We've also looked at networking, but that seems to be fine: with other partitioners the network keeps up without problems.
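For reference, here is roughly the kind of sink configuration we've been testing with; the connector name, topic, bucket, region, and field name below are placeholders, and flush.size / rotate.schedule.interval.ms are just the knobs we've been varying:

 # Illustrative S3 sink config - names and values are placeholders
 name=s3-sink-user-id
 connector.class=io.confluent.connect.s3.S3SinkConnector
 topics=events
 s3.bucket.name=my-datalake-bucket
 s3.region=eu-west-1
 storage.class=io.confluent.connect.s3.storage.S3Storage
 format.class=io.confluent.connect.s3.format.json.JsonFormat
 # Partitioning by the user id field - the slow case described above
 partitioner.class=io.confluent.connect.storage.partitioner.FieldPartitioner
 partition.field.name=user_id
 flush.size=1000
 rotate.schedule.interval.ms=60000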

Before potentially wasting a lot of time on this: has anyone attempted or succeeded in partitioning by an id field using the S3 Sink Connector, especially on larger topics? Or does anyone have any suggestions on configuration or setup that might be a good place to look?

I'm not very familiar with Kafka's connectors, but I will at least try to help.

I don't know whether you can configure the connector at the level of a Kafka topic's partitions; I'm assuming there's some way to do that here.

One possible approach focuses on the step where your clients produce to the Kafka brokers. My suggestion is to implement your own Partitioner, in order to have further control over where the data is sent on Kafka's side.

This is an example/simplification of such a custom partitioner. Say the key your producers send has this format: id_name_date. This custom partitioner tries to extract the first element (id) and then chooses the desired partition.

import java.util.Map;

import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;

public class IdPartitioner implements Partitioner
{
   @Override
   public int partition(String topic, Object key, byte[] kb,
                        Object v, byte[] vb, Cluster cl)
   {
       try
       {
           String pKey = (String) key;
           int id = Integer.parseInt(pKey.substring(0, pKey.indexOf("_")));

           /* getPartitionForId decides which partition number corresponds
              to the received id (a sketch of it appears later in this answer).
              You could also implement the logic directly here. */
           return getPartitionForId(id);
       }
       catch (Exception e)
       {
           // Fall back to the first partition on malformed or unexpected keys
           return 0;
       }
   }

   @Override
   public void configure(Map<String, ?> configs)
   {
       // No configuration needed for this example
   }

   @Override
   public void close()
   {
       // Maybe some cleanup work here if needed
   }
}
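
To wire this in, the custom partitioner is registered through the producer configuration. A minimal sketch, where the broker address, topic name, and string serializers are placeholder assumptions:

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ProducerSetup
{
    public static void main(String[] args)
    {
        Properties props = new Properties();
        // Placeholder broker address and serializers
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // Route records through the custom partitioner defined above
        props.put(ProducerConfig.PARTITIONER_CLASS_CONFIG, IdPartitioner.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props))
        {
            // Keys follow the id_name_date format the partitioner expects
            producer.send(new ProducerRecord<>("events", "123_tempdata_20201203", "some value"));
        }
    }
}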

Even if you may still need some more tuning on the Kafka Connect side, I believe this option may be helpful. Assume you have a topic with 5 partitions, and that getPartitionForId just checks the first digit of the id in order to decide the partition (for simplification purposes, the minimum id is 100 and the maximum id is 599).
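
Under those assumptions, a minimal sketch of getPartitionForId, added as a private method of IdPartitioner, could look like this (the mapping itself is just the simplification described above):

   private int getPartitionForId(int id)
   {
       // Ids 100-599: the leading digit (1-5) maps directly to partitions 0-4
       return (id / 100) - 1;
   }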

So if the received key is, for example, 123_tempdata_20201203, the partition method would return 0, that is, the 1st partition.

(The image shows P1 instead of P0 because I believe the example looks more natural this way, but be aware that the 1st partition is in fact defined as partition 0. OK, to be honest, I forgot about P0 while drawing it and didn't save the template, so I had to come up with an excuse, like: it looks more natural.)

[Image: simplified broker-side partitioning]

Basically, this would be a pre-adjustment, or accommodation, of the data before the S3 upload.

I'm aware this may not be the ideal answer, as I don't know the exact specifications of your system. My guess is that there may be some way to directly point topic partitions to S3 locations.

If there's no way to do so, at least I hope this gives you some further ideas. Cheers!


