
Kafka Connect S3 Sink Connector: partitioning large topics by id field

We've been working on adding Kafka Connect to our data platform for the last few weeks and think it would be a useful way of extracting data from Kafka into an S3 data lake. We've played around with the FieldPartitioner and the TimeBasedPartitioner and seen some pretty decent results.

We also need to partition by user id - but having tried the FieldPartitioner on a user id field, the connector is extremely slow - especially compared to partitioning by date. I understand that partitioning by an id will create a lot of output partitions and thus won't be as fast - which is fine, but it needs to be able to keep up with the producers.

So far we've tried increasing memory and heap, but we don't usually see any memory issues unless we bump flush.size to a large number. We've also tried small flush sizes, and both very small and very large rotate.schedule.interval.ms configurations. We've also looked at networking, but that seems to be fine: with other partitioners the network keeps up.
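For concreteness, this is roughly the shape of the connector config we've been testing. The keys are the standard Confluent S3 sink options; the topic, bucket, region, field name, and the flush/rotate values are placeholders for our setup, not recommendations:

name=s3-sink-user-id
connector.class=io.confluent.connect.s3.S3SinkConnector
tasks.max=4
topics=events
s3.bucket.name=our-datalake-bucket
s3.region=eu-west-1
storage.class=io.confluent.connect.storage.s3.S3Storage
format.class=io.confluent.connect.s3.format.json.JsonFormat
# Partition S3 objects by the user id field in the record value.
partitioner.class=io.confluent.connect.storage.partitioner.FieldPartitioner
partition.field.name=user_id
# The knobs we've been tuning:
flush.size=1000
rotate.schedule.interval.ms=60000
# rotate.schedule.interval.ms requires a timezone to be set.
timezone=UTC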

Before potentially wasting a lot of time on this: has anyone attempted or succeeded in partitioning by an id field, especially on larger topics, using the S3 Sink Connector? Or does anyone have suggestions in terms of configuration or setup that might be a good place to look?

I'm not very familiar with Kafka Connect, but I'll at least try to help.

I don't know whether you can configure the connector at the Kafka topic-partition level; I'm assuming there's some way to do that here.

One possible approach focuses on the step where your clients produce to the Kafka brokers. My suggestion is to implement your own Partitioner, in order to have further control over where your data goes on Kafka's side.

Here is a simplified example of such a custom partitioner. Suppose the key your producers send has the format id_name_date. The partitioner extracts the first element (id) and then chooses the desired partition.

import java.util.Map;

import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;

public class IdPartitioner implements Partitioner {

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        try {
            String pKey = (String) key;
            int id = Integer.parseInt(pKey.substring(0, pKey.indexOf("_")));

            /* getPartitionForId decides which partition number corresponds
               to the received id. You could also implement the logic directly here. */
            return getPartitionForId(id);
        } catch (Exception e) {
            // Malformed or non-string keys fall back to partition 0.
            return 0;
        }
    }

    /* Example logic, as described below: with ids between 100 and 599 on a
       5-partition topic, the first digit picks the partition
       (1xx -> 0, 2xx -> 1, ..., 5xx -> 4). */
    private int getPartitionForId(int id) {
        return (id / 100) - 1;
    }

    @Override
    public void configure(Map<String, ?> configs) {
        // No configuration needed for this example.
    }

    @Override
    public void close() {
        // Release resources here if needed.
    }
}
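To wire this in, the partitioner is registered on the producer via partitioner.class. A minimal sketch, assuming string keys and values, a local broker, and a topic named events (all placeholders):

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ProducerWithIdPartitioner {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // Register the custom partitioner from above.
        props.put(ProducerConfig.PARTITIONER_CLASS_CONFIG, IdPartitioner.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Key format id_name_date, as in the example: this lands on partition 0.
            producer.send(new ProducerRecord<>("events", "123_tempdata_20201203", "payload"));
        }
    }
}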

Even if you may need some more tuning on the Kafka Connect side, I believe this option may be helpful. Assume you have a topic with 5 partitions, and that getPartitionForId just checks the first digit of the id in order to decide the partition (for simplification purposes, the minimum id is 100 and the maximum is 599).

So if the received key is, e.g., 123_tempdata_20201203, the partition method would return 0, that is, the 1st partition.

(The image shows P1 instead of P0 because I believe the example looks more natural this way, but be aware that the 1st partition is in fact defined as partition 0. OK, to be honest, I forgot about P0 while drawing this and didn't save the template, so I had to search for an excuse, like: it looks more natural.)

[Image: simplified broker-side partitioning]

Basically this would be a pre-adjustment, or accommodation, of the data before the S3 upload.

I am aware this may not be the ideal answer, as I don't know the exact specifications of your system. My guess is that there's some way to point topic partitions directly to S3 locations.
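As a pointer in that direction: unless I'm mistaken, the sink ships with io.confluent.connect.storage.partitioner.DefaultPartitioner, which simply mirrors the Kafka partition number into the object path (topics/&lt;topic&gt;/partition=&lt;n&gt;/...). Combined with the IdPartitioner above, each id range would then land under its own partition=&lt;n&gt; prefix in the bucket. A sketch of the relevant connector line:

# Reuse the Kafka partition number chosen by IdPartitioner in the S3 path.
partitioner.class=io.confluent.connect.storage.partitioner.DefaultPartitioner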

If that doesn't work for your setup, at least I hope this gives you some further ideas. Cheers!
