
Spring Kafka - Consume last N messages from partition(s) of any topic

I'm trying to read a requested number of Kafka messages. For non-transactional messages we would seek to endOffset - N for the M partitions, start polling, and collect messages while the current offset is less than the end offset for each partition. For idempotent/transactional messages we have to account for transaction markers and duplicate messages, meaning offsets will not be contiguous; in that case seeking to endOffset - N will not return N messages, and we would need to go back and seek further until we have N messages for each partition or the beginning offset is reached.

As there are multiple partitions, I would need to keep track of all the offsets read so I can stop when all of them are done. There are two steps: the first step calculates the start offset (end offset minus the requested number of messages) and the end offset (the offsets are not contiguous, there are gaps) and seeks the partition so consumption starts from the start offset. The second step polls the messages and counts them for each partition; if we don't meet the requested number of messages, we repeat the first and second steps until the count is met for each partition.
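
Roughly, this is the loop I have in mind (a sketch only; consumer is a manually assigned KafkaConsumer<String, String>, and topic, partitions and count come from the user input):

// Sketch: seek back from the end and poll until each partition has `count`
// records or its beginning offset is reached. Variable names are illustrative.
List<TopicPartition> tps = partitions.stream()
        .map(p -> new TopicPartition(topic, p))
        .collect(Collectors.toList());
consumer.assign(tps);

Map<TopicPartition, Long> endOffsets = consumer.endOffsets(tps);
Map<TopicPartition, Long> beginningOffsets = consumer.beginningOffsets(tps);
Map<TopicPartition, Long> startOffsets = new HashMap<>();
Map<TopicPartition, List<ConsumerRecord<String, String>>> collected = new HashMap<>();

// Step 1: initial start offset = max(beginning, end - count), then seek there
for (TopicPartition tp : tps) {
    long start = Math.max(beginningOffsets.get(tp), endOffsets.get(tp) - count);
    startOffsets.put(tp, start);
    collected.put(tp, new ArrayList<>());
    consumer.seek(tp, start);
}

// Step 2: poll and count; if a partition comes up short (gaps from transaction
// markers), seek back further and repeat until count is met or beginning reached
boolean allDone = false;
while (!allDone) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
    for (ConsumerRecord<String, String> r : records) {
        collected.get(new TopicPartition(r.topic(), r.partition())).add(r);
    }
    allDone = true;
    for (TopicPartition tp : tps) {
        boolean reachedEnd = consumer.position(tp) >= endOffsets.get(tp);
        boolean enough = collected.get(tp).size() >= count;
        boolean atBeginning = startOffsets.get(tp) <= beginningOffsets.get(tp);
        if (!reachedEnd) {
            allDone = false;                 // keep polling toward the end offset
        } else if (!enough && !atBeginning) {
            long missing = count - collected.get(tp).size();
            long newStart = Math.max(beginningOffsets.get(tp), startOffsets.get(tp) - missing);
            startOffsets.put(tp, newStart);
            collected.get(tp).clear();       // simplest option: re-read from the new start
            consumer.seek(tp, newStart);
            allDone = false;
        }
        // else: this partition is done (enough records, or beginning reached)
    }
}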

Conditions

- The initial poll may not return any records, so continue polling.
- Stop polling when the end offset has been reached for each partition, or when a poll returns no results.
- Check each partition for whether the number of messages read equals the number requested. If yes, mark it as complete; if not, mark it to continue and repeat the steps.
- Account for gaps in the messages.
- This should work for both transactional and non-transactional producers.

Question:

How would I go about keeping track of all the messages that have been read for each partition and breaking out of the loop? The messages in each partition come in order, if that is helpful.

Does Spring Kafka support such a use case? More details can be found here.

Update: I'm asking about reading the last N messages in each partition. The partitions and the number of messages are user input. I would like to keep all the offset management in memory. In essence we are trying to read the messages in LIFO order. This makes it tricky, as Kafka allows you to read forward, not backward.

I don't quite understand why there is such a need. Kafka itself manages the case when there is nothing in the queue, and if messages jump from state to state, one can have separate queues/topics. However, here's how one can do it.

When we consume messages from a partition using something like -

ConsumerIterator<byte[], byte[]> it = something; // initialize consumer
while (it.hasNext()) {
  MessageAndMetadata<byte[], byte[]> messageAndMetadata = it.next();
  String kafkaMessage = new String(messageAndMetadata.message());
  int partition = messageAndMetadata.partition();
  long offset = messageAndMetadata.offset();
  boolean processed = false;
  do {
    long maxOffset = something; // fetch from db
    // if offset < maxOffset, process the message, commit manually and set processed = true
    // else busy wait or do something more useful
  } while (!processed);
}

We get information about the offset, the partition number, and the message itself. You can choose to do anything with this information.

For your use case, you might also decide to persist the consumed offsets to a database so that, the next time, the offsets can be adjusted. I would also recommend a shutdown hook for cleanup and a final save of the processed offsets to the DB.
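
For example, the shutdown hook could be as simple as the sketch below; processedOffsets and saveOffsetsToDb are placeholders for whatever in-memory tracking and persistence you already have:

// Sketch: flush the in-memory offset tracking to the database on shutdown.
// `processedOffsets` and `saveOffsetsToDb` are hypothetical placeholders.
final Map<Integer, Long> processedOffsets = new ConcurrentHashMap<>();

Runtime.getRuntime().addShutdownHook(new Thread(() -> {
    // persist partition -> last processed offset before the JVM exits
    saveOffsetsToDb(processedOffsets);
    // close/shut down the consumer here as well for a clean exit
}));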

So if I understand you correctly, this should be doable with a standard Kafka Consumer.

Consumer<?, Message> consumer = ...

public Map<Integer, List<Message>> readLatestFromPartitions(String topic, Collection<Integer> partitions, int count) {

    // create the TopicPartitions we want to read
    List<TopicPartition> tps = partitions.stream().map(p -> new TopicPartition(topic, p)).collect(toList());
    consumer.assign(tps);
    // start from the beginning so that the whole topic is actually scanned
    consumer.seekToBeginning(tps);

    // create and initialize the result map
    Map<Integer, List<Message>> result = new HashMap<>();
    for (Integer i : partitions) { result.put(i, new ArrayList<>()); }

    // read until the expected count has been read for all partitions
    while (result.values().stream().anyMatch(l -> l.size() < count)) {
        // read until the end of the topic
        ConsumerRecords<?, Message> records = consumer.poll(Duration.ofSeconds(5));
        while (records.count() > 0) {
            Iterator<ConsumerRecord<?, Message>> recordIterator = records.iterator();
            while (recordIterator.hasNext()) {
                ConsumerRecord<?, Message> record = recordIterator.next();
                List<Message> addTo = result.get(record.partition());
                // only keep the last `count` entries per partition
                if (addTo.size() >= count) {
                    addTo.remove(0);
                }
                addTo.add(record.value());
            }
            records = consumer.poll(Duration.ofSeconds(5));
        }
        // now we have read the whole topic for the given partitions.
        // if all lists contain the expected count, the loop will finish;
        // otherwise it will wait for more data to arrive.
    }

    // the map now contains the messages in the order they were sent,
    // we want them reversed (LIFO)
    Map<Integer, List<Message>> returnValue = new HashMap<>();
    result.forEach((k, v) -> { Collections.reverse(v); returnValue.put(k, v); });
    return returnValue;
}

This can be achieved through a state store in Kafka Streams, which can be used by stream-processing applications to store and query data. The Kafka Streams DSL, for example, automatically creates and manages such state stores when you call stateful operators such as count() or aggregate(), or when you window a stream. A state store can be backed by a RocksDB database, an in-memory hash map, or some other data structure. You can place RocksDB on persistent storage (e.g. Portworx) to handle fault scenarios.

A Kafka Streams application typically runs on many application instances. Because Kafka Streams partitions the data for processing, an application's entire state is spread across the local state stores of its running instances. The Kafka Streams API lets you work with an application's state stores both locally (i.e., at the level of one instance of the application) and in their entirety (at the level of the "logical" application), for example through stateful operations such as count() or through Interactive Queries.
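
As a rough illustration (assuming a running KafkaStreams instance named streams and the Kafka Streams 2.x Interactive Queries API), reading such a store locally could look like this:

// Sketch: query the local key-value state store "uniqueName" (created below)
// through Interactive Queries on a started KafkaStreams instance `streams`.
ReadOnlyKeyValueStore<String, String> store =
        streams.store("uniqueName", QueryableStoreTypes.keyValueStore());

KeyValueIterator<String, String> all = store.all();
while (all.hasNext()) {
    KeyValue<String, String> entry = all.next();
    System.out.println(entry.key + " -> " + entry.value);
}
all.close();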

Below shows how to initialize the StateStore:

StoreBuilder<KeyValueStore<String, String>> statStore = Stores
                .keyValueStoreBuilder(Stores.persistentKeyValueStore("uniqueName"), Serdes.String(),
                        Serdes.String())
                .withLoggingDisabled(); // disable backing up the store to a change log topic

Below shows how to add the state store to the Kafka Streams topology:

Topology builder = new Topology();
builder.addSource("Source", topic)
       .addProcessor("SourceProcessName", () -> new ProcessorClass(), "Source")
       .addStateStore(statStore, "SourceProcessName")
       .addSink("SinkProcessName", sinkTopic, "SourceProcessName");

In the process method you can store the Kafka topic message as a key/value pair and later iterate over the store:

KeyValueStore<String, String> dsStore = (KeyValueStore<String, String>) context.getStateStore("statStore");
dsStore.put(key, value); // store the incoming message

KeyValueIterator<String, String> iter = dsStore.all();
while (iter.hasNext()) {
    KeyValue<String, String> entry = iter.next();
    // read back the stored entries here
}
iter.close();

--------------------------Updated------------------------

To keep the start offset and end offset inside the processor, we need to make a slight change to the Processor.

int lastNRecord = 10; // assume
// Get the start offset from a separate consumer, e.g.
// Map<TopicPartition, Long> offsets = consumer.beginningOffsets(topicPartitions);
int startOffsetIndex = ...;

// Pass this information to the Kafka Streams topology
builder.addSource("Source", topic)
       .addProcessor("ProcessWaferMapWaiting", () -> new ProcessorClass(lastNRecord, startOffsetIndex), "Source")
       .addStateStore(countStoreSupplier, "ProcessWaferMapWaiting")
       .addSink("SinkWaferMapWaiting", sinkTopic, "ProcessWaferMapWaiting");

In the Processor we need to track the stored offset for each key/value, so what I am thinking is that we can store the offset as the key, and for the value you could combine both the original key and value; it is entirely optional and depends on exactly what you need for manipulation. If the message value alone is sufficient, we can ignore the message key.

In that case the processor could look like the one below.

public class ProcessorClass implements Processor<String, String> {

    private static final Logger logger = LoggerFactory.getLogger(ProcessorClass.class); // SLF4J

    private int startOffsetIndex = 0;
    private long endOffsetIndex = 0;

    private ProcessorContext context;
    private KeyValueStore<Long, String> dsStore;
    private long intervalMs = 600000;
    private long waitMsEachAsCall = 100;
    private int lastNRecord = 10; // default

    // Get the start offset from the consumer and pass it to the processor
    public ProcessorClass(int lastNRecord, int startOffsetIndex) {
        this.lastNRecord = lastNRecord;
        this.startOffsetIndex = startOffsetIndex;
    }

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        this.context = context;

        dsStore = (KeyValueStore<Long, String>) context.getStateStore("statStore");
        this.context.schedule(intervalMs, PunctuationType.WALL_CLOCK_TIME, (timestamp) -> {
            KeyValueIterator<Long, String> iter = this.dsStore.all();

            while (iter.hasNext()) {
                KeyValue<Long, String> entry = iter.next();
                // Iterate and check whether the key matches startOffsetIndex; if yes,
                // loop from there for lastNRecord entries

                try {
                    // Sleep for some time before the next AS call
                    Thread.sleep(waitMsEachAsCall);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }

            iter.close();
            context.commit();
        });
    }

    @Override
    public void process(String key, String value) {
        // offset of the record currently being processed
        endOffsetIndex = context.offset();

        if (key != null) {
            dsStore.put(endOffsetIndex, key + "|" + value);
            logger.info("Adding key on state store: " + endOffsetIndex + "," + key + "," + value);
        }
    }

    @Override
    public void close() {
        // nothing to do
    }
}
