
How can I know that I have consumed all of a Kafka Topic?

I am using Flink v1.4.0. I am consuming data from a Kafka topic using a Flink Kafka Consumer, as per the code below:

final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "localhost:9092");
// only required for Kafka 0.8
properties.setProperty("zookeeper.connect", "localhost:2181");
properties.setProperty("group.id", "test");

DataStream<String> stream = env
    .addSource(new FlinkKafkaConsumer08<>("topic", new SimpleStringSchema(), properties));

FlinkKafkaConsumer08<String> myConsumer = new FlinkKafkaConsumer08<>(...);
myConsumer.setStartFromEarliest();     // start from the earliest record possible
myConsumer.setStartFromLatest();       // start from the latest record
myConsumer.setStartFromGroupOffsets(); // the default behaviour

DataStream<String> stream = env.addSource(myConsumer);
...

Is there a way of knowing whether I have consumed the whole of the Topic? How can I monitor the offset? (Is that an adequate way of confirming that I have consumed all the data from within the Kafka Topic?)

Since Kafka is typically used with continuous streams of data, consuming "all" of a topic may or may not be a meaningful concept. I suggest you look at the documentation on how Flink exposes Kafka's metrics, which includes this explanation:

The difference between the committed offset and the most recent offset in 
each partition is called the consumer lag. If the Flink topology is consuming 
the data slower from the topic than new data is added, the lag will increase 
and the consumer will fall behind. For large production deployments we 
recommend monitoring that metric to avoid increasing latency.

So, if the consumer lag is zero, you're caught up. That said, you might wish to be able to compare the offsets yourself, but I don't know of an easy way to do that.
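If you do want to compare offsets yourself, one option is to query Kafka directly from a small side program. Below is a minimal sketch, assuming a standalone kafka-clients consumer (0.10+ API, separate from the Flink job) pointed at the same "test" group id, and assuming your consumer group commits its offsets to Kafka (with the 0.8 connector they go to ZooKeeper, so there this is illustrative only). It sums the difference between each partition's latest offset and the group's committed offset:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class ConsumerLagCheck {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");
        props.setProperty("group.id", "test");
        props.setProperty("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.setProperty("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Collect all partitions of the topic
            List<TopicPartition> partitions = new ArrayList<>();
            consumer.partitionsFor("topic").forEach(p ->
                    partitions.add(new TopicPartition(p.topic(), p.partition())));

            // Latest offset per partition vs. the offset the group has committed
            Map<TopicPartition, Long> endOffsets = consumer.endOffsets(partitions);

            long totalLag = 0;
            for (TopicPartition tp : partitions) {
                OffsetAndMetadata committed = consumer.committed(tp);
                long committedOffset = (committed == null) ? 0L : committed.offset();
                totalLag += endOffsets.get(tp) - committedOffset;
            }
            // A total lag of zero means the group has caught up with the end of the topic
            System.out.println("total lag: " + totalLag);
        }
    }
}

Keep in mind that Flink commits offsets back to Kafka/ZooKeeper only on checkpoints (or via the connector's periodic auto-commit), so the committed offset can briefly trail what the job has actually processed.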

Kafka is used as a streaming source, and a stream does not have an end.

If I'm not wrong, Flink's Kafka connector polls the topic every X milliseconds, because all Kafka consumers are active consumers; Kafka does not notify you when there is new data in a topic.

So, in your case, you could just set a timeout: if you don't read any data within that time, you have read all of the data in your topic.

Anyway, if you need to read a finite batch of data, you can use one of Flink's windows or introduce some kind of marker records into your Kafka topic to delimit the start and the end of the batch (see the sketch below).
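As one concrete way to implement the marker idea: Flink's Kafka consumer stops reading once the DeserializationSchema's isEndOfStream() method returns true for a record. A schema like the sketch below could delimit the end of a batch; the "END" marker value and the class name are made up for this example, and the exact DeserializationSchema package differs between Flink versions:

import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.flink.api.common.typeinfo.BasicTypeInfo;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.streaming.util.serialization.DeserializationSchema;

// String schema that tells the Flink Kafka consumer to stop
// when it sees a record whose value is the marker "END".
public class StringSchemaWithEndMarker implements DeserializationSchema<String> {

    @Override
    public String deserialize(byte[] message) throws IOException {
        return new String(message, StandardCharsets.UTF_8);
    }

    @Override
    public boolean isEndOfStream(String nextElement) {
        // Returning true makes the consumer stop fetching further records
        return "END".equals(nextElement);
    }

    @Override
    public TypeInformation<String> getProducedType() {
        return BasicTypeInfo.STRING_TYPE_INFO;
    }
}

You would then plug it into the consumer instead of SimpleStringSchema:

DataStream<String> stream = env.addSource(
        new FlinkKafkaConsumer08<>("topic", new StringSchemaWithEndMarker(), properties));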
