
Apache Kafka - KafkaStream on topic/partition

I am writing a Kafka consumer for a high-volume, high-velocity distributed application. I have only one topic, but the rate of incoming messages is very high. Having multiple partitions serving more consumers would be appropriate for this use case, and the best way to consume would be to have multiple stream readers. According to the documentation and the available samples, the number of KafkaStreams the ConsumerConnector gives out is based on the number of topics. How do I get more than one KafkaStream reader (one per partition), so that I can spawn one thread per stream? Or would reading from the same KafkaStream in multiple threads achieve concurrent reads from multiple partitions?

Any insights are much appreciated.

I would like to share what I found on the mailing list:

The number that you pass in the topic map controls how many streams a topic is divided into. In your case, if you pass in 1, all 10 partitions' data will be fed into 1 stream. If you pass in 2, each of the 2 streams will get data from 5 partitions. If you pass in 11, 10 of them will each get data from 1 partition and 1 stream will get nothing.
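The split described above can be sketched in plain Java. This is a hypothetical round-robin assignment purely for illustration (Kafka's actual range-based assignment may group partitions differently), but the number of partitions each stream ends up with comes out the same:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class StreamAssignment {

    // Hypothetical round-robin split of partitions across streams, used only
    // to illustrate how many partitions each stream ends up with.
    static Map<Integer, List<Integer>> assign(int partitions, int streams) {
        Map<Integer, List<Integer>> result = new HashMap<>();
        for (int s = 0; s < streams; s++) {
            result.put(s, new ArrayList<Integer>());
        }
        for (int p = 0; p < partitions; p++) {
            result.get(p % streams).add(p);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(assign(10, 1));  // one stream holds all 10 partitions
        System.out.println(assign(10, 2));  // 5 partitions per stream
        System.out.println(assign(10, 11)); // stream 10 receives nothing
    }
}
```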

Typically, you need to iterate over each stream in its own thread, because a stream can block forever when there are no new events.

Sample snippet:

Map<String, Integer> topicCount = new HashMap<String, Integer>();
topicCount.put(msgTopic, Integer.valueOf(partitionCount)); // number of streams for this topic

Map<String, List<KafkaStream<byte[], byte[]>>> consumerStreams =
        connector.createMessageStreams(topicCount);
List<KafkaStream<byte[], byte[]>> streams = consumerStreams.get(msgTopic);

// One thread per stream: iterating a stream blocks when no messages are available.
for (final KafkaStream<byte[], byte[]> stream : streams) {
    ReadTask task = new ReadTask(stream, msgTopic);
    task.addObserver(this.msgObserver);
    tasks.add(task);
    executor.submit(task);
}

Reference: http://mail-archives.apache.org/mod_mbox/incubator-kafka-users/201201.mbox/%3CCA+sHyy_Z903dOmnjp7_yYR_aE2sRW-x7XpAnqkmWaP66GOqf6w@mail.gmail.com%3E

The recommended way to do this is to use a thread pool, so that Java handles the thread management for you, and to consume each stream that createMessageStreamsByFilter returns in its own Runnable. For example:

int NUMBER_OF_PARTITIONS = 6;

Properties consumerConfig = new Properties();
consumerConfig.put("zk.connect", "zookeeper.mydomain.com:2181");
consumerConfig.put("backoff.increment.ms", "100");
consumerConfig.put("autooffset.reset", "largest");
consumerConfig.put("groupid", "java-consumer-example");
consumer = Consumer.createJavaConsumerConnector(new ConsumerConfig(consumerConfig));

// Subscribe to every topic matching the whitelist, asking for 6 streams in total.
TopicFilter sourceTopicFilter = new Whitelist("mytopic|myothertopic");
List<KafkaStream<Message>> streams =
        consumer.createMessageStreamsByFilter(sourceTopicFilter, NUMBER_OF_PARTITIONS);

// One thread per stream; iterating a stream blocks until a message arrives.
ExecutorService executor = Executors.newFixedThreadPool(streams.size());
for (final KafkaStream<Message> stream : streams) {
    executor.submit(new Runnable() {
        public void run() {
            for (MessageAndMetadata<Message> msgAndMetadata : stream) {
                ByteBuffer buffer = msgAndMetadata.message().payload();
                byte[] bytes = new byte[buffer.remaining()];
                buffer.get(bytes);
                // Do something with the bytes you just got off Kafka.
            }
        }
    });
}

In this example I asked for 6 streams because I know that I have 3 partitions for each topic and I listed two topics in my whitelist. Once we have handles on the incoming streams we can iterate over their content, which consists of MessageAndMetadata objects. The metadata is really just the topic name and offset. As you discovered, you can do it all in a single thread if you ask for 1 stream instead of 6, but if you require parallel processing the nice way is to launch an executor with one thread for each returned stream.

/**
 * @param source : source KStream to sink to "output-topic"
 */
private static void pipe(KStream<String, String> source) {
    source.to(Serdes.String(), Serdes.String(), new StreamPartitioner<String, String>() {

        @Override
        public Integer partition(String key, String value, int numPartitions) {
            return 0; // send every record to partition 0
        }
    }, "output-topic");
}

The above code will write every record to partition 0 of the topic "output-topic".
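If you want records spread across partitions instead of pinned to a single one, the partitioner can derive the target from the record key. Here is a minimal, self-contained sketch of that routing logic, using plain `hashCode` for illustration (Kafka's default partitioner actually uses murmur2 on the serialized key bytes):

```java
public class KeyPartitioning {

    // Map a record key to a partition deterministically: same key, same partition.
    static int partitionForKey(String key, int numPartitions) {
        // Mask off the sign bit so the modulo result is never negative.
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        int p = partitionForKey("order-42", 6);
        System.out.println("order-42 -> partition " + p);
        // The same key always lands on the same partition, preserving per-key ordering.
        System.out.println(p == partitionForKey("order-42", 6));
    }
}
```

Inside a StreamPartitioner, the same one-liner could replace the hard-coded `return 0;`, turning the fixed-partition sink above into key-based routing.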
