Apache Spark - Parallel Processing of messages from Kafka - Java

JavaPairReceiverInputDStream<String, byte[]> messages = KafkaUtils.createStream(...);
JavaPairDStream<String, byte[]> filteredMessages = filterValidMessages(messages);

JavaDStream<String> useCase1 = calculateUseCase1(filteredMessages);
JavaDStream<String> useCase2 = calculateUseCase2(filteredMessages);
JavaDStream<String> useCase3 = calculateUseCase3(filteredMessages);
JavaDStream<String> useCase4 = calculateUseCase4(filteredMessages);

...

I retrieve messages from Kafka, filter them, and use the same filtered messages for multiple use cases. Here useCase1 to useCase4 are independent of each other and can be calculated in parallel. However, when I look at the logs, I see that the calculations happen sequentially. How can I make them run in parallel? Any suggestions would be helpful.

Try creating a separate Kafka topic for each of your 4 use cases, then create 4 different Kafka DStreams, one per topic.
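As a rough illustration of the per-topic setup, here is a minimal plain-Java sketch that builds one topic map per use case. The topic names (usecase1 to usecase4), the class name and the helper method are hypothetical and not from the original post. Each map has the shape (topic name to number of consumer threads) that the receiver-based KafkaUtils.createStream overloads take as their topics argument. The Spark call itself is only shown as a comment, since its remaining arguments are elided in the question.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class UseCaseTopics {
    // Hypothetical topic names, one per use case (assumption, not from the post).
    static final List<String> TOPICS =
        Arrays.asList("usecase1", "usecase2", "usecase3", "usecase4");

    // Builds one single-entry topic map per use case:
    // topic name -> number of consumer threads for that topic.
    static Map<String, Map<String, Integer>> topicMapsPerUseCase(int threadsPerTopic) {
        Map<String, Map<String, Integer>> result = new LinkedHashMap<>();
        for (String topic : TOPICS) {
            result.put(topic, Collections.singletonMap(topic, threadsPerTopic));
        }
        return result;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, Map<String, Integer>> e : topicMapsPerUseCase(1).entrySet()) {
            // In the Spark job, each map would be passed to its own
            // KafkaUtils.createStream(...) call, giving one DStream per use case.
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

Keep in mind that each receiver-based stream occupies one core on an executor, so four streams need more than four cores in total for the processing itself to make progress.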

I moved all the code inside a for loop that iterates over the number of partitions in the Kafka topic, and I see an improvement.

// Create one receiver-based input stream per Kafka partition.
for (int i = 0; i < numOfPartitions; i++) {
    JavaPairReceiverInputDStream<String, byte[]> messages =
        KafkaUtils.createStream(...);
    JavaPairDStream<String, byte[]> filteredMessages =
        filterValidMessages(messages);

    // Each stream feeds the same four independent use cases.
    JavaDStream<String> useCase1 = calculateUseCase1(filteredMessages);
    JavaDStream<String> useCase2 = calculateUseCase2(filteredMessages);
    JavaDStream<String> useCase3 = calculateUseCase3(filteredMessages);
    JavaDStream<String> useCase4 = calculateUseCase4(filteredMessages);
}

Reference: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
