
Spark Streaming Kafka


I am trying to read from a Kafka topic that has been set up by another team. The topic is balanced across multiple partitions. By this I mean that every new line is sent to a separate partition. One message spans multiple lines, so the message is split between the two partitions.

For example:

partition 1:
"message1: details1 details1"
"message2: details2 details2"

partition 2:
"details1 details1"
"details2 details2"

When I read the topic with createDirectStream(ssc, kafkaparams, fromoffsets, messagehandler), I get the RDDs in the order shown above.

What I would like to get is:

"message1: details1 details1"
"details1 details1"
"message2: details2 details2"
"details2 details2"

Thanks for any help.

If the ordering inside each partition is guaranteed, so that element x in partition 1 relates to element x in partition 2, you could order the RDD elements by partition number and by each element's index inside its partition iterator (zipWithIndex).

This would allow you to "re-sync" across partitions.
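A minimal sketch of that re-sync idea in plain Python, leaving Spark out so the logic is easy to see. The `resync` helper and the partition contents are hypothetical; in a real job you would obtain the (partition index, element index) tags with `rdd.mapPartitionsWithIndex` plus a per-partition counter (the equivalent of `zipWithIndex`), then sort by element index first and partition index second:

```python
def resync(partitions):
    """Interleave per-partition lists so that element x of every
    partition appears before element x+1 of any partition."""
    # Tag each value with (element_index, partition_index), mimicking
    # what zipWithIndex inside mapPartitionsWithIndex would produce.
    tagged = [
        (elem_idx, part_idx, value)
        for part_idx, part in enumerate(partitions)
        for elem_idx, value in enumerate(part)
    ]
    # Sorting on (element_index, partition_index) interleaves the
    # partitions in lock-step.
    return [value for _, _, value in sorted(tagged)]

partition1 = ["message1: details1 details1", "message2: details2 details2"]
partition2 = ["details1 details1", "details2 details2"]

print(resync([partition1, partition2]))
# → ['message1: details1 details1', 'details1 details1',
#    'message2: details2 details2', 'details2 details2']
```

Note this only works if element x really is written to every partition in the same order; if the producer can skip a partition for some message, the indexes drift and a key inside the message itself is the only safe join criterion.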
