
Spark Streaming Kafka


I am trying to read from a Kafka topic that has been set up by another team. The topic is balanced across multiple partitions. By this I mean that every new line is sent to a separate partition. One message spans multiple lines, so the message is split between the two partitions.

For example:

partition 1:
"message1: details1 details1"
"message2: details2 details2"

partition 2:
"details1 details1"
"details2 details2"

When I read the topic with createDirectStream(ssc, kafkaparams, fromoffsets, messagehandler), I get the RDDs in the order shown above.

What I would like to get is:

"message1: details1 details1"
"details1 details1"
"message2: details2 details2"
"details2 details2"

Thanks for any help.

If the ordering inside each partition is guaranteed, so that element x in partition 1 relates to element x in partition 2, you could order the RDD elements by partition number and by each element's index inside its partition iterator (zipWithIndex).

This would allow you to "re-sync" across partitions.
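A minimal sketch of that re-sync idea in plain Python, leaving Spark out so the logic is easy to see. The `resync` helper and the partition contents are hypothetical; in a real job you would obtain the (partition index, element index) tags with `rdd.mapPartitionsWithIndex` plus a per-partition counter (the equivalent of `zipWithIndex`), then sort by element index first and partition index second:

```python
def resync(partitions):
    """Interleave per-partition lists so that element x of every
    partition appears before element x+1 of any partition."""
    # Tag each value with (element_index, partition_index), mimicking
    # what zipWithIndex inside mapPartitionsWithIndex would produce.
    tagged = [
        (elem_idx, part_idx, value)
        for part_idx, part in enumerate(partitions)
        for elem_idx, value in enumerate(part)
    ]
    # Sorting on (element_index, partition_index) interleaves the
    # partitions in lock-step.
    return [value for _, _, value in sorted(tagged)]

partition1 = ["message1: details1 details1", "message2: details2 details2"]
partition2 = ["details1 details1", "details2 details2"]

print(resync([partition1, partition2]))
# → ['message1: details1 details1', 'details1 details1',
#    'message2: details2 details2', 'details2 details2']
```

Note this only works if element x really is written to every partition in the same order; if the producer can skip a partition for some message, the indexes drift and a key inside the message itself is the only safe join criterion.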
