Spark Streaming Kafka
I am trying to read from a Kafka topic that has been set up by another team. The topic is balanced across multiple partitions. By this I mean that every new line is sent to a separate partition. One message spans multiple lines, so the message is split between the two partitions.
ex:
partition 1:
"message1: details1 details1"
"message2: details2 details2"
partition 2:
"details1 details1"
"details2 details2"
When I read the topic with createDirectStream(ssc, kafkaparams, fromoffsets, messagehandler), I get the RDDs in the order shown above.
What I would like to do is get:
"message1: details1 details1"
"details1 details1"
"message2: details2 details2"
"details2 details2"
Thanks for any help I receive.
If the ordering inside each partition is guaranteed, so that element x in partition 1 relates to element x in partition 2, you could order the RDD elements based on the partition number and the element index inside each partition's iterator (zipWithIndex). This would allow you to "re-sync" across partitions.
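A minimal sketch of that idea, using plain Python lists to stand in for the per-partition iterators (the sample lines are taken from the example in the question; in a real Spark job you would obtain the partition number from `mapPartitionsWithIndex` and the element index with `enumerate` over each partition's iterator, then `sortBy` the composite key — this simulation only demonstrates the sort-key logic, not the Spark API):

```python
# Simulated partition contents, matching the example above.
partition1 = ["message1: details1 details1", "message2: details2 details2"]
partition2 = ["details1 details1", "details2 details2"]

# Tag every element with (element_index, partition_number).
# element_index plays the role of zipWithIndex; partition_number
# plays the role of the index from mapPartitionsWithIndex.
tagged = []
for part_num, part in enumerate([partition1, partition2]):
    for elem_idx, line in enumerate(part):
        tagged.append(((elem_idx, part_num), line))

# Sorting by (element_index, partition_number) interleaves the partitions,
# re-pairing each header line with its continuation line.
resynced = [line for _, line in sorted(tagged)]
print(resynced)
```

Sorting on `(element_index, partition_number)` rather than the reverse is what interleaves the two partitions instead of concatenating them; this only produces the desired pairing if the relative order within each partition is indeed guaranteed.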