[英]spark streaming join kafka topics
We have two InputDStream
from two Kafka topics, but we have to join the data of these two input together. 我们有两个Kafka主题中的两个
InputDStream
,但是我们必须将这两个输入的数据连接在一起。 The problem is that each InputDStream
is processed independently, because of the foreachRDD
, nothing can be returned, to join
after. 问题是,每个
InputDStream
独立处理,因为的foreachRDD
,可以返回什么,给join
后。
var Message1ListBuffer = new ListBuffer[Message1]
var Message2ListBuffer = new ListBuffer[Message2]
inputDStream1.foreachRDD(rdd => {
if (!rdd.partitions.isEmpty) {
val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
rdd.map({ msg =>
val r = msg.value()
val avro = AvroUtils.objectToAvro(r.getSchema, r)
val messageValue = AvroInputStream.json[FMessage1](avro.getBytes("UTF-8")).singleEntity.get
Message1ListBuffer = Message1FlatMapper.flatmap(messageValue)
Message1ListBuffer
})
inputDStream1.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
})
inputDStream2.foreachRDD(rdd => {
if (!rdd.partitions.isEmpty) {
val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
rdd.map({ msg =>
val r = msg.value()
val avro = AvroUtils.objectToAvro(r.getSchema, r)
val messageValue = AvroInputStream.json[FMessage2](avro.getBytes("UTF-8")).singleEntity.get
Message2ListBuffer = Message1FlatMapper.flatmap(messageValue)
Message2ListBuffer
})
inputDStream2.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
})
I thought I could return Message1ListBuffer and Message2ListBuffer, turn them into dataframes and join them. 我以为我可以返回Message1ListBuffer和Message2ListBuffer,将它们转换为数据帧并加入它们。 But that does not work, and I do not think it's the best choice
但这是行不通的,我认为这不是最佳选择
From there, what is the way to return the rdd of each foreachRDD in order to make a join? 从那里,返回每个foreachRDD的rdd以便进行联接的方法是什么?
inputDStream1.foreachRDD(rdd => {
})
inputDStream2.foreachRDD(rdd => {
})
Not sure about the Spark version you are using, with Spark 2.3+, it can be achieved directly. 不确定您使用的Spark版本是否为Spark 2.3+,可以直接实现。
val ds1 = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "brokerhost1:port1,brokerhost2:port2")
.option("subscribe", "source-topic1")
.option("startingOffsets", "earliest")
.option("endingOffsets", "latest")
.load
val ds2 = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "brokerhost1:port1,brokerhost2:port2")
.option("subscribe", "source-topic2")
.option("startingOffsets", "earliest")
.option("endingOffsets", "latest")
.load
val stream1 = ds1.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
.as[(String, String)]
val stream2 = ds2.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
.as[(String, String)]
resultStream = stream1.join(stream2)
more join operations here 这里更多的加入操作
Warning:
警告:
Delay records will not get a join match.
延迟记录将不会获得联接匹配。 Need to tweak buffer a bit.
需要调整缓冲一点。 more information found here
在这里找到更多信息
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.