
Spark Streaming join Kafka topics

We have two InputDStreams from two Kafka topics, and we need to join the data of these two inputs together. The problem is that each InputDStream is processed independently inside foreachRDD; nothing can be returned from it to join afterwards.

  var Message1ListBuffer = new ListBuffer[Message1]
  var Message2ListBuffer = new ListBuffer[Message2]

  inputDStream1.foreachRDD(rdd => {
    if (!rdd.partitions.isEmpty) {
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      // Note: map is a lazy transformation, so without an action this never
      // executes, and reassigning a driver-side ListBuffer from executor
      // code has no effect on the driver anyway.
      rdd.map({ msg =>
        val r = msg.value()
        val avro = AvroUtils.objectToAvro(r.getSchema, r)
        val messageValue = AvroInputStream.json[FMessage1](avro.getBytes("UTF-8")).singleEntity.get
        Message1ListBuffer = Message1FlatMapper.flatmap(messageValue)
        Message1ListBuffer
      })
      inputDStream1.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
    }
  })

  inputDStream2.foreachRDD(rdd => {
    if (!rdd.partitions.isEmpty) {
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      rdd.map({ msg =>
        val r = msg.value()
        val avro = AvroUtils.objectToAvro(r.getSchema, r)
        val messageValue = AvroInputStream.json[FMessage2](avro.getBytes("UTF-8")).singleEntity.get
        Message2ListBuffer = Message2FlatMapper.flatmap(messageValue)
        Message2ListBuffer
      })
      inputDStream2.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
    }
  })

I thought I could return Message1ListBuffer and Message2ListBuffer, turn them into DataFrames and join them. But that does not work, and I do not think it is the best choice.

From there, what is the way to return the RDD of each foreachRDD in order to make a join?

inputDStream1.foreachRDD(rdd => {

})


inputDStream2.foreachRDD(rdd => {

})
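One way to stay at the DStream level (a sketch, not from the original answer): instead of collecting results inside foreachRDD, key each stream with transform and use DStream.join, which joins the RDDs of the two streams batch by batch. The parse1/parse2 helpers below are hypothetical stand-ins for the Avro decoding shown above; they are assumed to return (key, message) pairs.

```scala
// Sketch only: parse1/parse2 are assumed helpers wrapping the Avro decoding
// above, each returning a (key, message) pair so the streams can be joined.
val keyed1 = inputDStream1.transform(rdd => rdd.map(msg => parse1(msg.value())))  // DStream[(K, Message1)]
val keyed2 = inputDStream2.transform(rdd => rdd.map(msg => parse2(msg.value())))  // DStream[(K, Message2)]

// Joins the RDDs of each batch pair-wise on the key.
val joined = keyed1.join(keyed2)  // DStream[(K, (Message1, Message2))]

joined.foreachRDD { rdd =>
  // Act on the joined batch here (an action is needed, or nothing executes).
  rdd.foreach(println)
}
```

Note that DStream.join only matches records that land in the same batch interval; matching across batches requires windowed streams (join after window()).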

Not sure which Spark version you are using; with Spark 2.3+, this can be achieved directly with Structured Streaming's stream-stream joins.

With Spark >= 2.3

Subscribe to the 2 topics you want to join

// Note: the endingOffsets option only applies to batch queries (spark.read),
// not to readStream, so it is omitted here.
val ds1 = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "brokerhost1:port1,brokerhost2:port2")
  .option("subscribe", "source-topic1")
  .option("startingOffsets", "earliest")
  .load

val ds2 = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "brokerhost1:port1,brokerhost2:port2")
  .option("subscribe", "source-topic2")
  .option("startingOffsets", "earliest")
  .load

Format the subscribed messages in both streams

val stream1 = ds1.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as[(String, String)]

val stream2 = ds2.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as[(String, String)]

Join both the streams

// A join condition is required for stream-stream joins; here on the Kafka message key.
val resultStream = stream1.join(stream2, "key")

More join operations are described in the Structured Streaming programming guide.

Warning:

Late records will not get a join match; the buffering needs to be tweaked with watermarks and event-time constraints. More information can be found in the Structured Streaming documentation on stream-stream joins.
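The late-record buffering mentioned above is controlled with watermarks. A minimal sketch, assuming each stream has been given an event-time column (ts1/ts2) and distinct key columns (key1/key2); the 10-minute thresholds are illustrative, not from the original answer:

```scala
import org.apache.spark.sql.functions.expr

// Declare how late data may arrive on each stream; state older than the
// watermark can then be dropped instead of being buffered indefinitely.
val s1 = stream1WithTs.withWatermark("ts1", "10 minutes")
val s2 = stream2WithTs.withWatermark("ts2", "10 minutes")

// Join on the key plus an event-time constraint, so Spark knows how long
// to keep each side buffered while waiting for a match.
val joined = s1.join(
  s2,
  expr("key1 = key2 AND ts2 BETWEEN ts1 AND ts1 + interval 10 minutes")
)
```

Without both the watermarks and the time-range condition, Spark has to buffer all past input for each stream, and state grows without bound.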
