
Loading data from Spark Structured Streaming into ArrayList

I need to send data from Kafka to Kinesis Firehose. I am processing the Kafka data with Spark Structured Streaming. I am not sure how to collect the dataset of the streaming query into an ArrayList variable, say recordList, of e.g. 100 records (it could be any other value), and then call the Firehose API's putRecordBatch(recordList) to put the records into Firehose.

I think you want to check out Foreach and ForeachBatch, depending on your Spark version: ForeachBatch arrives in v2.4.0, while foreach is available before v2.4.0. If there is no streaming-sink implementation available for Kinesis Firehose, then you should write your own implementation of ForeachWriter. Databricks has some nice examples of using foreach to create custom writers.
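Since foreachBatch isn't shown below, here is a rough sketch of what that route could look like on Spark 2.4+: each micro-batch arrives as a plain DataFrame that you can collect on the driver and send to Firehose in groups of up to 500 records. This is a minimal sketch, not a tested implementation; it assumes the AWS SDK v1 Firehose client, a hypothetical delivery stream name, and the streamingSelectDF from the usage snippet further down.

import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets
import com.amazonaws.services.kinesisfirehose.AmazonKinesisFirehoseClientBuilder
import com.amazonaws.services.kinesisfirehose.model.{PutRecordBatchRequest, Record}
import org.apache.spark.sql.DataFrame

// assumes default credentials/region from the environment
val firehoseClient = AmazonKinesisFirehoseClientBuilder.defaultClient()

streamingSelectDF
  .writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    // collect() pulls the whole micro-batch to the driver, fine for modest volumes
    batchDF.toJSON.collect().grouped(500).foreach { group => // 500 = PutRecordBatch max
      val records = new java.util.ArrayList[Record]()
      group.foreach { json =>
        records.add(new Record().withData(ByteBuffer.wrap(json.getBytes(StandardCharsets.UTF_8))))
      }
      firehoseClient.putRecordBatch(
        new PutRecordBatchRequest()
          .withDeliveryStreamName("my-delivery-stream") // hypothetical stream name
          .withRecords(records))
    }
  }
  .start()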

I haven't ever used Kinesis, but here is an example of what your custom sink might look like.

case class MyConfigInfo(info1: String, info2: String)

class KinesisSink(configInfo: MyConfigInfo) extends ForeachWriter[(String, String)] {
  // must be a var with a declared type so it can be assigned in open();
  // assuming the KPL class com.amazonaws.services.kinesis.producer.KinesisProducer
  var kinesisProducer: KinesisProducer = _

  def open(partitionId: Long, version: Long): Boolean = {
    kinesisProducer = ??? // set up the Kinesis producer using configInfo
    true
  }

  def process(value: (String, String)): Unit = {
    // ask kinesisProducer to send data
  }

  def close(errorOrNull: Throwable): Unit = {
    // close the Kinesis producer
  }
}

If you're using the AWS kinesisfirehose API, you might do something like this:

case class MyConfigInfo(info1: String, info2: String)

class KinesisSink(configInfo: MyConfigInfo) extends ForeachWriter[(String, String)] {
  // AWS SDK v1 client (com.amazonaws.services.kinesisfirehose.AmazonKinesisFirehose)
  var firehoseClient: AmazonKinesisFirehose = _
  // buffer records until there are enough for one PutRecordBatch call
  val records = new java.util.ArrayList[Record]()
  val recordLimit = 500 // PutRecordBatch accepts at most 500 records per call

  def open(partitionId: Long, version: Long): Boolean = {
    firehoseClient = ??? // set up the Firehose client using configInfo
    true
  }

  def process(value: (String, String)): Unit = {
    val record: Record = ??? // create a Record out of value
    records.add(record)
    if (records.size >= recordLimit) flush()
  }

  def close(errorOrNull: Throwable): Unit = {
    // flush whatever is still buffered so the tail of the stream isn't lost,
    // then close the Firehose client
    flush()
  }

  private def flush(): Unit = {
    if (!records.isEmpty) {
      val req = new PutRecordBatchRequest()
        .withDeliveryStreamName(???) // assuming the stream name comes from configInfo
        .withRecords(records)
      firehoseClient.putRecordBatch(req)
      records.clear()
    }
  }
}
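A note on sizing recordLimit: PutRecordBatch takes at most 500 records and 4 MiB per call, so 500 is the ceiling and smaller batches (like the 100 records in the question) are fine too. Also, the call can partially fail, so in a real implementation you'd want to check FailedPutCount in the response and retry the failed records.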

Then you'd use it as such:

val writer = new KinesisSink(configuration)
val query =
  streamingSelectDF
    .writeStream
    .foreach(writer)
    .outputMode("update")
    .trigger(Trigger.ProcessingTime("25 seconds")) // org.apache.spark.sql.streaming.Trigger
    .start()
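One thing to keep in mind with this usage: Spark gives each partition of each trigger its own fresh copy of the ForeachWriter and calls open, then process for every row, then close. So every partition keeps its own record buffer, which is why flushing the leftovers in close matters; otherwise the tail of each partition's batch would be dropped.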
