
Loading data from Spark Structured Streaming into ArrayList

I need to send data from Kafka to Kinesis Firehose. I am processing the Kafka data with Spark Structured Streaming. I am not sure how to collect the dataset of the streaming query into an ArrayList variable, say recordList, of e.g. 100 records (it could be any other value), and then call the Firehose API's putRecordBatch(recordList) to put the records into Firehose.

I think you want to check out Foreach and ForeachBatch, depending on your Spark version: ForeachBatch arrives in v2.4.0, while foreach is available before v2.4.0. If there is no streaming-sink implementation available for Kinesis Firehose, then you should write your own implementation of ForeachWriter. Databricks has some nice examples of using foreach to create custom writers.
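Since foreachBatch isn't shown below, here is a rough sketch of what that route could look like on Spark 2.4+: each micro-batch arrives as a plain DataFrame that you can collect on the driver and send to Firehose in groups of up to 500 records. This is a minimal sketch, not a tested implementation; it assumes the AWS SDK v1 Firehose client, a hypothetical delivery stream name, and the streamingSelectDF from the usage snippet further down.

import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets
import com.amazonaws.services.kinesisfirehose.AmazonKinesisFirehoseClientBuilder
import com.amazonaws.services.kinesisfirehose.model.{PutRecordBatchRequest, Record}
import org.apache.spark.sql.DataFrame

// assumes default credentials/region from the environment
val firehoseClient = AmazonKinesisFirehoseClientBuilder.defaultClient()

streamingSelectDF
  .writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    // collect() pulls the whole micro-batch to the driver, fine for modest volumes
    batchDF.toJSON.collect().grouped(500).foreach { group => // 500 = PutRecordBatch max
      val records = new java.util.ArrayList[Record]()
      group.foreach { json =>
        records.add(new Record().withData(ByteBuffer.wrap(json.getBytes(StandardCharsets.UTF_8))))
      }
      firehoseClient.putRecordBatch(
        new PutRecordBatchRequest()
          .withDeliveryStreamName("my-delivery-stream") // hypothetical stream name
          .withRecords(records))
    }
  }
  .start()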

I haven't ever used Kinesis, but here is an example of what your custom sink might look like.

case class MyConfigInfo(info1: String, info2: String)

class KinesisSink(configInfo: MyConfigInfo) extends ForeachWriter[(String, String)] {
  // must be a var with a declared type so it can be assigned in open();
  // assuming the KPL class com.amazonaws.services.kinesis.producer.KinesisProducer
  var kinesisProducer: KinesisProducer = _

  def open(partitionId: Long, version: Long): Boolean = {
    kinesisProducer = ??? // set up the Kinesis producer using configInfo
    true
  }

  def process(value: (String, String)): Unit = {
    // ask kinesisProducer to send data
  }

  def close(errorOrNull: Throwable): Unit = {
    // close the Kinesis producer
  }
}

If you're using the AWS kinesisfirehose API, you might do something like this:

case class MyConfigInfo(info1: String, info2: String)

class KinesisSink(configInfo: MyConfigInfo) extends ForeachWriter[(String, String)] {
  // AWS SDK v1 client (com.amazonaws.services.kinesisfirehose.AmazonKinesisFirehose)
  var firehoseClient: AmazonKinesisFirehose = _
  // buffer records until there are enough for one PutRecordBatch call
  val records = new java.util.ArrayList[Record]()
  val recordLimit = 500 // PutRecordBatch accepts at most 500 records per call

  def open(partitionId: Long, version: Long): Boolean = {
    firehoseClient = ??? // set up the Firehose client using configInfo
    true
  }

  def process(value: (String, String)): Unit = {
    val record: Record = ??? // create a Record out of value
    records.add(record)
    if (records.size >= recordLimit) flush()
  }

  def close(errorOrNull: Throwable): Unit = {
    // flush whatever is still buffered so the tail of the stream isn't lost,
    // then close the Firehose client
    flush()
  }

  private def flush(): Unit = {
    if (!records.isEmpty) {
      val req = new PutRecordBatchRequest()
        .withDeliveryStreamName(???) // assuming the stream name comes from configInfo
        .withRecords(records)
      firehoseClient.putRecordBatch(req)
      records.clear()
    }
  }
}
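A note on sizing recordLimit: PutRecordBatch takes at most 500 records and 4 MiB per call, so 500 is the ceiling and smaller batches (like the 100 records in the question) are fine too. Also, the call can partially fail, so in a real implementation you'd want to check FailedPutCount in the response and retry the failed records.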

Then you'd use it as such:

val writer = new KinesisSink(configuration)
val query =
  streamingSelectDF
    .writeStream
    .foreach(writer)
    .outputMode("update")
    .trigger(Trigger.ProcessingTime("25 seconds")) // org.apache.spark.sql.streaming.Trigger
    .start()
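One thing to keep in mind with this usage: Spark gives each partition of each trigger its own fresh copy of the ForeachWriter and calls open, then process for every row, then close. So every partition keeps its own record buffer, which is why flushing the leftovers in close matters; otherwise the tail of each partition's batch would be dropped.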
