
How to create a KinesisSink for Spark Structured Streaming

I am using Spark 2.2 on Databricks and trying to implement a Kinesis sink to write from Spark to a Kinesis stream.

I am using the sample provided here: https://docs.databricks.com/_static/notebooks/structured-streaming-kinesis-sink.html

// Imports required by the sink (AWS SDK v1 Kinesis client + Spark)
import java.nio.ByteBuffer

import scala.collection.mutable.ArrayBuffer

import com.amazonaws.auth.{AWSStaticCredentialsProvider, BasicAWSCredentials}
import com.amazonaws.services.kinesis.{AmazonKinesis, AmazonKinesisClientBuilder}
import com.amazonaws.services.kinesis.model.{PutRecordsRequest, PutRecordsRequestEntry}
import org.apache.spark.sql.ForeachWriter

/**
 * A simple Sink that writes to the given Amazon Kinesis `stream` in the given `region`. For authentication, users may provide
 * `awsAccessKey` and `awsSecretKey`, or use IAM Roles when launching their cluster.
 *
 * This Sink takes a two column Dataset, with the columns being the `partitionKey`, and the `data` respectively.
 * We will buffer data up to `maxBufferSize` before flushing to Kinesis in order to reduce cost.
 */
class KinesisSink(
    stream: String,
    region: String,
    awsAccessKey: Option[String] = None,
    awsSecretKey: Option[String] = None) extends ForeachWriter[(String, Array[Byte])] {

  // Configurations
  private val maxBufferSize = 500 * 1024 // 500 KB

  private var client: AmazonKinesis = _
  private val buffer = new ArrayBuffer[PutRecordsRequestEntry]()
  private var bufferSize: Long = 0L

  override def open(partitionId: Long, version: Long): Boolean = {
    client = createClient
    true
  }

  override def process(value: (String, Array[Byte])): Unit = {
    val (partitionKey, data) = value
    // Maximum of 500 records can be sent with a single `putRecords` request
    if ((data.length + bufferSize > maxBufferSize && buffer.nonEmpty) || buffer.length == 500) {
      flush()
    }
    buffer += new PutRecordsRequestEntry().withPartitionKey(partitionKey).withData(ByteBuffer.wrap(data))
    bufferSize += data.length
  }

  override def close(errorOrNull: Throwable): Unit = {
    if (buffer.nonEmpty) {
      flush()
    }
    client.shutdown()
  }

  /** Flush the buffer to Kinesis */
  private def flush(): Unit = {
    val recordRequest = new PutRecordsRequest()
      .withStreamName(stream)
      .withRecords(buffer: _*)

    client.putRecords(recordRequest)
    buffer.clear()
    bufferSize = 0
  }

  /** Create a Kinesis client. */
  private def createClient: AmazonKinesis = {
    val cli = if (awsAccessKey.isEmpty || awsSecretKey.isEmpty) {
      AmazonKinesisClientBuilder.standard()
        .withRegion(region)
        .build()
    } else {
      AmazonKinesisClientBuilder.standard()
        .withRegion(region)
        .withCredentials(new AWSStaticCredentialsProvider(new BasicAWSCredentials(awsAccessKey.get, awsSecretKey.get)))
        .build()
    }
    cli
  }
}

I then create an instance of the KinesisSink class using

val kinesisSink = new KinesisSink("us-east-1", "MyStream", Option("xxx..."), Option("xxx..."))

Finally, I create a stream using this sink. This KinesisSink takes a two-column Dataset, with the columns being the partitionKey and the data respectively.

case class MyData(partitionKey: String, data: Array[Byte])

val newsDataDF = kinesisDF
   .selectExpr("apinewsseqid", "fullcontent").as[MyData]
   .writeStream
   .outputMode("append")
   .foreach(kinesisSink)
   .start

but I am still getting the following error:

error: type mismatch;
found   : KinesisSink
required: org.apache.spark.sql.ForeachWriter[MyData]
   .foreach(kinesisSink)

You need to change the signature of the KinesisSink.process method: it should take your custom MyData object and then extract partitionKey and data from there.
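One way to apply this without copying the whole class is a small delegating wrapper around the tuple-based sink; a minimal sketch (MyDataKinesisSink is a hypothetical name, not part of the original post):

// Hypothetical adapter: exposes the sample's tuple-based KinesisSink as a
// ForeachWriter[MyData] by extracting partitionKey and data in process().
class MyDataKinesisSink(underlying: KinesisSink) extends ForeachWriter[MyData] {

  override def open(partitionId: Long, version: Long): Boolean =
    underlying.open(partitionId, version)

  // Pull the fields out of MyData and forward them as the (String, Array[Byte]) tuple
  override def process(value: MyData): Unit =
    underlying.process((value.partitionKey, value.data))

  override def close(errorOrNull: Throwable): Unit =
    underlying.close(errorOrNull)
}

With that in place, the original pipeline can keep its .as[MyData] cast and pass new MyDataKinesisSink(kinesisSink) to foreach.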

I used the exact same KinesisSink provided by Databricks and got it working by creating the dataset using

val dataset = df.selectExpr("CAST(rand() AS STRING) as partitionKey","message_bytes").as[(String, Array[Byte])]

And used the dataset to write to the Kinesis stream:

val query = dataset
.writeStream
.foreach(kinesisSink)
.start()
.awaitTermination()

When I used dataset.selectExpr("partitionKey","message_bytes") I got the same type mismatch error:

error: type mismatch;
found: KinesisSink
required: org.apache.spark.sql.ForeachWriter[(String, Array[Byte])]
.foreach(kinesisSink)

selectExpr is not needed in this case since the dataset drives the data type of the ForeachWriter.
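Applied to the pipeline from the question, that means doing the cast to the tuple type once and passing the typed dataset straight to writeStream. A minimal sketch, assuming fullcontent is already a binary column (the column names are taken from the question):

// Sketch: produce a Dataset[(String, Array[Byte])] so it matches the
// ForeachWriter[(String, Array[Byte])] implemented by KinesisSink.
val tupleDS = kinesisDF
  .selectExpr("CAST(apinewsseqid AS STRING) AS partitionKey", "fullcontent AS data")
  .as[(String, Array[Byte])]

val query = tupleDS
  .writeStream
  .outputMode("append")
  .foreach(kinesisSink)  // no further selectExpr; the dataset already has the sink's type
  .start()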
