How to create a KinesisSink for Spark Structured Streaming
I am using Spark 2.2 on Databricks and trying to implement a Kinesis sink to write from Spark to a Kinesis stream.
I am using the example provided here: https://docs.databricks.com/_static/notebooks/structured-streaming-kinesis-sink.html
import java.nio.ByteBuffer
import scala.collection.mutable.ArrayBuffer
import com.amazonaws.auth.{AWSStaticCredentialsProvider, BasicAWSCredentials}
import com.amazonaws.services.kinesis.{AmazonKinesis, AmazonKinesisClientBuilder}
import com.amazonaws.services.kinesis.model.{PutRecordsRequest, PutRecordsRequestEntry}
import org.apache.spark.sql.ForeachWriter

/**
 * A simple Sink that writes to the given Amazon Kinesis `stream` in the given `region`.
 * For authentication, users may provide `awsAccessKey` and `awsSecretKey`, or use IAM
 * Roles when launching their cluster.
 *
 * This Sink takes a two-column Dataset, with the columns being the `partitionKey` and
 * the `data` respectively. We buffer data up to `maxBufferSize` before flushing to
 * Kinesis in order to reduce cost.
 */
class KinesisSink(
    stream: String,
    region: String,
    awsAccessKey: Option[String] = None,
    awsSecretKey: Option[String] = None) extends ForeachWriter[(String, Array[Byte])] {

  // Configurations
  private val maxBufferSize = 500 * 1024 // 500 KB
  private var client: AmazonKinesis = _
  private val buffer = new ArrayBuffer[PutRecordsRequestEntry]()
  private var bufferSize: Long = 0L

  override def open(partitionId: Long, version: Long): Boolean = {
    client = createClient
    true
  }

  override def process(value: (String, Array[Byte])): Unit = {
    val (partitionKey, data) = value
    // A maximum of 500 records can be sent in a single `putRecords` request
    if ((data.length + bufferSize > maxBufferSize && buffer.nonEmpty) || buffer.length == 500) {
      flush()
    }
    buffer += new PutRecordsRequestEntry().withPartitionKey(partitionKey).withData(ByteBuffer.wrap(data))
    bufferSize += data.length
  }

  override def close(errorOrNull: Throwable): Unit = {
    if (buffer.nonEmpty) {
      flush()
    }
    client.shutdown()
  }

  /** Flush the buffer to Kinesis */
  private def flush(): Unit = {
    val recordRequest = new PutRecordsRequest()
      .withStreamName(stream)
      .withRecords(buffer: _*)
    client.putRecords(recordRequest)
    buffer.clear()
    bufferSize = 0
  }

  /** Create a Kinesis client. */
  private def createClient: AmazonKinesis = {
    if (awsAccessKey.isEmpty || awsSecretKey.isEmpty) {
      AmazonKinesisClientBuilder.standard()
        .withRegion(region)
        .build()
    } else {
      AmazonKinesisClientBuilder.standard()
        .withRegion(region)
        .withCredentials(new AWSStaticCredentialsProvider(new BasicAWSCredentials(awsAccessKey.get, awsSecretKey.get)))
        .build()
    }
  }
}
Then I instantiate the KinesisSink class (note the constructor takes the stream name first, then the region):

val kinesisSink = new KinesisSink("MyStream", "us-east-1", Option("xxx..."), Option("xxx..."))
Finally I create a stream using this sink. This KinesisSink takes a two-column Dataset, with the columns being `partitionKey` and `data` respectively.
case class MyData(partitionKey: String, data: Array[Byte])
val newsDataDF = kinesisDF
.selectExpr("apinewsseqid", "fullcontent").as[MyData]
.writeStream
.outputMode("append")
.foreach(kinesisSink)
.start
But I still get the following error:
error: type mismatch;
found : KinesisSink
required: org.apache.spark.sql.ForeachWriter[MyData]
.foreach(kinesisSink)
You need to change the signature of the `KinesisSink.process` method: it should take your custom `MyData` object and extract `partitionKey` and `data` from there.
I used the exact same KinesisSink provided by Databricks and got it working by creating the dataset with

val dataset = df.selectExpr("CAST(rand() AS STRING) as partitionKey","message_bytes").as[(String, Array[Byte])]

and writing to the Kinesis stream with
val query = dataset
.writeStream
.foreach(kinesisSink)
.start()
.awaitTermination()
When I used `dataset.selectExpr("partitionKey","message_bytes")` I got the same type mismatch error:

error: type mismatch;
found : KinesisSink
required: org.apache.spark.sql.ForeachWriter[(String, Array[Byte])]
.foreach(kinesisSink)
`selectExpr` is not needed in this case, since the Dataset's element type is what drives the type the ForeachWriter must have.
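Putting the answers together, one fix that keeps the original sink unchanged is to cast the two columns to the `(String, Array[Byte])` tuple type the sink expects before writing. This is a sketch assuming the column names from the question (`apinewsseqid`, `fullcontent`) and that `fullcontent` is already binary:

```scala
// Cast the two columns into the tuple shape that
// ForeachWriter[(String, Array[Byte])] expects; the Dataset's element
// type then matches the sink, so .foreach(kinesisSink) compiles.
val tupleDS = kinesisDF
  .selectExpr("CAST(apinewsseqid AS STRING) AS partitionKey", "fullcontent")
  .as[(String, Array[Byte])]

val query = tupleDS.writeStream
  .outputMode("append")
  .foreach(kinesisSink) // kinesisSink: ForeachWriter[(String, Array[Byte])]
  .start()
```

The key point in both working variants is the same: the type parameter of the ForeachWriter must equal the element type of the Dataset being written.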
Disclaimer: Technical posts on this site follow the CC BY-SA 4.0 license. If you repost, please credit this site or the original source. For any questions, contact: yoyou2525@163.com.