
How can I use Spark to writeStream data from a Kafka topic into HDFS?

I have been trying to get this code to work for hours:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("Consumer")
  .getOrCreate()

spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", url)
  .option("subscribe", topic)
  .load()
  .select("value")
  .writeStream
  .format(fileFormat)
  .option("path", filePath)
  .option("checkpointLocation", "/tmp/checkpoint")
  .start()
  .awaitTermination()

It gives this exception:

Logical Plan: 
Project [value#8] 
+- StreamingExecutionRelation KafkaV2[Subscribe[MyTopic]], [key#7, value#8, topic#9, partition#10, offset#11L, timestamp#12, timestampType#13] 

at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:295) 
at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189) 
Caused by: java.lang.ClassCastException: org.apache.spark.sql.execution.streaming.SerializedOffset cannot be cast to org.apache.spark.sql.sources.v2.reader.streaming.Offset 
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1$$anonfun$apply$9.apply(MicroBatchExecution.scala:405) 
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1$$anonfun$apply$9.apply(MicroBatchExecution.scala:390) 
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) 
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) 
at scala.collection.Iterator$class.foreach(Iterator.scala:893) 
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) 
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) 
at org.apache.spark.sql.execution.streaming.StreamProgress.foreach(StreamProgress.scala:25) 
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) 
at org.apache.spark.sql.execution.streaming.StreamProgress.flatMap(StreamProgress.scala:25) 
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply(MicroBatchExecution.scala:390) 
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply(MicroBatchExecution.scala:390) 
at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271) 
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58) 
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:389) 
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:133) 
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121) 
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121) 
at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271) 
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58) 
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:121) 
at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56) 
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:117) 
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:279)

I don't understand what's going on; I am simply trying to write data from a Kafka topic into HDFS using Spark Structured Streaming. Why is this so hard, and how can I do it?

I got the batch version to work just fine:

spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", url)
  .option("subscribe", topic)
  .load()
  .selectExpr("CAST(value AS String)")
  .write
  .format(fileFormat)
  .save(filePath)

@happy You are encountering a known bug in Structured Streaming: https://issues.apache.org/jira/browse/SPARK-25257

It happens because the offset read back from the checkpoint on disk is never deserialized, and the fix will be merged in an upcoming release.
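For reference, here is a minimal end-to-end sketch of the pipeline as it is normally written, assuming a Spark version that is not affected by this bug (the comment below reports 2.3.2 working). The broker address, output path, and checkpoint location are placeholders, not values from the original post; only the topic name is taken from the logical plan above.

import org.apache.spark.sql.SparkSession

object KafkaToHdfs {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("Consumer")
      .getOrCreate()

    val query = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")          // placeholder broker list
      .option("subscribe", "MyTopic")                              // topic name from the logical plan above
      .option("startingOffsets", "latest")                         // or "earliest" to replay the topic
      .load()
      // Kafka delivers key/value as binary; cast to STRING if the sink format expects text
      .selectExpr("CAST(value AS STRING) AS value")
      .writeStream
      .format("parquet")                                           // any file sink format is wired up the same way
      .option("path", "hdfs:///data/mytopic")                      // placeholder HDFS output directory
      .option("checkpointLocation", "hdfs:///checkpoints/mytopic") // placeholder checkpoint directory
      .start()

    query.awaitTermination()
  }
}

If you must stay on an affected version, starting the query with a fresh checkpoint directory avoids the failing offset deserialization on restart, at the cost of losing the stored offset progress.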

After I changed the Spark version to 2.3.2, everything started working.
