
How can I use spark to writeStream data from a kafka topic into hdfs?

I have been trying to get this code to work for hours:

val spark = SparkSession.builder()
  .appName("Consumer")
  .getOrCreate()

spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", url)
  .option("subscribe", topic)
  .load()
  .select("value")
  .writeStream
  .format(fileFormat)
  .option("path", filePath)
  .option("checkpointLocation", "/tmp/checkpoint")
  .start()
  .awaitTermination()

It gives this exception:

Logical Plan: 
Project [value#8] 
+- StreamingExecutionRelation KafkaV2[Subscribe[MyTopic]], [key#7, value#8, topic#9, partition#10, offset#11L, timestamp#12, timestampType#13] 

at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:295) 
at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189) 
Caused by: java.lang.ClassCastException: org.apache.spark.sql.execution.streaming.SerializedOffset cannot be cast to org.apache.spark.sql.sources.v2.reader.streaming.Offset 
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1$$anonfun$apply$9.apply(MicroBatchExecution.scala:405) 
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1$$anonfun$apply$9.apply(MicroBatchExecution.scala:390) 
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) 
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) 
at scala.collection.Iterator$class.foreach(Iterator.scala:893) 
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) 
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) 
at org.apache.spark.sql.execution.streaming.StreamProgress.foreach(StreamProgress.scala:25) 
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) 
at org.apache.spark.sql.execution.streaming.StreamProgress.flatMap(StreamProgress.scala:25) 
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply(MicroBatchExecution.scala:390) 
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply(MicroBatchExecution.scala:390) 
at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271) 
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58) 
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:389) 
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:133) 
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121) 
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121) 
at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271) 
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58) 
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:121) 
at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56) 
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:117) 
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:279)

I don't understand what's going on; I am simply trying to write data from a Kafka topic into HDFS using Spark Structured Streaming. Why is this so hard, and how can I do it?

I got the batch version to work just fine:

spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", url)
  .option("subscribe", topic)
  .load()
  .selectExpr("CAST(value AS String)")
  .write
  .format(fileFormat)
  .save(filePath)

@happy You are encountering a known bug in Structured Streaming: https://issues.apache.org/jira/browse/SPARK-25257

This is because the offset read back from disk is never deserialized, and the fix will be merged in an upcoming release.
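
Until that fix is available, a common workaround is to start the query with a fresh (empty) checkpoint directory, so there is no serialized offset on disk to restore, or to run a Spark version that is not affected. Below is a minimal sketch of such a streaming query: it reuses the url, topic, fileFormat and filePath values from the question, borrows the CAST(value AS STRING) from the batch version that already works, and the new checkpoint path is only an illustrative placeholder.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("Consumer")
  .getOrCreate()

spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", url)
  .option("subscribe", topic)
  .load()
  // Cast the binary Kafka payload to text, as in the batch version that already works
  .selectExpr("CAST(value AS STRING)")
  .writeStream
  .format(fileFormat)
  .option("path", filePath)
  // A fresh, empty checkpoint location means there is no old offset to restore,
  // which sidesteps the ClassCastException; the path itself is a placeholder
  .option("checkpointLocation", "/tmp/checkpoint-new")
  .start()
  .awaitTermination()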

Everything started working fine after I changed the Spark version to 2.3.2.
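
In case it helps, pinning that version in build.sbt might look like the sketch below; the spark-sql and spark-sql-kafka-0-10 artifacts are the standard ones, while the Scala 2.11 binary version is an assumption and should match whatever your project is built against.

// build.sbt (sketch): pin Spark and the Kafka source to 2.3.2
// The Scala version below is assumed; use the one your project actually targets.
scalaVersion := "2.11.12"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql" % "2.3.2" % "provided",
  "org.apache.spark" %% "spark-sql-kafka-0-10" % "2.3.2"
)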
