Read from HDFS and write to Kafka using Scala Spark, but get NullPointerException
I want to use a Spark RDD to read a text file from HDFS and write each line to Kafka via foreach. The code is as follows:
def main(args: Array[String]): Unit = {
  val kafkaItemProducerConfig = {
    val p = new Properties()
    p.put("bootstrap.servers", KAFKA_SERVER)
    p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    p.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    p.put("client.id", KAFKA_ITEM_CLIENT_ID)
    p.put("acks", "all")
    p
  }
  val conf: SparkConf = new SparkConf().setAppName("esc")
  val sc: SparkContext = new SparkContext(conf)

  kafkaItemProducer = new KafkaProducer[String, String](kafkaItemProducerConfig)
  if (kafkaItemProducer == null) {
    println("kafka config is error")
    sys.exit(1)
  }

  val dataToDmp = sc.textFile("/home/********/" + args(0) + "/part*")
  dataToDmp.foreach(x => {
    if (x != null && !x.isEmpty) {
      kafkaItemProducer.send(new ProducerRecord[String, String](KAFKA_TOPIC_ITEM, x.toString))
    }
  })

  kafkaItemProducer.close()
}
I am quite sure that KAFKA_SERVER, KAFKA_ITEM_CLIENT_ID, and KAFKA_TOPIC_ITEM are correct, but it fails with:
ERROR ApplicationMaster [Driver]: User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 12 in stage 0.0 failed 4 times, most recent failure: Lost task 12.3 in stage 0.0 (TID 18, tjtx148-5-173.58os.org, executor 1): java.lang.NullPointerException
at esc.HdfsWriteToKafks$$anonfun$main$1.apply(HdfsWriteToKafks.scala:56)
at esc.HdfsWriteToKafks$$anonfun$main$1.apply(HdfsWriteToKafks.scala:53)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:918)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:918)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1954)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1954)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:88)
at org.apache.spark.scheduler.Task.run(Task.scala:100)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:341)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
It says

at esc.HdfsWriteToKafks$$anonfun$main$1.apply(HdfsWriteToKafks.scala:56)
at esc.HdfsWriteToKafks$$anonfun$main$1.apply(HdfsWriteToKafks.scala:53)

so the error is at line 56, which is

kafkaItemProducer.send(new ProducerRecord[String, String](KAFKA_TOPIC_ITEM, x.toString))

and line 53 is

dataToDmp.foreach(

I checked that dataToDmp has content with dataToDmp.take(100).foreach(println(_)), and the data is correct.
Is there an error in my code?
It works now. I switched from foreach to foreachPartition and create a producer inside each partition, so the producer is instantiated on the executors; the original one was only created on the driver, so the reference inside the foreach closure was null on the executors. The code is as follows:
def main(args: Array[String]): Unit = {
  val KAFKA_ITEM_CLIENT_ID = "a"
  val KAFKA_TOPIC_ITEM = "b"
  val KAFKA_SERVER = "host:port"

  val kafkaItemProducerConfig = {
    val p = new Properties()
    p.put("bootstrap.servers", KAFKA_SERVER)
    p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    p.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    p.put("client.id", KAFKA_ITEM_CLIENT_ID)
    p.put("acks", "all")
    p
  }
  val conf: SparkConf = new SparkConf().setAppName("esc")
  val sc: SparkContext = new SparkContext(conf)

  val dataToDmp = sc.textFile("/home/****/" + args(0) + "/part*", 5)
  dataToDmp.foreachPartition(partition => {
    // create the producer inside the partition, so it lives on the executor
    val kafkaItemProducer = new KafkaProducer[String, String](kafkaItemProducerConfig)
    partition.foreach(x => {
      kafkaItemProducer.send(new ProducerRecord[String, String](KAFKA_TOPIC_ITEM, x.toString))
      Thread.sleep(100)
    })
    kafkaItemProducer.close()
  })
}
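
(A possible variant, not part of the original answer.) Creating a producer per partition works, but with many partitions it opens and closes a connection for every task. A common alternative is to keep one lazily created producer per executor JVM and reuse it across partitions. Below is a minimal sketch under the same placeholder settings as the post (broker "host:port", topic "b", the masked input path); the object and class names ExecutorKafkaProducer and HdfsToKafka are made up for illustration:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.{SparkConf, SparkContext}

// One producer per executor JVM, built lazily on first use and closed on JVM shutdown.
object ExecutorKafkaProducer {
  lazy val producer: KafkaProducer[String, String] = {
    val p = new Properties()
    p.put("bootstrap.servers", "host:port") // placeholder broker, as in the post
    p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    p.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    p.put("acks", "all")
    val kp = new KafkaProducer[String, String](p)
    sys.addShutdownHook(kp.close()) // close when the executor JVM exits
    kp
  }
}

object HdfsToKafka {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("esc"))
    sc.textFile("/home/****/" + args(0) + "/part*", 5)
      .filter(line => line != null && line.nonEmpty)
      .foreachPartition { partition =>
        partition.foreach { line =>
          // "b" is the placeholder topic from the post
          ExecutorKafkaProducer.producer.send(new ProducerRecord[String, String]("b", line))
        }
        ExecutorKafkaProducer.producer.flush() // push buffered records before the task ends
      }
    sc.stop()
  }
}

Here nothing is serialized from the driver: Scala objects are initialized separately in each executor JVM, so the producer is built on first use on the executor, and flush() at the end of each partition ensures buffered records are delivered before the task finishes.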