"java.lang.NullPointerException" in clustering using Spark

Question

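I am trying to understand K-means clustering on an input .csv file that consists of 56376 rows and two columns: the first column is the id and the second is a group of words. A sample of this data is: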

1428951621 do rememb came milan 19 april 2013 monday monday 15

1429163429 rt windeerlust sehun hyungluhan yessehun do even rememb day day

The Scala code that processes this data is shown below:

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.feature.{Normalizer, Word2Vec}
    import org.apache.spark.mllib.linalg.{DenseVector, Vector, Vectors}
    import org.apache.spark.rdd.RDD

    // `sc` is the SparkContext (available by default in spark-shell)
    val inputData = sc.textFile("test.csv")

    // Tunable: number of clusters to use for k-means
    val numClusters = 4
    // Number of iterations for k-means
    val numIterations = 10
    // Tunable: size of the vectors created by Word2Vec
    val vectorSize = 600

    // (key, tokens) pairs; note that the key is column 1 (the text), not column 0 (the id)
    val filtereddata = inputData.filter(!_.isEmpty).
      map(line => line.split(",", -1)).
      map(line => (line(1), line(1).split(" ").filter(_.nonEmpty)))

    // Training corpus for Word2Vec: one sequence of tokens per row
    val corpus = inputData.filter(!_.isEmpty).
      map(line => line.split(",", -1)).
      map(line => line(1).split(" ").toSeq)
    val values: RDD[Seq[String]] = filtereddata.map(s => s._2.toSeq)
    val keys = filtereddata.map(s => s._1)

    /******************* Word2Vec and normalisation *****************************/
    val w2vec = new Word2Vec().setVectorSize(vectorSize)
    val model = w2vec.fit(corpus)
    // Look up the vector of every word; words the model does not know become zero vectors
    val outtest: RDD[Seq[Vector]] = values.map(x => x.map(m =>
      try {
        model.transform(m)
      } catch {
        case e: Exception => Vectors.zeros(vectorSize)
      }))
    val convertest = outtest.map(m => m.map(x => x.toArray))

    // Drop rows that have no words at all
    val withkey = keys.zip(convertest)
    val filterkey = withkey.filter(!_._2.isEmpty)

    val keysfinal = filterkey.map(x => x._1)
    val valfinal = filterkey.map(x => x._2)
    // For each collection of vectors (one tweet), add the vectors component-wise
    val reducetest = valfinal.map(x => x.reduce((a, b) => a.zip(b).map(t => t._1 + t._2)))
    // Divide every component by x.length; x is the summed vector here,
    // so x.length equals vectorSize, not the number of words in the tweet
    val filtertest = reducetest.map(x => x.map(_ / x.length))
    val test = filtertest.map(x => new DenseVector(x).asInstanceOf[Vector])
    val normalizer = new Normalizer()
    val data1 = test.map(x => normalizer.transform(x))

    /********************* Clustering algorithm ***********************************/
    val clusters = KMeans.train(data1, numClusters, numIterations)
    val predictions = clusters.predict(data1)
    val clustercount = keysfinal.zip(predictions).distinct.map(s => (s._2, 1)).reduceByKey(_ + _)
    val result = keysfinal.zip(predictions).distinct
    // fileToSaveResults (the output path) is defined elsewhere and is not shown in the question
    result.saveAsTextFile(fileToSaveResults)
    val wsse = clusters.computeCost(data1)
    println(s"The number of clusters is $numClusters")
    println("The cluster counts are:")
    println(clustercount.collect().mkString(" "))
    println(s"The wsse is: $wsse")

However, after some iterations it throws a "java.lang.NullPointerException" and exits at stage 36. The error is shown below:

17/10/07 14:42:10 INFO TaskSchedulerImpl: Adding task set 26.0 with 2 tasks
17/10/07 14:42:10 INFO TaskSetManager: Starting task 0.0 in stage 26.0 (TID 50, localhost, partition 0, ANY, 5149 bytes)
17/10/07 14:42:10 INFO TaskSetManager: Starting task 1.0 in stage 26.0 (TID 51, localhost, partition 1, ANY, 5149 bytes)
17/10/07 14:42:10 INFO Executor: Running task 1.0 in stage 26.0 (TID 51)
17/10/07 14:42:10 INFO Executor: Running task 0.0 in stage 26.0 (TID 50)
17/10/07 14:42:10 INFO deprecation: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
17/10/07 14:42:10 INFO deprecation: mapred.output.key.class is deprecated. Instead, use mapreduce.job.output.key.class
17/10/07 14:42:10 INFO deprecation: mapred.output.value.class is deprecated. Instead, use mapreduce.job.output.value.class
17/10/07 14:42:10 INFO deprecation: mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir
17/10/07 14:42:10 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
17/10/07 14:42:10 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
17/10/07 14:42:10 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
17/10/07 14:42:10 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
17/10/07 14:42:10 ERROR Executor: Exception in task 0.0 in stage 26.0 (TID 50)
java.lang.NullPointerException
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:1012)
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:404)

I cannot make sense of this. Please help me localize the problem in this code. Note: this code was written by someone else.

Answer 1

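I don't think this is related to your code. This exception is thrown when one of the arguments passed to ProcessBuilder is null. So I guess it is either a configuration problem or a bug in Hadoop.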
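To illustrate the first point, here is a minimal sketch that is independent of Spark and Hadoop. It only shows that `ProcessBuilder.start()` rejects a command list containing a null element with a `NullPointerException` before any process is launched. The object name `NullArgDemo`, the helper `throwsNpe`, and the `echo` command are made up for the illustration and do not come from the question.

```scala
import java.util.{Arrays => JArrays, List => JList}

object NullArgDemo {
  // Returns true if ProcessBuilder.start() rejects the command with a
  // NullPointerException, which it does when any argument is null.
  // Other failures (for example an IOException for a missing program)
  // are deliberately not caught here.
  def throwsNpe(command: JList[String]): Boolean =
    try {
      new ProcessBuilder(command).start().destroy()
      false
    } catch {
      case _: NullPointerException => true
    }

  def main(args: Array[String]): Unit = {
    // The second argument is deliberately null, standing in for a value
    // (such as a path) that a misconfigured caller failed to resolve.
    val command = JArrays.asList[String]("echo", null)
    println(s"NPE thrown: ${throwsNpe(command)}")
  }
}
```

In the stack trace from the question, the caller of `ProcessBuilder.start` is `org.apache.hadoop.util.Shell.runCommand`, so the null argument would have been put together inside Hadoop rather than in the clustering code.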

A quick search for "hadoop java.lang.ProcessBuilder.start nullpointerexception" suggests that this is a known problem:

https://www.fachschaft.informatik.tu-darmstadt.de/forum/viewtopic.php?t=34250

Is it possible to run Hadoop jobs (like the WordCount sample) in local mode on Windows without Cygwin?
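If the job is in fact running on Windows (the question does not say, so this is only an assumption), a commonly reported cause is that Hadoop cannot locate `winutils.exe`. Hadoop looks for it under `bin` in the directory given by the `hadoop.home.dir` system property or, failing that, the `HADOOP_HOME` environment variable. The sketch below is a rough diagnostic along those lines. The names `WinutilsCheck`, `hadoopHome`, and `winutilsPath` are hypothetical helpers written for this answer, not part of Hadoop or Spark.

```scala
import java.io.File

object WinutilsCheck {
  // Pick the Hadoop home: the hadoop.home.dir system property first,
  // then the HADOOP_HOME environment variable. Empty values are ignored.
  def hadoopHome(props: Map[String, String], env: Map[String, String]): Option[String] =
    props.get("hadoop.home.dir").filter(_.nonEmpty)
      .orElse(env.get("HADOOP_HOME").filter(_.nonEmpty))

  // Location where winutils.exe is expected under a given Hadoop home.
  def winutilsPath(home: String): File =
    new File(new File(home, "bin"), "winutils.exe")

  def main(args: Array[String]): Unit = {
    val isWindows = sys.props.getOrElse("os.name", "").toLowerCase.startsWith("windows")
    hadoopHome(sys.props.toMap, sys.env) match {
      case None =>
        println(s"No Hadoop home configured (Windows: $isWindows)")
      case Some(home) =>
        val exe = winutilsPath(home)
        println(s"Hadoop home: $home, winutils.exe present: ${exe.isFile} (Windows: $isWindows)")
    }
  }
}
```

Running this on the machine that executes the job, before the job itself, shows whether either setting is present and whether the executable exists at the expected path. On a non-Windows machine a missing `winutils.exe` is expected and the cause would have to be sought elsewhere.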
