
Unable to parallelize a list in Scala

I am unable to parallelize a list in Scala; I get a java.lang.NullPointerException.

    messages.foreachRDD( rdd => {
      for (avroLine <- rdd) {
        val record = Injection.injection.invert(avroLine.getBytes).get
        val field1Value = record.get("username")
        val jsonStrings = Seq(record.toString())
        val newRow = sqlContext.sparkContext.parallelize(Seq(record.toString()))
      }
    })

Output

    jsonStrings...List({"username": "user_118", "tweet": "tweet_218", "timestamp": 18})

Exception

    Caused by: java.lang.NullPointerException
      at com.capitalone.AvroConsumer$$anonfun$main$1$$anonfun$apply$1.apply(AvroConsumer.scala:83)
      at com.capitalone.AvroConsumer$$anonfun$main$1$$anonfun$apply$1.apply(AvroConsumer.scala:74)
      at scala.collection.Iterator$class.foreach(Iterator.scala:893)
      at org.apache.spark.util.CompletionIterator.foreach(CompletionIterator.scala:26)
      at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:917)
      at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:917)
      at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1944)
      at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1944)
      at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
      at org.apache.spark.scheduler.Task.run(Task.scala:99)
      at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)

Thanks in advance!

You're trying to create an RDD in the Spark worker context. While foreachRDD runs on the driver, the foreach you perform on each RDD is distributed to the workers. It seems unlikely that you actually want to create a new RDD for each line of the input stream.
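To make that concrete, here is a minimal annotated sketch (reusing the names from your snippet; the comments are explanatory, not part of any API) showing where each closure runs:

    messages.foreachRDD { rdd =>
      // This outer closure runs on the driver, where sqlContext is a valid reference.
      rdd.foreach { avroLine =>
        // This inner closure is serialized and shipped to the executors.
        // SparkContext/SQLContext live only on the driver and cannot be
        // serialized, so the reference seen here is null; calling
        // sqlContext.sparkContext.parallelize(...) from this scope is what
        // throws the NullPointerException.
      }
    }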

Update after comments:

It's hard to have this discussion in a comment thread, where there is no formatting for code. My basic question is: why aren't you doing something like this?

    val messages: ReceiverInputDStream[String] = RabbitMQUtils.createStream(ssc, rabbitParams)
    def toJsonString(message: String): String = SparkUtils.getRecordInjection(QUEUE_NAME).invert(message.getBytes()).get.toString
    val jsonStrings: DStream[String] = messages map toJsonString

I haven't bothered to figure out and track down all the libraries you're using (please, next time, submit an MCVE), so I haven't tried to compile that. But it looks like all you want is to map each input message to a JSON string. Maybe you want to do something fancy with the resulting DStream of strings, but that might be a different question.
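For instance (these are standard DStream operations, shown only as a sketch of what that "something fancy" might look like; the output path is illustrative):

    // Print the first records of every batch to the driver's console:
    jsonStrings.print()

    // Or persist each batch as text files under a prefix:
    jsonStrings.saveAsTextFiles("/tmp/kafka-poc/json-out")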

A fuller version along the same lines:

    def toJsonString(message: String): String =
      SparkUtils.getRecordInjection(QUEUE_NAME).invert(message.getBytes()).get.toString

    dStreams.foreachRDD { rdd =>
      val jsonStrings = rdd.map(stream => toJsonString(stream))
      val df = sqlContext.read.json(jsonStrings)
      df.write.mode("append").csv("/Users/Documents/kafka-poc/consumer-out/def/")
    }
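One detail the snippet leaves implicit: a Spark Streaming job only begins processing once the streaming context is started, so (if it is not already elsewhere in your program) the wiring above needs to be followed by the standard calls:

    ssc.start()
    ssc.awaitTermination()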
