scala.MatchError: null on Spark RDDs

I am relatively new to both Spark and Scala. I am trying to implement collaborative filtering in Scala on Spark. Below is the code:

import org.apache.spark.mllib.recommendation.ALS
import org.apache.spark.mllib.recommendation.Rating

val data = sc.textFile("/user/amohammed/CB/input-cb.txt")

val distinctUsers = data.map(x => x.split(",")(0)).distinct().map(x => x.toInt)

val distinctKeywords = data.map(x => x.split(",")(1)).distinct().map(x => x.toInt)

val ratings = data.map(_.split(',') match {
  case Array(user, item, rate) => Rating(user.toInt, item.toInt, rate.toDouble)
})

val model = ALS.train(ratings, 1, 20, 0.01)

val keywords = distinctKeywords.collect()
distinctUsers.map(x => (x, keywords.map(y => model.predict(x, y)))).collect()

It throws a scala.MatchError: null at org.apache.spark.rdd.PairRDDFunctions.lookup(PairRDDFunctions.scala:571) on the last line. The code works fine if I collect the distinctUsers RDD into an array and execute the same code:

val users = distinctUsers.collect()
users.map(x => (x, keywords.map(y => model.predict(x, y))))

Where am I going wrong when dealing with RDDs?

Spark version: 1.0.0, Scala version: 2.10.4

Going one call further back in the stack trace, line 43 of the MatrixFactorizationModel source says:

val userVector = new DoubleMatrix(userFeatures.lookup(user).head)

Note that the userFeatures field of model is itself another RDD; I believe it isn't getting serialized properly when the anonymous function block closes over model, and thus the lookup method on it fails. I also tried placing both model and keywords into broadcast variables, but that didn't work either.
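To make the underlying rule concrete, here is a minimal, hypothetical illustration (a and b are made-up RDDs, runnable in the spark-shell): RDD operations can only be launched from the driver, so referencing one RDD inside another RDD's closure fails at runtime. Newer Spark versions report this with a clearer exception; in 1.0.x the deserialized RDD's driver-side internals are null, which is consistent with the MatchError: null seen here.

val a = sc.parallelize(1 to 3)
val b = sc.parallelize(4 to 6)

// Fails: b.count() is an RDD action executed inside a task running on an executor
a.map(x => b.count() + x).collect()

// Works: collect b to the driver first, then close over the plain local array
val bLocal = b.collect()
a.map(x => bLocal.length + x).collect()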

Instead of falling back to Scala collections and losing the benefits of Spark, it's probably better to stick with RDDs and take advantage of other ways of transforming them.

I'd start with this:

val ratings = data.map(_.split(',') match {
  case Array(user, keyword, rate) => Rating(user.toInt, keyword.toInt, rate.toDouble)
})

// instead of parsing the original RDD's strings three separate times,
// you can map the "user" and "product" fields of the Rating case class

val distinctUsers = ratings.map(_.user).distinct()
val distinctKeywords = ratings.map(_.product).distinct()

val model = ALS.train(ratings, 1, 20, 0.01)

Then, instead of calculating each prediction one by one, we can obtain the Cartesian product of all possible user-keyword pairs as an RDD and use the other predict method in MatrixFactorizationModel, which takes an RDD of such pairs as its argument.

val userKeywords = distinctUsers.cartesian(distinctKeywords)

val predictions = model.predict(userKeywords).map { case Rating(user, keyword, rate) =>
  (user, Map(keyword -> rate))
}.reduceByKey { _ ++ _ }
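To make the reduceByKey step concrete: ++ on immutable maps merges their entries, so each user's singleton maps collapse into one map per user. A toy run with made-up ratings:

val pairs = sc.parallelize(Seq(
  (1, Map(10 -> 0.9)),
  (1, Map(11 -> 0.4)),
  (2, Map(10 -> 0.7))
))
pairs.reduceByKey(_ ++ _).collect()
// e.g. Array((1, Map(10 -> 0.9, 11 -> 0.4)), (2, Map(10 -> 0.7))) -- order may vary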

Now predictions has an immutable map for each user that can be queried for the predicted rating of a particular keyword. If you specifically want arrays as in your original example, you can do:

val keywords = distinctKeywords.collect() // add .sorted if you want them in order
val predictionArrays = predictions.mapValues(userMap => keywords.map(userMap)) // a Map[Int, Double] is also an Int => Double function
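As a usage sketch (user ID 7 and keyword 42 are made-up values), both structures can then be queried from the driver:

// Predicted rating of keyword 42 for user 7, via the per-user map
val byMap: Option[Double] =
  predictions.lookup(7).headOption.flatMap(_.get(42))

// Via the array form: positions line up with the keywords array
val byArray: Option[Double] =
  predictionArrays.lookup(7).headOption.map(_(keywords.indexOf(42)))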

Caveat: I tested this with Spark 1.0.1 as that's what I had installed, but it should work with 1.0.0 as well.
