
Spark - Prediction.io - scala.MatchError: null

I'm working on a template for prediction.io and I'm running into trouble with Spark.

I keep getting a scala.MatchError (full gist here):

scala.MatchError: null
at org.apache.spark.rdd.PairRDDFunctions.lookup(PairRDDFunctions.scala:831)
at org.apache.spark.mllib.recommendation.MatrixFactorizationModel.predict(MatrixFactorizationModel.scala:66)
at org.template.prediction.ALSAlgorithm$$anonfun$predict$1$$anonfun$apply$1.apply(ALSAlgorithm.scala:86)
at org.template.prediction.ALSAlgorithm$$anonfun$predict$1$$anonfun$apply$1.apply(ALSAlgorithm.scala:79)
at scala.Option.map(Option.scala:145)
at org.template.prediction.ALSAlgorithm$$anonfun$predict$1.apply(ALSAlgorithm.scala:79)
at org.template.prediction.ALSAlgorithm$$anonfun$predict$1.apply(ALSAlgorithm.scala:78)

The code (GitHub source here):

val usersWithCounts =
  ratingsRDD
    .map(r => (r.user, (1, Seq[Rating](Rating(r.user, r.item, r.rating)))))
    .reduceByKey((v1, v2) => (v1._1 + v2._1, v1._2.union(v2._2)))
    .filter(_._2._1 >= evalK)

// create evalK folds of ratings
(0 until evalK).map { idx =>
  // start by getting this fold's ratings for each user
  val fold = usersWithCounts
    .map { userKV =>
      val userRatings = userKV._2._2.zipWithIndex
      val trainingRatings = userRatings.filter(_._2 % evalK != idx).map(_._1)
      val testingRatings = userRatings.filter(_._2 % evalK == idx).map(_._1)
      (trainingRatings, testingRatings) // split the user's ratings into a training set and a testing set
    }
    .reduce((l, r) => (l._1.union(r._1), l._2.union(r._2))) // merge all the testing and training sets into a single testing and training set

  val testingSet = fold._2.map {
    r => (new Query(r.user, r.item), new ActualResult(r.rating))
  }

  (
    new TrainingData(sc.parallelize(fold._1)),
    new EmptyEvaluationInfo(),
    sc.parallelize(testingSet)
  )

}

To do evaluation, I need to split the ratings into a training set and a testing set. To make sure every user appears in the training data, I group each user's ratings together, split them per user, and then merge the per-user splits back together.
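The per-user split described above can be illustrated without Spark. This is only a minimal sketch on plain Scala collections: the `Rating` case class is a stand-in for MLlib's, and `FoldSketch`, `splitUser`, and `folds` are hypothetical names that do not appear in the template.

```scala
object FoldSketch {
  // Stand-in for MLlib's Rating so the sketch runs without Spark (assumption).
  case class Rating(user: Int, item: Int, rating: Double)

  // For fold `idx` of `k`, every k-th rating of one user (by position) goes to
  // testing and the remaining ratings go to training.
  def splitUser(ratings: Seq[Rating], k: Int, idx: Int): (Seq[Rating], Seq[Rating]) = {
    val (testing, training) = ratings.zipWithIndex.partition { case (_, i) => i % k == idx }
    (training.map(_._1), testing.map(_._1))
  }

  // Build all k folds as (training, testing) pairs, dropping users that have
  // fewer than k ratings, as the filter on usersWithCounts does.
  def folds(ratings: Seq[Rating], k: Int): Seq[(Seq[Rating], Seq[Rating])] = {
    val eligible = ratings.groupBy(_.user).values.filter(_.size >= k).toSeq
    (0 until k).map { idx =>
      val perUser = eligible.map(splitUser(_, k, idx))
      (perUser.flatMap(_._1), perUser.flatMap(_._2))
    }
  }
}
```

With k of at least 2, each retained user keeps at least k - 1 ratings in training for every fold, so every user in a testing set also appears in the matching training set.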

Maybe there's a better way to do this?

The error means that the userFeatures of the MLlib MatrixFactorizationModel doesn't contain the user ID (for example, because the user is not in the training data). MLlib doesn't check for this after the lookup (it calls .head on the result): https://github.com/apache/spark/blob/v1.2.0/mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala#L66

To check whether this is the case, replace the default model.predict() call with a modified version that checks whether the user ID and item ID exist in the model. The current call

val itemScore = model.predict(userInt, itemInt)

( https://github.com/nickpoorman/template-scala-parallel-prediction/blob/master/src/main/scala/ALSAlgorithm.scala#L80 )

can be changed to use .headOption:

// Needs: import org.jblas.DoubleMatrix (jblas is what MLlib 1.2 uses internally)
val itemScore = model.userFeatures.lookup(userInt).headOption.map { userFeature =>
  model.productFeatures.lookup(itemInt).headOption.map { productFeature =>
    // both IDs are known to the model: score is the dot product of the two feature vectors
    val userVector = new DoubleMatrix(userFeature)
    val productVector = new DoubleMatrix(productFeature)
    userVector.dot(productVector)
  }.getOrElse {
    logger.info(s"No productFeature for item ${query.item}.")
    0.0 // return default score
  }
}.getOrElse {
  logger.info(s"No userFeature for user ${query.user}.")
  0.0 // return default score
}
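
The same fallback logic can be tried out without a Spark cluster. The following is only a sketch under stated assumptions: plain `Map`s stand in for the model's userFeatures and productFeatures RDDs, and `SafePredictSketch`, `dot`, and `predictOption` are hypothetical names, not part of MLlib or the template.

```scala
object SafePredictSketch {
  // Dot product of two feature vectors (assumed to have equal length).
  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  // Returns None when either the user or the item has no feature vector,
  // instead of failing the way an unchecked .head would.
  def predictOption(
      userFeatures: Map[Int, Array[Double]],
      productFeatures: Map[Int, Array[Double]],
      user: Int,
      item: Int): Option[Double] =
    for {
      u <- userFeatures.get(user)
      p <- productFeatures.get(item)
    } yield dot(u, p)
}
```

Returning an Option leaves the choice to the caller: `predictOption(...).getOrElse(0.0)` reproduces the default score used above, while an evaluation run might prefer to skip unknown users so that the default does not skew the metric.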
