Spark - Prediction.io - scala.MatchError: null

I'm working on a template for prediction.io and I'm running into trouble with Spark.

I keep getting a scala.MatchError (the full trace is in the linked gist):

scala.MatchError: null
at org.apache.spark.rdd.PairRDDFunctions.lookup(PairRDDFunctions.scala:831)
at org.apache.spark.mllib.recommendation.MatrixFactorizationModel.predict(MatrixFactorizationModel.scala:66)
at org.template.prediction.ALSAlgorithm$$anonfun$predict$1$$anonfun$apply$1.apply(ALSAlgorithm.scala:86)
at org.template.prediction.ALSAlgorithm$$anonfun$predict$1$$anonfun$apply$1.apply(ALSAlgorithm.scala:79)
at scala.Option.map(Option.scala:145)
at org.template.prediction.ALSAlgorithm$$anonfun$predict$1.apply(ALSAlgorithm.scala:79)
at org.template.prediction.ALSAlgorithm$$anonfun$predict$1.apply(ALSAlgorithm.scala:78)

The code (full source on GitHub):

val usersWithCounts =
  ratingsRDD
    .map(r => (r.user, (1, Seq[Rating](Rating(r.user, r.item, r.rating)))))
    .reduceByKey((v1, v2) => (v1._1 + v2._1, v1._2.union(v2._2)))
    .filter(_._2._1 >= evalK)

// create evalK folds of ratings
(0 until evalK).map { idx =>
  // start by getting this fold's ratings for each user
  val fold = usersWithCounts
    .map { userKV =>
      val userRatings = userKV._2._2.zipWithIndex
      val trainingRatings = userRatings.filter(_._2 % evalK != idx).map(_._1)
      val testingRatings = userRatings.filter(_._2 % evalK == idx).map(_._1)
      (trainingRatings, testingRatings) // split the user's ratings into a training set and a testing set
    }
    .reduce((l, r) => (l._1.union(r._1), l._2.union(r._2))) // merge all the testing and training sets into a single testing and training set

  val testingSet = fold._2.map {
    r => (new Query(r.user, r.item), new ActualResult(r.rating))
  }

  (
    new TrainingData(sc.parallelize(fold._1)),
    new EmptyEvaluationInfo(),
    sc.parallelize(testingSet)
  )

}

To do the evaluation, I need to split the ratings into a training set and a testing set. To make sure each user is included in the training data, I group each user's ratings together, split them per user, and then merge the per-user splits.
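The per-user split described above can be illustrated without Spark. The following is a minimal sketch on plain Scala collections; the `Rating` case class and the `splitFold` helper are hypothetical stand-ins, not the template's code. It assumes `evalK >= 2`, in which case every user with at least `evalK` ratings appears in both the training and the testing set of every fold.

```scala
// Hypothetical stand-in for the MLlib Rating class, for illustration only.
case class Rating(user: Int, item: Int, rating: Double)

object FoldSplit {
  // Split ratings into (training, testing) for fold `idx` out of `evalK` folds.
  // Ratings are indexed per user, and users with fewer than evalK ratings
  // are dropped, mirroring the filter in the question's code.
  def splitFold(ratings: Seq[Rating], evalK: Int, idx: Int): (Seq[Rating], Seq[Rating]) = {
    val eligible = ratings.groupBy(_.user).values.filter(_.size >= evalK)
    val perUser = eligible.map { userRatings =>
      // Ratings whose per-user index falls in this fold go to testing.
      val (test, train) = userRatings.zipWithIndex.partition(_._2 % evalK == idx)
      (train.map(_._1), test.map(_._1))
    }
    (perUser.flatMap(_._1).toSeq, perUser.flatMap(_._2).toSeq)
  }
}
```

Calling `FoldSplit.splitFold(ratings, evalK, idx)` for each `idx` in `0 until evalK` yields one (training, testing) pair per fold.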

Maybe there's a better way to do this?

The error may mean that the userFeatures of the MLlib MatrixFactorizationModel don't contain the user id (say, if the user is not in the training data). MLlib doesn't check for this after the lookup (.head is used): https://github.com/apache/spark/blob/v1.2.0/mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala#L66

To check whether that's the case, you can implement a modified version of model.predict() that checks whether the userId/itemId exists in the model, instead of calling the default one:

val itemScore = model.predict(userInt, itemInt) 

(https://github.com/nickpoorman/template-scala-parallel-prediction/blob/master/src/main/scala/ALSAlgorithm.scala#L80)

Change it to use .headOption:

import org.jblas.DoubleMatrix

// Look up the feature vectors directly so a missing user or item yields a default score
val itemScore = model.userFeatures.lookup(userInt).headOption.map { userFeature =>
  model.productFeatures.lookup(itemInt).headOption.map { productFeature =>
    val userVector = new DoubleMatrix(userFeature)
    val productVector = new DoubleMatrix(productFeature)
    userVector.dot(productVector)
  }.getOrElse {
    logger.info(s"No itemFeature for item ${query.item}.")
    0.0 // return default score
  }
}.getOrElse {
  logger.info(s"No userFeature for user ${query.user}.")
  0.0 // return default score
}
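To see the effect of the .headOption change in isolation, here is a minimal sketch that replaces the model's feature RDDs with plain in-memory maps. The `lookup` and `safePredict` helpers and the default score of 0.0 are assumptions for illustration; they only mimic the shape of an RDD lookup (a Seq that is empty for an unknown key) and do not call Spark or jblas.

```scala
object SafePredict {
  type Features = Map[Int, Array[Double]]

  // Mimics the shape of an RDD lookup: an empty Seq for an unknown key.
  def lookup(features: Features, id: Int): Seq[Array[Double]] =
    features.get(id).toList

  // Dot product of two equally sized feature vectors.
  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  // Returns the predicted score, or `default` when the user or item is unknown.
  def safePredict(users: Features, items: Features, user: Int, item: Int,
                  default: Double = 0.0): Double = {
    val score = for {
      u <- lookup(users, user).headOption
      i <- lookup(items, item).headOption
    } yield dot(u, i)
    score.getOrElse(default)
  }
}
```

A default of 0.0 cannot be told apart from a genuine prediction of zero, so returning the Option itself may be preferable if the caller needs to distinguish the two cases.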
