scala.MatchError: null on spark RDDs

I am relatively new to both Spark and Scala. I was trying to implement collaborative filtering using Scala on Spark. Below is the code:

import org.apache.spark.mllib.recommendation.ALS
import org.apache.spark.mllib.recommendation.Rating

val data = sc.textFile("/user/amohammed/CB/input-cb.txt")

val distinctUsers = data.map(x => x.split(",")(0)).distinct().map(x => x.toInt)

val distinctKeywords = data.map(x => x.split(",")(1)).distinct().map(x => x.toInt)

val ratings = data.map(_.split(',') match {
  case Array(user, item, rate) => Rating(user.toInt,item.toInt, rate.toDouble)
})

val model = ALS.train(ratings, 1, 20, 0.01)

val keywords = distinctKeywords.collect()
distinctUsers.map(x => (x, keywords.map(y => model.predict(x, y)))).collect()

It throws scala.MatchError: null at org.apache.spark.rdd.PairRDDFunctions.lookup(PairRDDFunctions.scala:571) on the last line. The code works fine if I collect the distinctUsers RDD into an array and execute the same code:

val users = distinctUsers.collect()
users.map(x => (x, keywords.map(y => model.predict(x, y))))

Where am I going wrong when dealing with RDDs?

Spark version: 1.0.0, Scala version: 2.10.4

Going one call further back in the stack trace, line 43 of the MatrixFactorizationModel source says:

val userVector = new DoubleMatrix(userFeatures.lookup(user).head)

Note that the userFeatures field of model is itself another RDD; I believe it isn't getting serialized properly when the anonymous function block closes over model, and thus the lookup method on it fails. I also tried placing both model and keywords into broadcast variables, but that didn't work either.
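
To see the problem in isolation, here is a minimal sketch of the general rule (using hypothetical RDDs a and b, not taken from the code above): Spark doesn't let you reference one RDD from inside another RDD's transformation, and model.predict ends up calling lookup on the userFeatures RDD inside your map closure, which is exactly that pattern.

val a = sc.parallelize(1 to 3)
val b = sc.parallelize(4 to 6)

// Not supported: the closure passed to map references another RDD,
// so each task would have to run an RDD operation on the executors.
// a.map(x => b.filter(_ > x).count())

// Works: bring the small side to the driver first (or use cartesian/join)
val localB = b.collect()
a.map(x => localB.count(_ > x)).collect()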

Instead of falling back to Scala collections and losing the benefits of Spark, it's probably better to stick with RDDs and take advantage of other ways of transforming them.

I'd start with this:

val ratings = data.map(_.split(',') match {
  case Array(user, keyword, rate) => Rating(user.toInt, keyword.toInt, rate.toDouble)
})

// instead of parsing the original RDD's strings three separate times,
// you can map the "user" and "product" fields of the Rating case class

val distinctUsers = ratings.map(_.user).distinct()
val distinctKeywords = ratings.map(_.product).distinct()

val model = ALS.train(ratings, 1, 20, 0.01)

Then, instead of calculating each prediction one by one, we can obtain the Cartesian product of all possible user-keyword pairs as an RDD and use the other predict method in MatrixFactorizationModel, which takes an RDD of such pairs as its argument.

val userKeywords = distinctUsers.cartesian(distinctKeywords)

val predictions = model.predict(userKeywords).map { case Rating(user, keyword, rate) =>
  (user, Map(keyword -> rate))
}.reduceByKey { _ ++ _ }

Now predictions has an immutable map for each user that can be queried for the predicted rating of a particular keyword. If you specifically want arrays as in your original example, you can do:

val keywords = distinctKeywords.collect() // add .sorted if you want them in order
// each user's Map is applied as a function (keyword => predicted rating), giving an array aligned with keywords
val predictionArrays = predictions.mapValues(keywords.map(_))
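
For example, a quick way to spot-check a few of the resulting arrays on the driver (an illustrative snippet, not part of the solution itself):

predictionArrays.take(5).foreach { case (user, scores) =>
  println(s"user $user -> ${scores.mkString(", ")}")
}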

Caveat: I tested this with Spark 1.0.1 as it's what I had installed, but it should work with 1.0.0 as well.
