scala.MatchError: null on Spark RDDs

Question

I'm relatively new to both Spark and Scala. I'm trying to implement collaborative filtering using Scala on Spark. Here is the code:

import org.apache.spark.mllib.recommendation.ALS
import org.apache.spark.mllib.recommendation.Rating

val data = sc.textFile("/user/amohammed/CB/input-cb.txt")

val distinctUsers = data.map(x => x.split(",")(0)).distinct().map(x => x.toInt)

val distinctKeywords = data.map(x => x.split(",")(1)).distinct().map(x => x.toInt)

val ratings = data.map(_.split(',') match {
  case Array(user, item, rate) => Rating(user.toInt,item.toInt, rate.toDouble)
})

val model = ALS.train(ratings, 1, 20, 0.01)

val keywords = distinctKeywords.collect()

// this is the line that throws scala.MatchError: null
distinctUsers.map(x => (x, keywords.map(y => model.predict(x, y)))).collect()

On the last line it throws a scala.MatchError: null at org.apache.spark.rdd.PairRDDFunctions.lookup(PairRDDFunctions.scala:571). The code works fine if I collect the distinctUsers RDD into an array and run the same code:

val users = distinctUsers.collect()

// same logic, but on a local array instead of an RDD
users.map(x => (x, keywords.map(y => model.predict(x, y))))

Where am I going wrong when dealing with RDDs?

Spark version: 1.0.0, Scala version: 2.10.4

Answer 1

Going a bit further up the call stack in the stack trace, line 43 of the MatrixFactorizationModel source says:

val userVector = new DoubleMatrix(userFeatures.lookup(user).head)

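Note that the userFeatures field of model is itself another RDD. I believe it isn't being serialized properly when the anonymous function block closes over model, and so the lookup method on it fails. More generally, an RDD can't be used inside a transformation that runs on another RDD, because RDD operations can only be invoked from the driver. I also tried placing both model and keywords into broadcast variables, but that didn't work either.

Instead of falling back to Scala collections and losing the benefits of Spark, it's probably better to stick with RDDs and take advantage of other ways of transforming them.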
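To make the failing call concrete: as far as I can tell, a single model.predict(user, product) boils down to two feature lookups followed by a dot product. Here is a minimal sketch of that idea in plain Scala, with ordinary Maps standing in for the userFeatures and productFeatures RDDs. The predictOne helper, the IDs, and the feature values are all made up for illustration and are not part of MLlib.

```scala
// Plain Maps stand in for the userFeatures and productFeatures RDDs.
// The IDs and feature values are invented for illustration only.
val userFeatures: Map[Int, Array[Double]] = Map(1 -> Array(1.0, 2.0))
val productFeatures: Map[Int, Array[Double]] = Map(10 -> Array(3.0, 4.0))

// Hypothetical helper: look up both feature vectors, then take their dot product.
def predictOne(user: Int, product: Int): Double = {
  val u = userFeatures(user)
  val p = productFeatures(product)
  u.zip(p).map { case (a, b) => a * b }.sum
}
```

In the real model both lookups run against RDDs, which is why calling predict from inside distinctUsers.map ends up invoking an RDD operation on an executor rather than on the driver.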

I'd start with this:

val ratings = data.map(_.split(',') match {
  case Array(user, keyword, rate) => Rating(user.toInt, keyword.toInt, rate.toDouble)
})

// instead of parsing the original RDD's strings three separate times,
// you can map the "user" and "product" fields of the Rating case class

val distinctUsers = ratings.map(_.user).distinct()
val distinctKeywords = ratings.map(_.product).distinct()

val model = ALS.train(ratings, 1, 20, 0.01)

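Then, instead of calculating each prediction one by one, we can obtain the Cartesian product of all possible user-keyword pairs as an RDD and use the other predict method in MatrixFactorizationModel, which takes an RDD of such pairs as its argument.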

val userKeywords = distinctUsers.cartesian(distinctKeywords)

val predictions = model.predict(userKeywords).map { case Rating(user, keyword, rate) =>
  (user, Map(keyword -> rate))
}.reduceByKey { _ ++ _ }
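If the cartesian / reduceByKey combination is hard to picture, here is a rough sketch of the same shape using plain Scala collections. Everything in it is a stand-in: users and keywords replace the two RDDs, score is a hypothetical substitute for the trained model, and groupBy plus foldLeft plays the role of reduceByKey.

```scala
// Plain Scala stand-ins for the RDDs; all values are made up for illustration.
val users = Seq(1, 2)
val keywords = Seq(10, 20)

// Counterpart of distinctUsers.cartesian(distinctKeywords): every (user, keyword) pair.
val pairs: Seq[(Int, Int)] = for (u <- users; k <- keywords) yield (u, k)

// Hypothetical scoring function standing in for model.predict.
def score(user: Int, keyword: Int): Double = user * 100.0 + keyword

// Counterpart of map { ... (user, Map(keyword -> rate)) }.reduceByKey { _ ++ _ }:
// build one single-entry map per pair, then merge the maps that share a user.
val perUser: Map[Int, Map[Int, Double]] =
  pairs
    .map { case (u, k) => (u, Map(k -> score(u, k))) }
    .groupBy(_._1)
    .map { case (u, entries) =>
      u -> entries.map(_._2).foldLeft(Map.empty[Int, Double])(_ ++ _)
    }
```

With these made-up inputs, perUser(1) ends up holding one entry per keyword, which is the same structure predictions has for each user in the RDD version.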

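Now predictions has an immutable map for each user that can be queried for the predicted rating of a particular keyword. If you specifically want arrays as in your original example, you can do: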

val keywords = distinctKeywords.collect() // add .sorted if you want them in order
val predictionArrays = predictions.mapValues(keywords.map(_))
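The keywords.map(_) part may look odd. It relies on the fact that a Scala Map[Int, Double] is itself a function from Int to Double, so it can be passed straight to map. A small sketch with invented values (sampleKeywords, ratingsForUser and toArray are names made up for this example):

```scala
// Invented sample data: a fixed keyword order and one user's prediction map.
val sampleKeywords = Array(10, 20, 30)
val ratingsForUser: Map[Int, Double] = Map(20 -> 2.0, 10 -> 1.0, 30 -> 3.0)

// A Map[Int, Double] can be used wherever an Int => Double is expected,
// so mapping the keyword array over it yields the ratings in keyword order.
val asArray: Array[Double] = sampleKeywords.map(ratingsForUser)

// keywords.map(_) is shorthand for a function like this one,
// which is what mapValues applies to each user's map.
val toArray: Map[Int, Double] => Array[Double] = m => sampleKeywords.map(m)
```

One thing to keep in mind: applying a Map to a key it doesn't contain throws NoSuchElementException. That shouldn't happen here, because the Cartesian product produces an entry for every user-keyword pair.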

Caveat: I tested this with Spark 1.0.1 because that's what I had installed, but it should work with 1.0.0 as well.

Answer 1 (accepted), 2014-07-15 03:05:24
