
How to use an RDD inside another RDD's map method?

I have an RDD named index: RDD[(String, String)], and I want to use index to process my file. This is the code:

val get = file.map { x =>
  index.lookup(x).head // index is another RDD, referenced inside this closure
}

The problem is that I cannot use index inside the file.map function. When I run this program, it gives me the following error:

14/12/11 16:22:27 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 602, spark2): scala.MatchError: null
        org.apache.spark.rdd.PairRDDFunctions.lookup(PairRDDFunctions.scala:770)
        com.ynu.App$$anonfun$12.apply(App.scala:270)
        com.ynu.App$$anonfun$12.apply(App.scala:265)
        scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
        scala.collection.Iterator$$anon$10.next(Iterator.scala:312)
        scala.collection.Iterator$class.foreach(Iterator.scala:727)
        scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
        scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
        scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
        scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
        scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
        scala.collection.AbstractIterator.to(Iterator.scala:1157)
        scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
        scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
        scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
        scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
        org.apache.spark.rdd.RDD$$anonfun$28.apply(RDD.scala:1080)
        org.apache.spark.rdd.RDD$$anonfun$28.apply(RDD.scala:1080)
        org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)
        org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)
        org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
        org.apache.spark.scheduler.Task.run(Task.scala:54)
        org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
        java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        java.lang.Thread.run(Thread.java:745)

I don't know why. And if I want to implement this functionality, what can I do? Thanks.

You should see RDDs as virtual collections. The RDD reference only points to where the data is; by itself it holds no data, so there is no point in using it inside a closure that Spark ships to the executors. That is why lookup fails with scala.MatchError: null on the workers.

You will need to use functions that combine RDDs together in order to achieve the desired functionality. Also, lookup as defined here is a very sequential process that requires all the lookup data to be available in the memory of each worker, so it will not scale.

To resolve all elements of the file RDD to their corresponding values in index, you should join the two RDDs:

val resolvedFileRDD = file.keyBy(identity).join(index) // each element has the form (key, (key, value of key in index))
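
For completeness, here is a minimal runnable sketch of that approach. The sample data and the local[*] master are assumptions for illustration only; substitute your real file and index RDDs:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object ResolveWithJoin {
  def main(args: Array[String]): Unit = {
    // local[*] master is for illustration only
    val sc = new SparkContext(new SparkConf().setAppName("ResolveWithJoin").setMaster("local[*]"))

    // Hypothetical stand-ins for the real RDDs
    val index: RDD[(String, String)] = sc.parallelize(Seq(("a", "1"), ("b", "2")))
    val file: RDD[String] = sc.parallelize(Seq("a", "b", "a"))

    // On Spark versions before 1.3 you also need: import org.apache.spark.SparkContext._
    val resolved = file.keyBy(identity) // (key, key)
      .join(index)                      // (key, (key, value))
      .map { case (_, (_, value)) => value }

    resolved.collect().foreach(println) // prints 1, 2, 1 in some order
    sc.stop()
  }
}

Unlike calling lookup inside a closure, the join runs as a distributed operation, so neither RDD has to fit in the memory of a single worker.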
