

Nested use of sc.textFile inside an RDD

I need to use Spark to process data in a large set of text files, based on querying another index. I can do this for small cases (by converting the RDD to an array, see below), but I am having difficulty setting it up correctly to work with larger amounts of data.

I have this:

val rootPath = "..."
val ndxRDD = sc.textFile(rootPath + "index/2016-09-01*")

def ndxToDoc(articleName: String): String = sc.textFile(rootPath + articleName).first()

// works
val articlesArr = ndxRDD.collect().map(ndxToDoc)
val articlesRDD = sc.parallelize(articlesArr)

// does not work
// val articlesRDD = ndxRDD.map(ndxToDoc)

articlesRDD.count()

I believe the problem is that I am trying to read a file inside the RDD. How do I get the above working without the intermediate collect() - map -> textFile() - parallelize() sequence?

Thanks in advance!

I think this is already the optimal approach for this kind of task; it is what your use case calls for.

You must collect it as a list; otherwise you would have to create an RDD inside an RDD, which is not possible in the current implementation of Spark.

For more information on why we cannot create an RDD inside an RDD, look here:

  1. Spark-Google-Group Discussion

  2. SPARK-NestedRDD
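
To make the collect-as-list pattern concrete, here is a minimal sketch based on the question's own code. It assumes each line of the index is an article path relative to rootPath (as in ndxToDoc), and it substitutes a single sc.wholeTextFiles call over a comma-separated path list for the per-file sc.textFile(...).first() calls; that substitution is my suggestion, not something the question states:

// Collect the (small) index to the driver as a plain array of article names.
val articleNames = ndxRDD.collect()

// Build one comma-separated list of paths; textFile/wholeTextFiles accept such a
// list, so all articles can be read in one job without nesting sc.textFile in an RDD.
val articlePaths = articleNames.map(rootPath + _).mkString(",")

// (path, fullText) pairs; keep only the first line of each article,
// mirroring the .first() call in ndxToDoc.
val articlesRDD = sc.wholeTextFiles(articlePaths)
  .mapValues(_.split("\n", 2).head)

articlesRDD.count()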

Hence this is an optimal approach. One thing I can suggest, though, is that you can use off-heap memory to store large objects in memory.
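
As a rough sketch of what enabling off-heap storage could look like (the size is an example, and the configuration keys assume the Spark 1.6+ unified memory manager):

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel

// Off-heap memory must be enabled and sized explicitly,
// and these settings must be in place before the SparkContext is created.
val conf = new SparkConf()
  .set("spark.memory.offHeap.enabled", "true")
  .set("spark.memory.offHeap.size", "4g")   // example size, adjust to your cluster

// Persist the large RDD off the JVM heap to reduce GC pressure.
articlesRDD.persist(StorageLevel.OFF_HEAP)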
