

Nested use of sc.textFile inside an RDD

I need to use Spark to process data in a large set of text files, based on querying another index. I can do this for small cases (by converting the RDD to an array, see below), but I am having difficulty setting it up correctly to work with larger amounts of data.

I have this:

val rootPath = "..."
val ndxRDD = sc.textFile(rootPath + "index/2016-09-01*")

def ndxToDoc(articleName: String): String = sc.textFile(rootPath + articleName).first()

// works
val articlesArr = ndxRDD.collect().map(ndxToDoc)
val articlesRDD = sc.parallelize(articlesArr)

// does not work
// val articlesRDD = ndxRDD.map(ndxToDoc)

articlesRDD.count()

I believe the problem is that I am trying to read a file inside the RDD. How do I get the above working without the intermediate collect() - map -> textFile() - parallelize() sequence?

Thanks in advance!

I think this is already the optimal approach for this kind of task; it is what your use case calls for.

You must collect it as a list; otherwise you would have to create an RDD inside an RDD, which is not possible in the current implementation of Spark.

For more information on why we cannot create an RDD inside an RDD, look here:

  1. Spark-Google-Group Discussion

  2. SPARK-NestedRDD
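
To make the collect-as-list pattern concrete, here is a minimal sketch based on the question's own code. It assumes each line of the index is an article path relative to rootPath (as in ndxToDoc), and it substitutes a single sc.wholeTextFiles call over a comma-separated path list for the per-file sc.textFile(...).first() calls; that substitution is my suggestion, not something the question states:

// Collect the (small) index to the driver as a plain array of article names.
val articleNames = ndxRDD.collect()

// Build one comma-separated list of paths; textFile/wholeTextFiles accept such a
// list, so all articles can be read in one job without nesting sc.textFile in an RDD.
val articlePaths = articleNames.map(rootPath + _).mkString(",")

// (path, fullText) pairs; keep only the first line of each article,
// mirroring the .first() call in ndxToDoc.
val articlesRDD = sc.wholeTextFiles(articlePaths)
  .mapValues(_.split("\n", 2).head)

articlesRDD.count()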

Hence this is an optimal approach. One thing I can suggest, though, is that you can use off-heap memory to store large objects in memory.
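
As a rough sketch of what enabling off-heap storage could look like (the size is an example, and the configuration keys assume the Spark 1.6+ unified memory manager):

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel

// Off-heap memory must be enabled and sized explicitly,
// and these settings must be in place before the SparkContext is created.
val conf = new SparkConf()
  .set("spark.memory.offHeap.enabled", "true")
  .set("spark.memory.offHeap.size", "4g")   // example size, adjust to your cluster

// Persist the large RDD off the JVM heap to reduce GC pressure.
articlesRDD.persist(StorageLevel.OFF_HEAP)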
