
How to write a large RDD to local disk through the Scala spark-shell?

Through a Scala spark-shell, I have access to an Elasticsearch db using the elasticsearch-hadoop-5.5.0 connector.

I generate my RDD by passing the following command in the spark-shell:

val myRdd = sc.esRDD("myIndex/type", myESQuery)

myRdd contains 2.1 million records across 15 partitions. I have been trying to write all the data to text files on my local disk, but when I try to run operations that convert the RDD to an array, like myRdd.collect(), I overflow the Java heap.

Is there a way to export the data incrementally (e.g. 100k records at a time) so that I never overload my system memory?

When you use saveAsTextFile, you can pass your file path as "file:///path/to/output" to have it saved locally.
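
A minimal sketch of that approach, run inside the spark-shell where `sc` and `myRdd` already exist (the record formatting and output directory are illustrative assumptions):

```scala
// esRDD returns (documentId, fieldMap) pairs; format each record however you need.
// saveAsTextFile writes one part file per partition, so nothing is collected to the driver.
myRdd
  .map { case (id, doc) => s"$id\t$doc" }
  .saveAsTextFile("file:///path/to/output")
```

With 15 partitions this produces 15 part files in the output directory, and each partition is written out independently, so the full 2.1 million records never need to fit in memory at once.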

Another option is to use rdd.toLocalIterator, which will allow you to iterate over the RDD on the driver. You can then write each record to a file. This method avoids pulling all the records in at once.
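
A minimal sketch of that approach (the output file name and the record formatting are illustrative assumptions):

```scala
import java.io.PrintWriter

// toLocalIterator brings one partition at a time to the driver,
// so only a single partition needs to fit in driver memory.
val writer = new PrintWriter("/path/to/output.txt")
try {
  myRdd.toLocalIterator.foreach { case (id, doc) =>
    writer.println(s"$id\t$doc")
  }
} finally {
  writer.close()
}
```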
