
How to write a large RDD to local disk through the Scala spark-shell?

Through a Scala spark-shell, I have access to an Elasticsearch db using the elasticsearch-hadoop-5.5.0 connector.

I generate my RDD by passing the following command in the spark-shell:

val myRdd = sc.esRDD("myIndex/type", myESQuery)

myRdd contains 2.1 million records across 15 partitions. I have been trying to write all the data to text files on my local disk, but when I try to run operations that convert the RDD to an array, like myRdd.collect(), I overflow the Java heap.

Is there a way to export the data incrementally (e.g. 100k records at a time) so that I never overload my system memory?

When you use saveAsTextFile, you can pass your file path as "file:///path/to/output" to have it saved locally.
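
A minimal sketch of that approach, run inside the spark-shell where `sc` and `myRdd` already exist (the record formatting and output directory are illustrative assumptions):

```scala
// esRDD returns (documentId, fieldMap) pairs; format each record however you need.
// saveAsTextFile writes one part file per partition, so nothing is collected to the driver.
myRdd
  .map { case (id, doc) => s"$id\t$doc" }
  .saveAsTextFile("file:///path/to/output")
```

With 15 partitions this produces 15 part files in the output directory, and each partition is written out independently, so the full 2.1 million records never need to fit in memory at once.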

Another option is to use rdd.toLocalIterator, which will allow you to iterate over the RDD on the driver. You can then write each record to a file. This method avoids pulling all the records in at once.
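
A minimal sketch of that approach (the output file name and the record formatting are illustrative assumptions):

```scala
import java.io.PrintWriter

// toLocalIterator brings one partition at a time to the driver,
// so only a single partition needs to fit in driver memory.
val writer = new PrintWriter("/path/to/output.txt")
try {
  myRdd.toLocalIterator.foreach { case (id, doc) =>
    writer.println(s"$id\t$doc")
  }
} finally {
  writer.close()
}
```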
