How to write a large RDD to local disk through the Scala spark-shell?
Through a Scala spark-shell, I have access to an Elasticsearch db using the elasticsearch-hadoop-5.5.0 connector.

I generate my RDD by passing the following command in the spark-shell:
val myRdd = sc.esRDD("myIndex/type", myESQuery)
myRdd contains 2.1 million records across 15 partitions. I have been trying to write all the data to a text file (or files) on my local disk, but when I try to run operations that convert the RDD to an array, like myRdd.collect(), I overload my Java heap.

Is there a way to export the data incrementally (e.g. 100k records at a time) so that I never overload my system memory?
When you use saveAsTextFile, you can pass your filepath as "file:///path/to/output" to have it save locally.
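A minimal sketch of this approach, assuming myRdd is the RDD from the question (esRDD returns key/value pairs, so each record is formatted into a line first; the output path and the tab-separated format are placeholders, not anything prescribed by the connector):

```scala
// Render each (id, fields) pair as one line of text, then write to local disk.
// saveAsTextFile produces a directory containing one part-NNNNN file per partition,
// so the 15 partitions are written independently and never collected on the driver.
myRdd
  .map { case (id, fields) => s"$id\t$fields" } // placeholder formatting: one record per line
  .saveAsTextFile("file:///path/to/output")
```

Because each executor writes its own partitions, no single JVM ever holds the full 2.1 million records, which sidesteps the heap problem that collect() causes.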
Another option is to use rdd.toLocalIterator, which will allow you to iterate over the RDD on the driver. You can then write each line to a file. This method avoids pulling all the records in at once.
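A sketch of the iterator approach, again assuming myRdd from the question; the local output path and line format are placeholders. toLocalIterator fetches one partition at a time, so the driver only needs enough memory for a single partition rather than the whole dataset:

```scala
import java.io.PrintWriter

// Stream the RDD through the driver one partition at a time,
// appending each record to a single local file.
val writer = new PrintWriter("/path/to/output.txt")
try {
  myRdd.toLocalIterator.foreach { case (id, fields) =>
    writer.println(s"$id\t$fields") // placeholder formatting, as above
  }
} finally {
  writer.close() // always flush and release the file handle
}
```

The trade-off versus saveAsTextFile is that this writes one file sequentially through the driver, so it is slower, but it gives you a single output file instead of one part file per partition.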