
Apache Spark: reading RDD from Spark Cluster

I have an RDD in a Spark cluster. On the client side I call collect(), then create a Java stream from the collected data and build a CSV file from that stream.

When I call collect() on the RDD I bring all the data into memory on the client side, which is something I am trying to avoid. Is there any way to get the RDD from the Spark cluster as a stream?

I have a requirement not to move the logic that creates the CSV onto the Spark cluster; it must stay on the client side.

I am using a Standalone cluster and the Java API.
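For reference, the current approach probably looks roughly like this (a minimal sketch; the method name, variable names, and the String[] element type are assumptions, not the asker's actual code):

```java
import org.apache.spark.api.java.JavaRDD;

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;

public class CollectToCsv {
    // Writes the whole RDD to a local CSV by collecting it first.
    // collect() materializes every element in client (driver) memory at once,
    // which is exactly what the question wants to avoid.
    static void writeCsv(JavaRDD<String[]> rdd, String path) throws Exception {
        List<String> lines = rdd.collect().stream()          // entire RDD in memory here
                .map(cols -> String.join(",", cols))          // one CSV line per record
                .collect(Collectors.toList());
        Files.write(Paths.get(path), lines);
    }
}
```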

I am no expert, but I think I see what you are asking. Please post some code to help us understand it better, if you can.

For now, there are operations that work on a per-partition basis, but I don't know if that will get you all the way there. See toLocalIterator in the first answer to this question: Spark: Best practice for retrieving big data from RDD to local machine
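A minimal sketch of that approach with the Java API (method and variable names are assumptions): JavaRDD.toLocalIterator() pulls partitions back to the client one at a time, so only a single partition needs to fit in client memory while the CSV is written incrementally, and the CSV logic stays on the client side.

```java
import org.apache.spark.api.java.JavaRDD;

import java.io.BufferedWriter;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Iterator;

public class StreamToCsv {
    // Writes the RDD to a local CSV without collect():
    // toLocalIterator() fetches one partition at a time,
    // so at most one partition is held in client memory.
    static void writeCsv(JavaRDD<String[]> rdd, String path) throws Exception {
        Iterator<String[]> it = rdd.toLocalIterator();
        try (BufferedWriter out = Files.newBufferedWriter(Paths.get(path))) {
            while (it.hasNext()) {
                out.write(String.join(",", it.next()));
                out.newLine();
            }
        }
    }
}
```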

You can control the number of partitions with the second parameter to parallelize, "slices" (numSlices in the Java API), but it is not documented well. Pretty sure if you search for "partition" in the Spark Programming Guide you'll get a good idea; see the sketch after the link below.

http://spark.apache.org/docs/latest/programming-guide.html
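For illustration, a small sketch of setting the partition count explicitly (the data, the partition count of 4, and the local master are arbitrary choices for the demo): the second argument to parallelize sets how many partitions the resulting RDD has, and smaller partitions mean smaller chunks when they are later pulled back one at a time with toLocalIterator().

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.Arrays;
import java.util.List;

public class PartitionDemo {
    public static void main(String[] args) {
        // Local master used only for this demo; a standalone cluster URL works the same way.
        SparkConf conf = new SparkConf().setAppName("partition-demo").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            List<Integer> data = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8);
            // Second argument ("slices"/numSlices) sets the number of partitions.
            JavaRDD<Integer> rdd = sc.parallelize(data, 4);
            System.out.println(rdd.getNumPartitions()); // prints 4
        }
    }
}
```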
