I have an RDD in a Spark cluster. On the client side I call collect(), create a Java stream from the collected data, and write a CSV file from that stream.
When I call collect() on the RDD I bring all the data into memory on the client side, which is exactly what I am trying to avoid. Is there any way to retrieve the RDD from the Spark cluster as a stream?
I have a requirement to keep the CSV-creation logic on the client side rather than moving it into the Spark cluster.
I am using a Standalone cluster and the Java API.
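For context, my current approach looks roughly like this (class name, data, and file path are illustrative, and I use a local master here just for demonstration); the collect() call is the step I want to eliminate:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class CollectToCsv {
    public static void main(String[] args) throws IOException {
        SparkConf conf = new SparkConf().setAppName("collect-csv").setMaster("local[2]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> rdd = sc.parallelize(Arrays.asList("a,1", "b,2"));
            // Problem: collect() materializes the entire RDD in client memory at once.
            List<String> rows = rdd.collect();
            String csv = rows.stream().collect(Collectors.joining(System.lineSeparator()));
            Files.write(Paths.get("out.csv"), csv.getBytes());
        }
    }
}
```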
I am no expert, but I think I see what you are asking. Please post some code to help us understand it better, if you can.
For now there are operations that work on a per-partition basis, but I don't know if that's going to get you all the way home. See toLocalIterator in the first answer to this question: Spark: Best practice for retrieving big data from RDD to local machine
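A minimal sketch of what that could look like for your CSV case (local master and file name are illustrative, not a definitive implementation): toLocalIterator pulls one partition at a time to the driver, so at any moment only a single partition needs to fit in client memory, and your CSV logic stays on the client.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.io.IOException;
import java.io.PrintWriter;
import java.util.Arrays;
import java.util.Iterator;

public class ToLocalIteratorCsv {
    public static void main(String[] args) throws IOException {
        SparkConf conf = new SparkConf().setAppName("csv-stream").setMaster("local[2]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> rdd = sc.parallelize(Arrays.asList("a,1", "b,2", "c,3"), 3);
            // toLocalIterator fetches partitions one by one instead of all at once,
            // so the client only ever holds one partition's worth of rows.
            try (PrintWriter out = new PrintWriter("out.csv")) {
                Iterator<String> it = rdd.toLocalIterator();
                while (it.hasNext()) {
                    out.println(it.next());
                }
            }
        }
    }
}
```

Note that the iterator triggers one Spark job per partition, so it trades memory for extra round trips.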
You can control the number of partitions with the second parameter to parallelize ("numSlices"), but it isn't documented well. I'm pretty sure that if you search for "partition" in the Spark Programming Guide you'll get a good idea.
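For example (a quick sketch against a local master; the values are arbitrary), the second argument sets how many partitions the data is split into, which in turn controls the chunk size toLocalIterator fetches at a time:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.Arrays;

public class SlicesDemo {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("slices").setMaster("local[2]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Second argument ("numSlices") sets the partition count explicitly.
            JavaRDD<Integer> rdd = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6), 3);
            System.out.println(rdd.getNumPartitions()); // prints 3
        }
    }
}
```

More, smaller partitions mean a smaller per-fetch memory footprint on the client, at the cost of more jobs.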