I have an RDD in a Spark cluster. On the client side I call collect(), create a Java stream from the collected data, and write a CSV file from that stream.
When I call collect() on the RDD I bring all the data into memory on the client side, which is exactly what I am trying to avoid. Is there any way to retrieve the RDD from the Spark cluster as a stream?
I have a requirement to keep the CSV-creation logic on the client side rather than moving it into the Spark cluster.
I am using a Standalone cluster and the Java API.
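For context, my current approach looks roughly like this (class name, data, and file path are illustrative, and I use a local master here just for demonstration); the collect() call is the step I want to eliminate:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class CollectToCsv {
    public static void main(String[] args) throws IOException {
        SparkConf conf = new SparkConf().setAppName("collect-csv").setMaster("local[2]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> rdd = sc.parallelize(Arrays.asList("a,1", "b,2"));
            // Problem: collect() materializes the entire RDD in client memory at once.
            List<String> rows = rdd.collect();
            String csv = rows.stream().collect(Collectors.joining(System.lineSeparator()));
            Files.write(Paths.get("out.csv"), csv.getBytes());
        }
    }
}
```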
I am no expert, but I think I see what you are asking. Please post some code to help us understand it better, if you can.
For now there are operations that work on a per-partition basis, but I don't know if that's going to get you all the way home. See toLocalIterator in the first answer to this question: Spark: Best practice for retrieving big data from RDD to local machine
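A minimal sketch of what that could look like for your CSV case (local master and file name are illustrative, not a definitive implementation): toLocalIterator pulls one partition at a time to the driver, so at any moment only a single partition needs to fit in client memory, and your CSV logic stays on the client.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.io.IOException;
import java.io.PrintWriter;
import java.util.Arrays;
import java.util.Iterator;

public class ToLocalIteratorCsv {
    public static void main(String[] args) throws IOException {
        SparkConf conf = new SparkConf().setAppName("csv-stream").setMaster("local[2]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> rdd = sc.parallelize(Arrays.asList("a,1", "b,2", "c,3"), 3);
            // toLocalIterator fetches partitions one by one instead of all at once,
            // so the client only ever holds one partition's worth of rows.
            try (PrintWriter out = new PrintWriter("out.csv")) {
                Iterator<String> it = rdd.toLocalIterator();
                while (it.hasNext()) {
                    out.println(it.next());
                }
            }
        }
    }
}
```

Note that the iterator triggers one Spark job per partition, so it trades memory for extra round trips.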
You can control the number of partitions with the second parameter to parallelize ("numSlices"), but it isn't documented well. I'm pretty sure that if you search for "partition" in the Spark Programming Guide you'll get a good idea.
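For example (a quick sketch against a local master; the values are arbitrary), the second argument sets how many partitions the data is split into, which in turn controls the chunk size toLocalIterator fetches at a time:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.Arrays;

public class SlicesDemo {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("slices").setMaster("local[2]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Second argument ("numSlices") sets the partition count explicitly.
            JavaRDD<Integer> rdd = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6), 3);
            System.out.println(rdd.getNumPartitions()); // prints 3
        }
    }
}
```

More, smaller partitions mean a smaller per-fetch memory footprint on the client, at the cost of more jobs.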