
Unable to create an RDD from an existing RDD - Apache Spark

I'm trying to create a new RDD from an existing RDD.

  1. Initialize an array:

    scala> var a1 = Array(1,2,3,4,5,6,7,8,9,10)
    a1: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

  2. Create the first RDD:

    scala> var r1 = sc.parallelize(a1)
    r1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize at <console>:26

  3. Create the second RDD - it fails with the following error:

    scala> var newrdd = sc.parallelize(r1.map(data=>(data*2)))
    <console>:26: error: type mismatch;
     found   : org.apache.spark.rdd.RDD[Int]
     required: Seq[?]
    Error occurred in an application involving default arguments.
           var newrdd = sc.parallelize(r1.map(data=>(data*2)))
                                       ^

The original array can still be used to create another RDD, but that doesn't create an RDD from an existing RDD.

scala> var newrdd = sc.parallelize(a1.map(data=>(data*2)))
newrdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[3] at parallelize
at <console>:26

Do you have any idea what the problem with this approach is?

Or how I can create an RDD from an existing RDD?

Thanks for reading.

The signature of the parallelize method is:

def parallelize[T](seq: Seq[T], numSlices: Int = defaultParallelism)(implicit arg0: ClassTag[T]): RDD[T] 

so you cannot pass an RDD as the parameter directly - the first argument must be a local Seq.
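If you really needed to feed an RDD's contents back into parallelize, you would first have to collect it into a local collection. A sketch, reusing sc and r1 from the question (note this is generally an anti-pattern, since it pulls all the data onto the driver):

```scala
// collect() returns a local Array[Int], which satisfies the Seq[T] parameter.
// Only do this for small datasets: all elements are brought to the driver.
val localData: Array[Int] = r1.map(_ * 2).collect()
val newrdd = sc.parallelize(localData)
```

This compiles, but the map/collect/parallelize round trip is exactly what the transformation-based approach below avoids.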

If you want to create an RDD from an existing RDD, you can use the methods defined on RDD. For example,

val newrdd = r1.map(data => data * 2)

Or simply, r1.map(_ * 2).
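More generally, transformations can be chained to derive new RDDs from existing ones without ever going through the driver. A sketch using the r1 from the question:

```scala
// Each transformation returns a new, lazily evaluated RDD;
// nothing is computed until an action (like reduce) is called.
val doubled   = r1.map(_ * 2)               // RDD[Int]: 2, 4, ..., 20
val multiples = doubled.filter(_ % 4 == 0)  // keep only multiples of 4
val total     = multiples.reduce(_ + _)     // action: triggers the computation
```

Because transformations are lazy, Spark can pipeline doubled and multiples into a single pass over the partitions when the action runs.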
