
Unable to create an RDD from an existing RDD - Apache Spark

I'm trying to create a new RDD from an existing RDD.

  1. Initialize an array:

    scala> var a1 = Array(1,2,3,4,5,6,7,8,9,10)
    a1: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

  2. Create the first RDD:

    scala> var r1 = sc.parallelize(a1)
    r1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize at <console>:26

  3. Create the second RDD - it fails with the following error:

    scala> var newrdd = sc.parallelize(r1.map(data=>(data*2)))
    <console>:26: error: type mismatch;
     found   : org.apache.spark.rdd.RDD[Int]
     required: Seq[?]
    Error occurred in an application involving default arguments.
           var newrdd = sc.parallelize(r1.map(data=>(data*2)))
                                       ^

The original array can still be used to create another RDD, but that doesn't create an RDD from an existing RDD.

scala> var newrdd = sc.parallelize(a1.map(data=>(data*2)))
newrdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[3] at parallelize
at <console>:26

Do you have any idea what the problem with this approach is?

Or how I can create an RDD from an existing RDD?

Thanks for reading.

The signature of the parallelize method is:

def parallelize[T](seq: Seq[T], numSlices: Int = defaultParallelism)(implicit arg0: ClassTag[T]): RDD[T] 

so you cannot pass an RDD as the parameter directly - the first argument must be a local Seq.
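If you really needed to feed an RDD's contents back into parallelize, you would first have to collect it into a local collection. A sketch, reusing sc and r1 from the question (note this is generally an anti-pattern, since it pulls all the data onto the driver):

```scala
// collect() returns a local Array[Int], which satisfies the Seq[T] parameter.
// Only do this for small datasets: all elements are brought to the driver.
val localData: Array[Int] = r1.map(_ * 2).collect()
val newrdd = sc.parallelize(localData)
```

This compiles, but the map/collect/parallelize round trip is exactly what the transformation-based approach below avoids.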

If you want to create an RDD from an existing RDD, you can use the methods defined on RDD. For example,

val newrdd = r1.map(data => data * 2)

Or simply, r1.map(_ * 2).
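More generally, transformations can be chained to derive new RDDs from existing ones without ever going through the driver. A sketch using the r1 from the question:

```scala
// Each transformation returns a new, lazily evaluated RDD;
// nothing is computed until an action (like reduce) is called.
val doubled   = r1.map(_ * 2)               // RDD[Int]: 2, 4, ..., 20
val multiples = doubled.filter(_ % 4 == 0)  // keep only multiples of 4
val total     = multiples.reduce(_ + _)     // action: triggers the computation
```

Because transformations are lazy, Spark can pipeline doubled and multiples into a single pass over the partitions when the action runs.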
