I'm trying to create a new RDD from an existing RDD.
Initialize an array:
scala> var a1 = Array(1,2,3,4,5,6,7,8,9,10)
a1: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
Create the first RDD:
scala> var r1 = sc.parallelize(a1)
r1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize at <console>:26
Create the second RDD. This fails with the following error:
scala> var newrdd = sc.parallelize(r1.map(data=>(data*2)))
<console>:26: error: type mismatch;
 found   : org.apache.spark.rdd.RDD[Int]
 required: Seq[?]
Error occurred in an application involving default arguments.
       var newrdd = sc.parallelize(r1.map(data=>(data*2)))
                                      ^
The original array can still be used to create another RDD, but that does not create an RDD from an existing RDD:
scala> var newrdd = sc.parallelize(a1.map(data=>(data*2)))
newrdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[3] at parallelize at <console>:26
Do you have any idea what the problem with this approach is?
Or how can I create an RDD from an existing RDD?
Thanks for reading.
The signature of the parallelize method is:
def parallelize[T](seq: Seq[T], numSlices: Int = defaultParallelism)(implicit arg0: ClassTag[T]): RDD[T]
It expects a local Seq, so you cannot pass an RDD as a parameter directly.
If you want to create an RDD from an existing RDD, you can use the methods defined for RDD. For example:
val newrdd = r1.map(data => data * 2)
Or simply, r1.map(_ * 2).
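For anyone who wants to try this outside spark-shell, here is a minimal standalone sketch of the same idea. It assumes Spark is on the classpath and runs in local mode; the object name RddFromRdd, the helper doubleAll, and the app name are made up for illustration and are not part of Spark. It shows the transformation route (map on the existing RDD) next to the roundabout route through collect(), which pulls the data back to the driver and is usually only reasonable for small data sets.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical example object; the names here are illustrative only.
object RddFromRdd {

  // A transformation on an RDD returns a new RDD; no parallelize call is needed.
  def doubleAll(rdd: RDD[Int]): RDD[Int] = rdd.map(_ * 2)

  def main(args: Array[String]): Unit = {
    // Local-mode context so the sketch runs without a cluster (an assumption).
    val conf = new SparkConf().setAppName("rdd-from-rdd").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val a1 = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

      // parallelize takes a local Seq, so the array is converted explicitly.
      val r1: RDD[Int] = sc.parallelize(a1.toSeq)

      // New RDD derived directly from the existing RDD.
      val newrdd: RDD[Int] = doubleAll(r1)

      // Alternative: bring the data back to the driver with collect(),
      // then parallelize the resulting local collection again.
      val roundTrip: RDD[Int] = sc.parallelize(r1.collect().map(_ * 2).toSeq)

      println(newrdd.collect().mkString(", "))
      println(roundTrip.collect().mkString(", "))
    } finally {
      sc.stop()
    }
  }
}
```

Both println calls should print the doubled values 2 through 20. The first form is normally preferable because the data stays distributed and the map is only evaluated when an action such as collect() runs.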