
Java RDD vs Scala RDD

I am working in Spark and picking up Scala along the way. I have a question about the RDD API and how the various base RDDs are implemented. Specifically, I ran the following code in spark-shell:

scala> val gspeech_path="/home/myuser/gettysburg.txt"
gspeech_path: String = /home/myuser/gettysburg.txt

scala> val lines=sc.textFile(gspeech_path)
lines: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[7] at textFile at <console>:29

scala> val pairs = lines.map(x => (x.split(" ")(0), x))
pairs: org.apache.spark.rdd.RDD[(String, String)] = MapPartitionsRDD[8] at map at <console>:3

scala> val temps:Seq[(String,Seq[Double])]=Seq(("SP",Seq(68,70,75)),
                                       ("TR",Seq(87,83,88,84,88)), 
                                       ("EN",Seq(52,55,58,57.5)),
                                       ("ER",Seq(90,91.3,88,91)))

temps: Seq[(String, Seq[Double])] = List((SP,List(68.0, 70.0, 75.0)), (TR,List(87.0, 83.0, 88.0, 84.0, 88.0)), (EN,List(52.0, 55.0, 58.0, 57.5)), (ER,List(90.0, 91.3, 88.0, 91.0)))

scala> var temps_rdd0=sc.parallelize(temps)
temps_rdd0: org.apache.spark.rdd.RDD[(String, Seq[Double])] = ParallelCollectionRDD[9] at parallelize at <console>:29

I wanted to investigate a bit more and looked up the API for MapPartitionsRDD and ParallelCollectionRDD, expecting them to be subclasses of the abstract base class org.apache.spark.rdd.RDD. However, I couldn't find these classes when I searched the Spark Scala API (Scaladoc).

I was able to find them only in the Javadocs at spark.apache.org, not in the Scaladocs. From what I know of Scala, the two languages can intermingle, and I had assumed Spark was written in Java. I would appreciate some clarification on the exact relationship as it pertains to RDDs. Is it the case that we have an abstract Scala RDD reference whose underlying implementation is a concrete Java RDD, as the shell response below seems to suggest?

# Scala abstract RDD = Concrete Java MapPartitionsRDD
org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[7] 


Thanks in advance for your help/explanation.

As @Archeg pointed out in his comment above, these classes are indeed Scala classes; they can be found in the Spark source as org.apache.spark.rdd.MapPartitionsRDD:

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/MapPartitionsRDD.scala

What caused my confusion was that I couldn't find MapPartitionsRDD when I searched the Spark Scala API (Scaladoc). The reason is that MapPartitionsRDD is declared private[spark], so it is hidden from the user-facing Scaladoc even though it is an ordinary Scala class.
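A quick way to confirm this at the REPL (a minimal sketch; it assumes a running spark-shell with its SparkContext sc, and reuses the file path from the question) is to inspect the runtime class with plain reflection:

val lines = sc.textFile("/home/myuser/gettysburg.txt")

// Prints "org.apache.spark.rdd.MapPartitionsRDD": a Scala class,
// not a separate Java implementation.
println(lines.getClass.getName)

// The Java API wraps the Scala RDD, not the other way around:
// JavaRDD[T] holds a Scala RDD[T], exposed through its rdd field.
println(lines.toJavaRDD().rdd.getClass.getName)

This also settles the question asked above: Spark core is written in Scala, and org.apache.spark.api.java.JavaRDD is a thin wrapper around the Scala RDD, so there is no underlying Java RDD beneath the Scala one.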

MapPartitionsRDD is an RDD that applies the provided function f to every partition of the parent RDD.
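For intuition, here is a minimal sketch (made-up data, run in spark-shell) showing that the function is invoked once per partition with that partition's iterator, rather than once per element:

val nums = sc.parallelize(1 to 10, numSlices = 4)

// mapPartitions calls f once per partition; here f collapses each
// partition's iterator into a single element count.
val countsPerPartition = nums.mapPartitions(iter => Iterator(iter.size))

println(countsPerPartition.collect().mkString(", "))  // e.g. 2, 3, 2, 3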

By default it does not preserve partitioning: the last constructor parameter, preservesPartitioning, is false. If it is true, the new RDD retains the parent RDD's partitioner.
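The difference is easy to observe (again a minimal sketch with made-up data): map always discards the partitioner, while mapPartitions lets the caller promise that the keys are unchanged:

import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))
val partitioned = pairs.partitionBy(new HashPartitioner(4))
println(partitioned.partitioner)               // Some(...HashPartitioner...)

// map builds a MapPartitionsRDD with preservesPartitioning = false,
// so the partitioner is dropped:
println(partitioned.map(identity).partitioner) // None

// mapPartitions can keep it when told the keys are not modified:
val kept = partitioned.mapPartitions(iter => iter, preservesPartitioning = true)
println(kept.partitioner)                      // Some(...HashPartitioner...)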

MapPartitionsRDD is the result of the following transformations (a short check follows the list):

  1. map
  2. flatMap
  3. filter
  4. glom
  5. mapPartitions
  6. mapPartitionsWithIndex
  7. PairRDDFunctions.mapValues
  8. PairRDDFunctions.flatMapValues
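
Each item is easy to verify at the REPL; here is a sketch with throwaway data, using getSimpleName to read the runtime class name (the PairRDDFunctions entries behave the same way on a pair RDD):

val nums = sc.parallelize(1 to 10)

// Every transformation below should report MapPartitionsRDD.
val examples: Seq[(String, org.apache.spark.rdd.RDD[_])] = Seq(
  "map"           -> nums.map(_ + 1),
  "flatMap"       -> nums.flatMap(n => Seq(n, n)),
  "filter"        -> nums.filter(_ % 2 == 0),
  "glom"          -> nums.glom(),
  "mapPartitions" -> nums.mapPartitions(_.map(_ + 1))
)

examples.foreach { case (name, rdd) =>
  println(s"$name -> ${rdd.getClass.getSimpleName}")
}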
