
How to know which RDD type is inferred by Spark in Scala

I was trying the following example:

val lista = List(("a", 3), ("a", 1), ("b", 7), ("a", 5))
val rdd = sc.parallelize(lista)

Then in the shell I get the following:

rdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[40] at parallelize at <console>:30

But for some reason I still haven't figured out, I was able to execute this statement:

import scala.collection.mutable.HashSet

val resAgg = rdd.aggregateByKey(new HashSet[Int])(_+_, _++_)

Getting this in the shell:

resAgg: org.apache.spark.rdd.RDD[(String, scala.collection.mutable.HashSet[Int])] = ShuffledRDD[41] at aggregateByKey at <console>:32

So I have some questions:

1.- What is the real RDD type of the var named rdd? Because in the shell it shows it is of type org.apache.spark.rdd.RDD[(String, Int)], but looking at the API, the RDD class does not have an aggregateByKey method. By the way, the JavaPairRDD class does have an aggregateByKey method.

2.- How can I verify/know the real type of an RDD?

3.- What is that ParallelCollectionRDD that showed up? I looked for it on GitHub and found it is a private class, so I guess that is why it does not appear in the Scala API, but what is it for?

I was using Spark 1.6.2.

What you're seeing is the effect of an implicit conversion:

  • rdd does have the type org.apache.spark.rdd.RDD[(String, Int)]
  • When you call aggregateByKey and it isn't defined for this type, the compiler looks for an implicit conversion into some type that does define it, and finds this conversion into PairRDDFunctions:

     implicit def rddToPairRDDFunctions[K, V](rdd: RDD[(K, V)])
         (implicit kt: ClassTag[K], vt: ClassTag[V], ord: Ordering[K] = null): PairRDDFunctions[K, V] = {
       new PairRDDFunctions(rdd)
     }
  • Then, PairRDDFunctions.aggregateByKey is invoked (the sketch after this list applies the conversion explicitly).
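
To see the desugared form yourself, you can apply the conversion by hand. A minimal sketch for the Spark 1.6-era API, assuming a spark-shell session where sc is predefined (as in your example); since Spark 1.3 this implicit lives in the RDD companion object, so it can be called explicitly:

import scala.collection.mutable.HashSet
import org.apache.spark.rdd.RDD

val rdd = sc.parallelize(List(("a", 3), ("a", 1), ("b", 7), ("a", 5)))

// Apply the implicit conversion manually: wrap the RDD in PairRDDFunctions.
val pairFns = RDD.rddToPairRDDFunctions(rdd)

// This is exactly what rdd.aggregateByKey(...) expands to after the conversion.
val resAgg = pairFns.aggregateByKey(new HashSet[Int])(_ + _, _ ++ _)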

As for your last question:

What is that ParallelCollectionRDD

RDD is an abstract class with many subclasses, and ParallelCollectionRDD is one of them. Generally speaking, each subclass corresponds to a different way an RDD is produced or processed, e.g. reading/writing/shuffling/checkpointing. This specific type is used when calling SparkContext.parallelize - meaning, it is used to parallelize a collection from the driver program. Indeed, it's private, and you generally shouldn't care which subtype of RDD you actually have at hand.
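
That said, if you do want to verify the concrete runtime type (your question 2), a short sketch, again assuming a spark-shell session with sc defined:

val rdd = sc.parallelize(List(("a", 3), ("b", 7)))

// getClass reveals the (private) runtime subclass behind the static type,
// e.g. org.apache.spark.rdd.ParallelCollectionRDD for a parallelized collection.
println(rdd.getClass.getName)

// toDebugString prints the whole RDD lineage, listing the concrete subclass
// at each step, which is often the more useful view.
println(rdd.toDebugString)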
