
How to know which RDD type is inferred by Spark using Scala

I was trying the following example

val lista = List(("a", 3), ("a", 1), ("b", 7), ("a", 5))
val rdd = sc.parallelize(lista)

Then in the shell I get the following

rdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[40] at parallelize at <console>:30

But, for some reason I still haven't figured out, I was able to execute this statement

import scala.collection.mutable.HashSet
val resAgg = rdd.aggregateByKey(new HashSet[Int])(_ + _, _ ++ _)

Getting this in the shell

resAgg: org.apache.spark.rdd.RDD[(String, scala.collection.mutable.HashSet[Int])] = ShuffledRDD[41] at aggregateByKey at <console>:32
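For context, the per-key result this aggregation builds can be simulated with plain Scala collections (no Spark needed). The sketch below only mimics the semantics of `aggregateByKey(new HashSet[Int])(_ + _, _ ++ _)` on a single partition; it is not how Spark actually executes it, and `aggregateByKey` here is a hypothetical helper name:

```scala
import scala.collection.mutable.HashSet

object AggregateByKeyDemo {
  // Simulates what rdd.aggregateByKey(new HashSet[Int])(_ + _, _ ++ _)
  // computes: for each key, fold its values into a HashSet (the zero value),
  // using the seqOp (_ + _). With one "partition" the combOp (_ ++ _) never runs.
  def aggregateByKey(pairs: List[(String, Int)]): Map[String, HashSet[Int]] =
    pairs.foldLeft(Map.empty[String, HashSet[Int]]) { case (acc, (k, v)) =>
      val set = acc.getOrElse(k, new HashSet[Int])
      acc.updated(k, set += v) // seqOp: add this value into the per-key set
    }

  def main(args: Array[String]): Unit = {
    val lista = List(("a", 3), ("a", 1), ("b", 7), ("a", 5))
    println(aggregateByKey(lista))
  }
}
```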

So I have some questions:

1.- What is the real RDD type of the var named rdd? Because in the shell it shows it is of type org.apache.spark.rdd.RDD[(String, Int)], but looking at the API, the RDD class does not have an aggregateByKey method. By the way, the JavaPairRDD class does have an aggregateByKey method.

2.- How can I verify/know the real type of an RDD?

3.- What is that ParallelCollectionRDD that showed up? I looked for it on GitHub and found it is a private class, so I guess that is why it does not appear in the Scala API, but what is it for?

I was using Spark 1.6.2

What you're seeing is the effect of implicit conversion:

  • rdd does have the type org.apache.spark.rdd.RDD[(String, Int)]
  • When you try calling aggregateByKey and it isn't there for this type, the compiler looks for an implicit conversion into some type that does have it - and finds this conversion into PairRDDFunctions:

     implicit def rddToPairRDDFunctions[K, V](rdd: RDD[(K, V)])
         (implicit kt: ClassTag[K], vt: ClassTag[V], ord: Ordering[K] = null): PairRDDFunctions[K, V] = {
       new PairRDDFunctions(rdd)
     }
  • Then, PairRDDFunctions.aggregateByKey is invoked.
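This "enrich my library" mechanism can be sketched without Spark. In the hypothetical example below, `Wrapped` stands in for RDD and `WrappedOps` for PairRDDFunctions (the names are made up for illustration):

```scala
import scala.language.implicitConversions

// Wrapped plays the role of RDD: it has no `shout` method of its own.
class Wrapped(val s: String)

// WrappedOps plays the role of PairRDDFunctions: it supplies the extra method.
class WrappedOps(w: Wrapped) {
  def shout: String = w.s.toUpperCase + "!"
}

object ImplicitDemo {
  // The analogue of rddToPairRDDFunctions: the compiler applies it when
  // `shout` is called on a Wrapped, which doesn't define it.
  implicit def toOps(w: Wrapped): WrappedOps = new WrappedOps(w)

  def main(args: Array[String]): Unit = {
    val w = new Wrapped("hello")
    println(w.shout) // the compiler rewrites this to toOps(w).shout
  }
}
```

Note that the static type of `w` stays `Wrapped` throughout, just as rdd's type stays RDD[(String, Int)]; only the call site is rewritten.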

As for your last question:

What is that ParallelCollectionRDD

RDD is an abstract class with many subclasses; this is one of them. Generally speaking, each subclass is in charge of different actions done on the RDD, e.g. reading/writing/shuffling/checkpointing etc. This specific type is used when calling SparkContext.parallelize - meaning, it is used to parallelize a collection from the driver program. Indeed, it's private, and you shouldn't generally care about which subtype of RDD you actually have at hand.
