Spark & Scala: can't get MappedRDD to perform groupByKey from RDD
I am facing a disappointing issue while trying to use groupByKey or any function of a PairRDD or MappedRDD. What I get is that I always have just an RDD, and I don't know how to convert it (really, I am quite sure that the conversion should be detected automatically by Scala). My code is the following:
val broadcastedDistanceMeasure = sc.broadcast(dbScanSettings.distanceMeasure)
val distances = input.cartesian(input)
  .filter(t => t._1 != t._2)
  .map({
    case (p1, p2) => p1 -> broadcastedDistanceMeasure.value.distance(p1, p2)
  })
where input is an RDD. The resulting type, according to both Eclipse and sbt run, is actually an RDD, so I cannot perform a groupByKey operation. If I try almost the same code in the Spark shell, instead, I get a MappedRDD.
This is my build.sbt file:
scalaVersion := "2.10.4"
libraryDependencies += "org.apache.spark" % "spark-core_2.10" % "1.1.0"
Can anybody help me?
Thanks.
Greetings,
Marco
I think that inside the IDE you will never see the MappedRDD type for any RDD, since it is provided through an implicit conversion in the Spark Scala API. If you look, for example, at the source of SparkContext, you will see the implicit conversions from the plain RDD to the richer specialized interfaces such as PairRDDFunctions; from inside these specialized interfaces you can then use functions such as groupByKey, which are made available thanks to the implicit conversions. So, in short, I think you only need to import org.apache.spark.SparkContext._ in order to achieve what you want.
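As a sketch of what that looks like end-to-end, here is the asker's pattern with the import in place. This assumes Spark 1.1 on the classpath, a local master, and a hypothetical stand-in distance measure (absolute length difference over strings), since the original dbScanSettings.distanceMeasure is not shown:

```scala
import org.apache.spark.{SparkConf, SparkContext}
// In Spark 1.x this import brings rddToPairRDDFunctions (and friends) into scope
import org.apache.spark.SparkContext._

object GroupByKeyFix {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("groupByKey-fix"))
    val input = sc.parallelize(Seq("a", "bb", "ccc"))
    val distances = input.cartesian(input)
      .filter(t => t._1 != t._2)
      // hypothetical distance measure standing in for broadcastedDistanceMeasure.value.distance
      .map { case (p1, p2) => p1 -> math.abs(p1.length - p2.length) }
    // This now compiles: the implicit conversion wraps distances in PairRDDFunctions
    val grouped = distances.groupByKey()
    grouped.collect().foreach(println)
    sc.stop()
  }
}
```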
In this particular case, I think the specific conversion is

implicit def rddToPairRDDFunctions[K: ClassTag, V: ClassTag](rdd: RDD[(K, V)]) =
  new PairRDDFunctions(rdd)

which wraps the RDD into a PairRDDFunctions, which in turn contains the groupByKey operation.
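The mechanism itself is plain Scala and can be demonstrated without Spark at all. The sketch below (all names hypothetical) mirrors what rddToPairRDDFunctions does: a wrapper class plus an implicit conversion makes extra methods appear on an existing type, but only while the implicit is in scope:

```scala
import scala.language.implicitConversions

// A minimal wrapper playing the role of Spark's PairRDDFunctions,
// here over a plain Seq of key/value pairs instead of an RDD.
class PairFunctions[K, V](pairs: Seq[(K, V)]) {
  // Collect all values sharing the same key, like groupByKey does
  def groupByKey(): Map[K, Seq[V]] =
    pairs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }
}

object Enrichment {
  // The analogue of rddToPairRDDFunctions: while this implicit is in scope,
  // every Seq[(K, V)] silently gains the groupByKey method.
  implicit def seqToPairFunctions[K, V](pairs: Seq[(K, V)]): PairFunctions[K, V] =
    new PairFunctions(pairs)
}

object Demo extends App {
  import Enrichment._ // without this import, .groupByKey() would not compile
  val grouped = Seq("a" -> 1, "b" -> 2, "a" -> 3).groupByKey()
  println(grouped("a")) // prints List(1, 3)
}
```

This is why the static type you see in Eclipse stays RDD: the conversion is applied by the compiler at the call site, not reflected in the declared type.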
Hope it helped.