
Spark & Scala: can't get MappedRDD to perform groupByKey from RDD

I am facing a frustrating issue while trying to use groupByKey or any other function of a PairRDD or MappedRDD. What I get is always just a plain RDD, and I don't know how to convert it (actually, I am quite sure the conversion should be picked up automatically by Scala). My code is the following:

// broadcast the distance measure so it can be used inside the closures
val broadcastedDistanceMeasure = sc.broadcast(dbScanSettings.distanceMeasure)

// build every pair of distinct elements and map it to (element, distance)
val distances = input.cartesian(input)
  .filter(t => t._1 != t._2)
  .map {
    case (p1, p2) => p1 -> broadcastedDistanceMeasure.value.distance(p1, p2)
  }

where input is an RDD. The resulting type, according to Eclipse and sbt run, is actually an RDD, so I cannot perform a groupByKey operation. If I try almost the same code in the spark shell, instead, I get a MappedRDD.

This is my build.sbt file:

scalaVersion := "2.10.4"

libraryDependencies += "org.apache.spark" % "spark-core_2.10" % "1.1.0"

Can anybody help me?

Thanks.

Greetings.

Marco

I think that inside the IDE you will never see the MappedRDD type for any RDD, since it is provided through an implicit conversion in the Spark Scala API. If you look, for example, at the source of SparkContext, you will see the implicit conversions from the plain RDD to the specialized, richer interfaces such as PairRDDFunctions; from inside these specialized interfaces you can use functions such as groupByKey, which become available thanks to those implicit conversions. So, in short, I think you only need to import org.apache.spark.SparkContext._ in order to achieve what you want.

In this particular case, I think the specific conversion is

implicit def rddToPairRDDFunctions[K: ClassTag, V: ClassTag](rdd: RDD[(K, V)]) =
    new PairRDDFunctions(rdd)

which wraps the RDD into a PairRDDFunctions, which in turn contains the groupByKey operation.
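Put together with the code from the question, a minimal sketch of the fix could look like this (input, dbScanSettings and the distance method are assumed to exist exactly as in the original snippet):

import org.apache.spark.SparkContext._  // brings rddToPairRDDFunctions into scope

val broadcastedDistanceMeasure = sc.broadcast(dbScanSettings.distanceMeasure)

val distances = input.cartesian(input)
  .filter(t => t._1 != t._2)
  .map { case (p1, p2) => p1 -> broadcastedDistanceMeasure.value.distance(p1, p2) }

// With the import in scope, distances is implicitly enriched with PairRDDFunctions,
// so groupByKey is available and yields an RDD of (key, Iterable of distances).
val grouped = distances.groupByKey()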

Hope it helped.
