Joining two RDD[String] - Spark Scala
I have two RDDs:
rdd1 [String,String,String]: Name, Address, Zipcode
rdd2 [String,String,String]: Name, Address, Landmark
I am trying to join these two RDDs using: rdd1.join(rdd2)
But I get an error:
error: value fullOuterJoin is not a member of org.apache.spark.rdd.RDD[String]
The join should join the RDD[String]s, and the output RDD should look like:
rddOutput : Name,Address,Zipcode,Landmark
Finally, I want to save the output as a JSON file.
Can someone help me?
As mentioned in the comments, you must convert the RDDs into PairRDDs before joining, which means each RDD must be of the form RDD[(key, value)]. Only then can you perform a join by key. In your case, the key is composed of (name, address), so you would have to do something like:
// First, we create the first PairRDD, with (name, address) as key and zipcode as value:
val pairRDD1 = rdd1.map { case (name, address, zipcode) => ((name, address), zipcode) }
// Then, we create the second PairRDD, with (name, address) as key and landmark as value:
val pairRDD2 = rdd2.map { case (name, address, landmark) => ((name, address), landmark) }
// Now we can join them.
// fullOuterJoin yields ((name, address), (Option[zipcode], Option[landmark])),
// because either side may be missing for a given key, so we unwrap the Options
// when mapping to the desired format:
val joined = pairRDD1.fullOuterJoin(pairRDD2).map {
  case ((name, address), (zipcode, landmark)) =>
    (name, address, zipcode.getOrElse(""), landmark.getOrElse(""))
}
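Note that fullOuterJoin wraps both value sides in Option, since a key may appear in only one of the two RDDs. Spark is not needed to see this; here is a minimal plain-Scala sketch of the same semantics, where the fullOuter helper is hypothetical and written only to illustrate the Option wrapping:

```scala
// Hypothetical helper mimicking RDD.fullOuterJoin on plain Maps:
// for every key present in either map, pair up the (optional) values.
def fullOuter[K, V, W](left: Map[K, V], right: Map[K, W]): Map[K, (Option[V], Option[W])] =
  (left.keySet ++ right.keySet).map { k => k -> (left.get(k), right.get(k)) }.toMap

val zip      = Map(("Ann", "1 Main St") -> "10001")
val landmark = Map(("Ann", "1 Main St") -> "Park", ("Bob", "2 Oak Ave") -> "Mall")

val joined = fullOuter(zip, landmark).map {
  case ((name, address), (z, l)) => (name, address, z.getOrElse(""), l.getOrElse(""))
}
// Bob has no zipcode, so the missing side comes back as None and is defaulted to ""
```

This is why pattern matching the joined values directly as plain strings would fail to compile: the values are Options until you unwrap them.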
More about PairRDD functions in Spark's Scala API documentation.
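For the final step of saving as JSON, one simple option with plain RDDs is to format each record as a JSON string and write it out with saveAsTextFile (converting to a DataFrame and using Spark SQL's write.json is an alternative). A sketch of the formatting step, assuming the joined records are 4-tuples of strings; toJson is a hypothetical helper and the output path is only an example:

```scala
// Hypothetical formatter: turn one joined record into a JSON object string.
// Quotes and backslashes in values are escaped minimally; a real JSON library
// (e.g. json4s) would be safer for arbitrary input.
def toJson(rec: (String, String, String, String)): String = {
  val (name, address, zipcode, landmark) = rec
  def q(s: String) = "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\""
  s"""{"name":${q(name)},"address":${q(address)},"zipcode":${q(zipcode)},"landmark":${q(landmark)}}"""
}

// With Spark, the write would then be:
//   joined.map(toJson).saveAsTextFile("/tmp/joined-json")   // example path
```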