
String RDD join operation

I have an RDD of strings in Scala. The strings are ids. It looks something like this:

1
2
3
4

I have another RDD of (id, name) pairs, like this:

(1, Name1)
(2, Name2)
(3, Name3)
(4, Name4)
(5, Name5)
(6, Name6)

Now I want to get the names for all the ids in the first RDD. How do I do this?

I realized that if the first RDD were a pair RDD, I could just join the two RDDs. So why are join operations only available for pair RDDs?

Try this:

rdd1.map(x => (x, null)).join(rdd2).mapValues(x => x._2)
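On the "why" part: join matches records by key, so every element needs a key, and mapping each id to an (id, placeholder) pair supplies one. Here is a minimal runnable sketch of the one-liner above, using the sample data from the question. The object name `JoinIdsExample`, the helper `namesFor`, and the local master setting are my own choices for illustration, not anything from your code.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object JoinIdsExample {
  // Hypothetical helper: keeps the (id, name) pairs whose id occurs in `ids`.
  def namesFor(ids: RDD[String], names: RDD[(String, String)]): RDD[(String, String)] =
    ids
      .map(id => (id, ())) // pair each id with a dummy value so it becomes a pair RDD
      .join(names)         // RDD[(String, (Unit, String))], inner join on the id
      .mapValues(_._2)     // drop the dummy value, keep the name

  def main(args: Array[String]): Unit = {
    // Assumption: a local master, just to try this out on a single machine.
    val conf = new SparkConf().setAppName("join-ids").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val ids = sc.parallelize(Seq("1", "2", "3", "4"))
      val names = sc.parallelize((1 to 6).map(i => (i.toString, s"Name$i")))
      namesFor(ids, names).collect().sortBy(_._1).foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

Note that this is an inner join, so an id that appears twice in the first RDD will produce two output rows.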

Based on your comment on CafeFeed's answer, you could consider a "broadcast join" if the ids RDD is small enough.

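import org.apache.spark.rdd.RDD
val ids: RDD[Int] = ???
val names: RDD[(Int, String)] = ???
// Collect the (small) set of ids on the driver and broadcast it to the executors.
val bcIds = sc.broadcast(ids.collect().toSet)
// Filter on the id, which is the first tuple element; names is never shuffled.
val result = names.filter(x => bcIds.value.contains(x._1))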

The benefit is that you don't need to shuffle the names RDD, so if it is much larger than the ids RDD you avoid a significant amount of work. Otherwise, the simple join method is best.
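If you want to try the broadcast approach end to end, something like the following should work on the sample data from the question. Treat it as a sketch: `BroadcastFilterExample`, `namesFor`, and the local master are names and settings I made up for the example, and it assumes the ids fit comfortably in driver and executor memory.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object BroadcastFilterExample {
  // Hypothetical helper. Assumes `ids` is small enough to collect to the driver.
  def namesFor(ids: RDD[Int], names: RDD[(Int, String)]): RDD[(Int, String)] = {
    // Ship the id set to every executor once instead of shuffling `names`.
    val bcIds = ids.sparkContext.broadcast(ids.collect().toSet)
    // Keep a pair only when its id (first element) is in the broadcast set.
    names.filter { case (id, _) => bcIds.value.contains(id) }
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local master, just for experimenting.
    val conf = new SparkConf().setAppName("broadcast-filter").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val ids = sc.parallelize(Seq(1, 2, 3, 4))
      val names = sc.parallelize((1 to 6).map(i => (i, s"Name$i")))
      namesFor(ids, names).collect().sortBy(_._1).foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

Unlike the join, this returns each matching (id, name) pair once even if the id is repeated in the ids RDD, because the ids are collapsed into a set first.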


 