Spark two RDD join issue
I have two RDDs:
moviesRDD =[(1,'monster'),(2,'minions 3D'),...] #(movieID,title)
ratingsRDD =[(1,(3,4)),(2,(4,5)),.....] #(movieID,(numbersofrating,avg_rating))
The ideal result is:
newRDD =[(3,'monster',4),(4,'minions 3D',5),....] #(numbersofrating,title,avg_rating)
I am not sure how to get this newRDD.
This should do the trick:
(moviesRDD
.join(ratingsRDD) # Join by key
.values() # Extract values
.map(lambda x: (x[1][0], x[0], x[1][1]))) # Reshape
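To see what each stage produces, here is a plain-Python sketch of the same pipeline using the question's sample data (no Spark cluster needed; the list comprehensions stand in for `join`, `values`, and `map`, so this is an illustration of the transformations, not PySpark itself):

```python
# Sample data mirroring the question.
moviesRDD = [(1, 'monster'), (2, 'minions 3D')]   # (movieID, title)
ratingsRDD = [(1, (3, 4)), (2, (4, 5))]           # (movieID, (numbersofrating, avg_rating))

# .join(ratingsRDD): pair values sharing a key -> (movieID, (title, (num, avg)))
ratings = dict(ratingsRDD)
joined = [(mid, (title, ratings[mid])) for mid, title in moviesRDD if mid in ratings]

# .values(): drop the key -> (title, (num, avg))
values = [v for _, v in joined]

# .map(lambda x: (x[1][0], x[0], x[1][1])): reshape -> (num, title, avg)
newRDD = [(x[1][0], x[0], x[1][1]) for x in values]
# newRDD == [(3, 'monster', 4), (4, 'minions 3D', 5)]
```

Note that after the join, each value is a tuple `(title, (numbersofrating, avg_rating))`, which is why the final `map` indexes into the nested tuple to pull the count forward.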