
Scala / Spark - Aggregating RDD

Just wondering how I can do the following:

Suppose I have an RDD containing (username, age, movieBought) for many usernames, and some lines can have the same username and age but a different movieBought.

How can I remove the duplicated lines and transform it into (username, age, movieBought1, movieBought2...)?

Kind Regards

val grouped = rdd.groupBy(x => (x._1, x._2)).map(x => (x._1._1, x._1._2, x._2.map(_._3)))

val results = grouped.collect.toList
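To see what this grouping produces, the same logic can be run on a plain Scala collection, since `Seq` supports the same `groupBy`/`map` combinators as an RDD. This is only a local sketch with made-up sample data (the names `alice`/`bob` and the movie titles are illustrative, not from the question):

```scala
object GroupDemo extends App {
  // Sample data standing in for the RDD: (username, age, movieBought)
  val data = Seq(
    ("alice", 30, "Alien"),
    ("alice", 30, "Blade Runner"),
    ("bob", 25, "Up")
  )

  // Group by the (username, age) pair, then collect each group's movies
  // into a list, mirroring the RDD version above
  val grouped = data
    .groupBy(x => (x._1, x._2))
    .map { case ((name, age), rows) => (name, age, rows.map(_._3)) }

  grouped.foreach(println)
  // one line per distinct (username, age), e.g. (bob,25,List(Up))
}
```

The third element of each result tuple is a proper collection of movies, so individual titles stay directly accessible.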

UPDATE (if each tuple also has a number-of-movies item):

val grouped = rdd.groupBy(x => (x._1, x._2)).map(x => (x._1._1, x._1._2, x._2.map(m => (m._3, m._4))))

val results = grouped.collect.toList

I was gonna suggest collect and toList, but ka4eli beat me to it.

I guess you could also use a groupBy / groupByKey followed by a reduce / reduceByKey operation. The downside of this, of course, is that the results (movie1,movie2,movie3...) are concatenated into one string (instead of a List structure), which makes accessing them difficult.

val group = rdd.map(x => ((x.name, x.age), x.movie)).groupBy(_._1)
val result = group.map(x => (x._1._1, x._1._2, x._2.map(y => y._2).reduce(_ + "," + _)))
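The field accesses `x.name` / `x.age` / `x.movie` above imply the rows are records rather than bare tuples. A minimal local sketch of the same concatenation approach, assuming a hypothetical `Purchase` case class with those field names and illustrative sample data:

```scala
object ConcatDemo extends App {
  // Hypothetical record type matching the field names used above
  case class Purchase(name: String, age: Int, movie: String)

  val data = Seq(
    Purchase("alice", 30, "Alien"),
    Purchase("alice", 30, "Blade Runner"),
    Purchase("bob", 25, "Up")
  )

  // Key each row by (name, age), group, then fold each group's movies
  // into a single comma-separated string
  val result = data
    .map(x => ((x.name, x.age), x.movie))
    .groupBy(_._1)
    .map { case ((name, age), rows) => (name, age, rows.map(_._2).reduce(_ + "," + _)) }

  result.foreach(println)
}
```

As noted above, the comma-joined string is convenient to print but harder to work with afterwards; keeping the movies as a `List` (as in the first answer) is usually the better choice.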

