
Scala / Spark - Aggregating RDD

Just wondering how I can do the following:

Suppose I have an RDD containing (username, age, movieBought) for many usernames, and some lines can have the same username and age but a different movieBought.

How can I remove the duplicated lines and transform it into (username, age, movieBought1, movieBought2...)?

Kind Regards

val grouped = rdd.groupBy(x => (x._1, x._2)).map(x => (x._1._1, x._1._2, x._2.map(_._3)))

val results = grouped.collect.toList
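To see what this grouping produces, the same logic can be run on a plain Scala collection, since `Seq` supports the same `groupBy`/`map` combinators as an RDD. This is only a local sketch with made-up sample data (the names `alice`/`bob` and the movie titles are illustrative, not from the question):

```scala
object GroupDemo extends App {
  // Sample data standing in for the RDD: (username, age, movieBought)
  val data = Seq(
    ("alice", 30, "Alien"),
    ("alice", 30, "Blade Runner"),
    ("bob", 25, "Up")
  )

  // Group by the (username, age) pair, then collect each group's movies
  // into a list, mirroring the RDD version above
  val grouped = data
    .groupBy(x => (x._1, x._2))
    .map { case ((name, age), rows) => (name, age, rows.map(_._3)) }

  grouped.foreach(println)
  // one line per distinct (username, age), e.g. (bob,25,List(Up))
}
```

The third element of each result tuple is a proper collection of movies, so individual titles stay directly accessible.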

UPDATE (if each tuple also has a number-of-movies item):

val grouped = rdd.groupBy(x => (x._1, x._2)).map(x => (x._1._1, x._1._2, x._2.map(m => (m._3, m._4))))

val results = grouped.collect.toList

I was gonna suggest collect and toList, but ka4eli beat me to it.

I guess you could also use a groupBy / groupByKey followed by a reduce / reduceByKey operation. The downside of this, of course, is that the results (movie1,movie2,movie3...) are concatenated into one string (instead of a List structure), which makes accessing them difficult.

val group = rdd.map(x => ((x.name, x.age), x.movie)).groupBy(_._1)
val result = group.map(x => (x._1._1, x._1._2, x._2.map(y => y._2).reduce(_ + "," + _)))
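The field accesses `x.name` / `x.age` / `x.movie` above imply the rows are records rather than bare tuples. A minimal local sketch of the same concatenation approach, assuming a hypothetical `Purchase` case class with those field names and illustrative sample data:

```scala
object ConcatDemo extends App {
  // Hypothetical record type matching the field names used above
  case class Purchase(name: String, age: Int, movie: String)

  val data = Seq(
    Purchase("alice", 30, "Alien"),
    Purchase("alice", 30, "Blade Runner"),
    Purchase("bob", 25, "Up")
  )

  // Key each row by (name, age), group, then fold each group's movies
  // into a single comma-separated string
  val result = data
    .map(x => ((x.name, x.age), x.movie))
    .groupBy(_._1)
    .map { case ((name, age), rows) => (name, age, rows.map(_._2).reduce(_ + "," + _)) }

  result.foreach(println)
}
```

As noted above, the comma-joined string is convenient to print but harder to work with afterwards; keeping the movies as a `List` (as in the first answer) is usually the better choice.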

