
How to reduceByKey on an RddPair<K, Tuple> in Scala

I have a CassandraTable, accessed through SparkContext.cassandraTable(), from which I retrieve all my CassandraRows.

Now I want to store three pieces of information per row: (user, city, byte). I store them like this:

rddUsersFilter.map(row =>
  (row.getString("user"),
   (row.getString("city"), row.getString("byte").replace(",", "").toLong))
).groupByKey

This gives me an RDD[(String, Iterable[(String, Long)])]. Now, for each user, I want to sum all the bytes and build a Map for the cities, like "city" -> "occurrences" (how many times that city appears for that user).

Previously, I split this code into two different RDDs: one to sum the bytes, the other to build the map described above.

Example of the occurrence count per city:

rddUsers.map(user => (user._1, user._2.size,
  user._2.groupBy(identity).map(city => (city._1, city._2.size))))

That worked because I could access the second element of my tuple with the ._2 method. But now my second element is an Iterable[(String, Long)], and I can't map over it the way I did before.
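As an aside, the `groupBy(identity)` counting idiom used in the snippet above can be illustrated on a plain Scala collection, without Spark; the sample city list here is invented for illustration:

```scala
// Plain-Scala sketch (no Spark) of the groupBy(identity) counting idiom.
// The sample city list is made up for illustration.
val cities = List("city1", "city1", "city2", "city1")

// groupBy(identity) yields Map(city -> List(all equal occurrences));
// mapping each group to its size gives the occurrence count per city.
val occurrences: Map[String, Int] =
  cities.groupBy(identity).map { case (city, xs) => city -> xs.size }
// occurrences == Map("city1" -> 3, "city2" -> 1)
```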

Is there a solution that retrieves all my information with just one RDD and a single MapReduce pass?

You can do this easily by first aggregating the bytes and the occurrence count per (user, city), and then grouping by user:

val data = Array(("user1","city1",100),("user1","city1",100),
    ("user1","city1",100),("user1","city2",100),("user1","city2",100),
    ("user1","city3",100),("user1","city2",100),("user2","city1",100),
    ("user2","city2",100))
val rdd = sc.parallelize(data)

// key by (user, city); the value carries an occurrence count of 1 and the bytes
val res = rdd.map(x => ((x._1, x._2), (1, x._3)))
             .reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))
             .map(x => (x._1._1, (x._1._2, x._2._1, x._2._2)))
             .groupByKey
// per user: build the city -> occurrences map and sum the bytes
val userCityUsageRdd = res.map(x => {
  val m = x._2.toList
  (x._1, m.map(y => y._1 -> y._2).toMap, m.map(_._3).reduce(_ + _))
})

Output:

res20: Array[(String, scala.collection.immutable.Map[String,Int], Int)] = 
Array((user1,Map(city1 -> 3, city3 -> 1, city2 -> 3),700), 
      (user2,Map(city1 -> 1, city2 -> 1),200))
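The same two-step aggregation can be sketched with plain Scala collections, with `groupBy` standing in for Spark's `reduceByKey`/`groupByKey`, which makes the logic easy to check without a cluster. Apart from the data from the answer, the names here (`perUserCity`, `perUser`) are illustrative:

```scala
// Plain-Scala sketch of the answer's pipeline; groupBy stands in for
// Spark's reduceByKey/groupByKey, so no SparkContext is required.
val data = List(("user1","city1",100),("user1","city1",100),
  ("user1","city1",100),("user1","city2",100),("user1","city2",100),
  ("user1","city3",100),("user1","city2",100),("user2","city1",100),
  ("user2","city2",100))

// step 1: per (user, city), count occurrences and sum bytes
val perUserCity = data.groupBy(t => (t._1, t._2)).toList.map {
  case ((user, city), rows) => (user, city, rows.size, rows.map(_._3).sum)
}

// step 2: per user, build the city -> occurrences map and total the bytes
val perUser: Map[String, (Map[String, Int], Int)] =
  perUserCity.groupBy(_._1).map { case (user, rows) =>
    user -> (rows.map(r => r._2 -> r._3).toMap, rows.map(_._4).sum)
  }
// perUser("user1") == (Map("city1" -> 3, "city2" -> 3, "city3" -> 1), 700)
```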

