
Aggregating sum for RDD in Scala (Spark)

If I have a variable such as books: RDD[(String, Integer, Integer)], how do I merge entries with the same String (which could represent a title) and sum the two corresponding integers (which could represent pages and price)?

For example:

[("book1", 20, 10),
 ("book2", 5, 10),
 ("book1", 100, 100)]

becomes

[("book1", 120, 110),
 ("book2", 5, 10)]

With an RDD you can use reduceByKey.

case class Book(name: String, i: Int, j: Int) {
  // reduceByKey only combines values that share a key, so the else branch is just a safety net
  def +(b: Book): Book =
    if (name == b.name) Book(name, i + b.i, j + b.j)
    else throw new IllegalArgumentException(s"cannot add $name and ${b.name}")
}

val rdd = sc.parallelize(Seq(
  Book("book1", 20, 10),
  Book("book2", 5, 10),
  Book("book1", 100, 100)))

val aggRdd = rdd.map(book => (book.name, book))
   .reduceByKey(_+_) // reduce calling our defined `+` function
   .map(_._2)        // we don't need the tuple anymore, just get the Books

aggRdd.foreach(println)
// Book(book1,120,110)
// Book(book2,5,10)
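
If you need the result back in the (String, Int, Int) tuple shape the question asks for, one more map over the aggregated Books does it (a minimal sketch continuing the code above):

aggRdd.map(b => (b.name, b.i, b.j)) // back to (title, pages, price) tuples
  .collect()
  .foreach(println)
// (book1,120,110)
// (book2,5,10)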

Try converting it first to a key-tuple RDD and then performing a reduceByKey:

yourRDD.map(t => (t._1, (t._2, t._3)))
  .reduceByKey((acc, elem) => (acc._1 + elem._1, acc._2 + elem._2))

Output:

(book2,(5,10))
(book1,(120,110))
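
The result here is nested as (title, (pages, price)); if you want flat 3-tuples instead, a final map with a pattern match flattens it (a small sketch reusing the same pipeline):

yourRDD.map(t => (t._1, (t._2, t._3)))
  .reduceByKey((acc, elem) => (acc._1 + elem._1, acc._2 + elem._2))
  .map { case (title, (pages, price)) => (title, pages, price) } // flatten back to 3-tuples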

Just use a Dataset:

import org.apache.spark.sql.SparkSession

val spark: SparkSession = SparkSession.builder.getOrCreate()

val rdd = spark.sparkContext.parallelize(Seq(
  ("book1", 20, 10), ("book2", 5, 10), ("book1", 100, 100)
))

spark.createDataFrame(rdd).groupBy("_1").sum().show()

// +-----+-------+-------+                                                         
// |   _1|sum(_2)|sum(_3)|
// +-----+-------+-------+
// |book1|    120|    110|
// |book2|      5|     10|
// +-----+-------+-------+
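
Naming the columns makes the grouped output easier to read; a variant sketch with assumed column names title, pages, and price (the implicits import enables toDF on the RDD):

import spark.implicits._
import org.apache.spark.sql.functions.sum

rdd.toDF("title", "pages", "price")
  .groupBy("title")
  .agg(sum("pages").as("pages"), sum("price").as("price"))
  .show()

// +-----+-----+-----+
// |title|pages|price|
// +-----+-----+-----+
// |book1|  120|  110|
// |book2|    5|   10|
// +-----+-----+-----+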
