
Aggregating sum for RDD in Scala (Spark)

If I have a variable such as books: RDD[(String, Integer, Integer)], how do I merge entries with the same String (which could represent a title) and sum the two corresponding integers (which could represent pages and price)?

For example:

[("book1", 20, 10),
 ("book2", 5, 10),
 ("book1", 100, 100)]

becomes

[("book1", 120, 110),
 ("book2", 5, 10)]

With an RDD you can use reduceByKey.

case class Book(name: String, i: Int, j: Int) {
  // reduceByKey only combines values that share a key, so the else branch is just a safety net
  def +(b: Book): Book =
    if (name == b.name) Book(name, i + b.i, j + b.j)
    else throw new IllegalArgumentException(s"cannot add $name and ${b.name}")
}

val rdd = sc.parallelize(Seq(
  Book("book1", 20, 10),
  Book("book2", 5, 10),
  Book("book1", 100, 100)))

val aggRdd = rdd.map(book => (book.name, book))
   .reduceByKey(_+_) // reduce calling our defined `+` function
   .map(_._2)        // we don't need the tuple anymore, just get the Books

aggRdd.foreach(println)
// Book(book1,120,110)
// Book(book2,5,10)
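
If you need the result back in the (String, Int, Int) tuple shape the question asks for, one more map over the aggregated Books does it (a minimal sketch continuing the code above):

aggRdd.map(b => (b.name, b.i, b.j)) // back to (title, pages, price) tuples
  .collect()
  .foreach(println)
// (book1,120,110)
// (book2,5,10)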

Try converting it first to a key-tuple RDD and then performing a reduceByKey:

yourRDD.map(t => (t._1, (t._2, t._3)))
  .reduceByKey((acc, elem) => (acc._1 + elem._1, acc._2 + elem._2))

Output:

(book2,(5,10))
(book1,(120,110))
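
The result here is nested as (title, (pages, price)); if you want flat 3-tuples instead, a final map with a pattern match flattens it (a small sketch reusing the same pipeline):

yourRDD.map(t => (t._1, (t._2, t._3)))
  .reduceByKey((acc, elem) => (acc._1 + elem._1, acc._2 + elem._2))
  .map { case (title, (pages, price)) => (title, pages, price) } // flatten back to 3-tuples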

Just use a Dataset:

import org.apache.spark.sql.SparkSession

val spark: SparkSession = SparkSession.builder.getOrCreate()

val rdd = spark.sparkContext.parallelize(Seq(
  ("book1", 20, 10), ("book2", 5, 10), ("book1", 100, 100)
))

spark.createDataFrame(rdd).groupBy("_1").sum().show()

// +-----+-------+-------+                                                         
// |   _1|sum(_2)|sum(_3)|
// +-----+-------+-------+
// |book1|    120|    110|
// |book2|      5|     10|
// +-----+-------+-------+
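
Naming the columns makes the grouped output easier to read; a variant sketch with assumed column names title, pages, and price (the implicits import enables toDF on the RDD):

import spark.implicits._
import org.apache.spark.sql.functions.sum

rdd.toDF("title", "pages", "price")
  .groupBy("title")
  .agg(sum("pages").as("pages"), sum("price").as("price"))
  .show()

// +-----+-----+-----+
// |title|pages|price|
// +-----+-----+-----+
// |book1|  120|  110|
// |book2|    5|   10|
// +-----+-----+-----+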
