Aggregating sum for RDD in Scala (Spark)
If I have a variable such as books: RDD[(String, Integer, Integer)]
, how do I merge entries with the same String (which could represent a title), summing the two corresponding integers (which could represent pages and price)?
ex:
[("book1", 20, 10),
("book2", 5, 10),
("book1", 100, 100)]
becomes
[("book1", 120, 110),
("book2", 5, 10)]
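As a quick local sanity check (not part of the original answers), the same aggregation can be sketched with plain Scala collections using groupMapReduce (Scala 2.13+). The object and method names here are hypothetical:

```scala
object AggSketch {
  // Group by title and pairwise-sum the (pages, price) values,
  // mirroring what reduceByKey does on an RDD, but locally.
  def aggregate(books: Seq[(String, Int, Int)]): Map[String, (Int, Int)] =
    books
      .map { case (name, pages, price) => name -> (pages, price) }
      .groupMapReduce(_._1)(_._2) { case ((p1, c1), (p2, c2)) => (p1 + p2, c1 + c2) }

  def main(args: Array[String]): Unit = {
    val input = Seq(("book1", 20, 10), ("book2", 5, 10), ("book1", 100, 100))
    println(AggSketch.aggregate(input))
  }
}
```

This runs without a Spark cluster, which makes it handy for unit-testing the combining logic before wiring it into an RDD pipeline.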
With an RDD you can use reduceByKey.
case class Book(name: String, i: Int, j: Int) {
  def +(b: Book): Book =
    if (name == b.name) Book(name, i + b.i, j + b.j)
    else throw new IllegalArgumentException(s"Cannot add books with different names: $name, ${b.name}")
}
val rdd = sc.parallelize(Seq(
  Book("book1", 20, 10),
  Book("book2", 5, 10),
  Book("book1", 100, 100)))

val aggRdd = rdd.map(book => (book.name, book))
  .reduceByKey(_ + _) // reduce using our `+` method defined above
  .map(_._2)          // drop the key; we just want the Books
aggRdd.foreach(println)
// Book(book1,120,110)
// Book(book2,5,10)
Try converting it first to a key-value pair RDD and then performing a reduceByKey:
yourRDD.map(t => (t._1, (t._2, t._3)))
.reduceByKey((acc, elem) => (acc._1 + elem._1, acc._2 + elem._2))
Output:
(book2,(5,10))
(book1,(120,110))
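If you need the result back in the original flat (String, Int, Int) shape, one more map over the pairs does it; the same `map` call applies on the RDD. The object and method names below are hypothetical, and the sketch uses plain Scala collections so it runs without Spark:

```scala
object FlattenSketch {
  // Turn (key, (a, b)) pairs back into flat (key, a, b) triples.
  def flatten(pairs: Seq[(String, (Int, Int))]): Seq[(String, Int, Int)] =
    pairs.map { case (name, (pages, price)) => (name, pages, price) }

  def main(args: Array[String]): Unit = {
    println(FlattenSketch.flatten(Seq(("book1", (120, 110)), ("book2", (5, 10)))))
  }
}
```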
Just use a Dataset:
val spark: SparkSession = SparkSession.builder.getOrCreate()
val rdd = spark.sparkContext.parallelize(Seq(
("book1", 20, 10), ("book2", 5, 10), ("book1", 100, 100)
))
spark.createDataFrame(rdd).groupBy("_1").sum().show()
// +-----+-------+-------+
// | _1|sum(_2)|sum(_3)|
// +-----+-------+-------+
// |book1| 120| 110|
// |book2| 5| 10|
// +-----+-------+-------+