Spark: mapGroups on a Dataset
I'm trying the mapGroups function on the dataset below and I'm not sure why I'm getting 0 for the "TotalValue" column. Am I missing something here? Please advise.
Spark version: 2.0, Scala version: 2.11
case class Record(Hour: Int, Category: String,TotalComm: Double, TotalValue: Int)
val ss = SparkSession.builder().getOrCreate()
import ss.implicits._
val df: DataFrame = ss.sparkContext.parallelize(Seq(
(0, "cat26", 30.9, 200), (0, "cat26", 22.1, 100), (0, "cat95", 19.6, 300), (1, "cat4", 1.3, 100),
(1, "cat23", 28.5, 100), (1, "cat4", 26.8, 400), (1, "cat13", 12.6, 250), (1, "cat23", 5.3, 300),
(0, "cat26", 39.6, 30), (2, "cat40", 29.7, 500), (1, "cat4", 27.9, 600), (2, "cat68", 9.8, 100),
(1, "cat23", 35.6, 500))).toDF("Hour", "Category","TotalComm", "TotalValue")
val resultSum = df.as[Record].map(row => ((row.Hour,row.Category),(row.TotalComm,row.TotalValue)))
.groupByKey(_._1).mapGroups{case(k,iter) => (k._1,k._2,iter.map(x => x._2._1).sum,iter.map(y => y._2._2).sum)}
.toDF("KeyHour","KeyCategory","TotalComm","TotalValue").orderBy(asc("KeyHour"))
resultSum.show()
+-------+-----------+---------+----------+
|KeyHour|KeyCategory|TotalComm|TotalValue|
+-------+-----------+---------+----------+
| 0| cat26| 92.6| 0|
| 0| cat95| 19.6| 0|
| 1| cat13| 12.6| 0|
| 1| cat23| 69.4| 0|
| 1| cat4| 56.0| 0|
| 2| cat40| 29.7| 0|
| 2| cat68| 9.8| 0|
+-------+-----------+---------+----------+
iter inside mapGroups is an Iterator, and an Iterator can be traversed only once. So when you sum with iter.map(x => x._2._1).sum, nothing is left in the iterator, and the second pass iter.map(y => y._2._2).sum therefore yields 0. You will have to find a way to compute both sums in the same traversal.
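The behaviour is easy to reproduce with a plain Scala Iterator, no Spark required:

```scala
// A plain Iterator can be traversed only once.
val iter = Iterator((30.9, 200), (22.1, 100), (39.6, 30))
val commSum  = iter.map(_._1).sum  // consumes the whole iterator (~92.6)
val valueSum = iter.map(_._2).sum  // iterator is now empty, so this is 0
```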
for loop with ListBuffers
For simplicity I have used a for loop and two ListBuffers to compute both sums in one pass:
import scala.collection.mutable.ListBuffer

val resultSum = df.as[Record].map(row => ((row.Hour, row.Category), (row.TotalComm, row.TotalValue)))
  .groupByKey(_._1).mapGroups { case (k, iter) =>
    val listBuffer1 = new ListBuffer[Double]
    val listBuffer2 = new ListBuffer[Int]
    for (a <- iter) {
      listBuffer1 += a._2._1
      listBuffer2 += a._2._2
    }
    (k._1, k._2, listBuffer1.sum, listBuffer2.sum)
  }
  .toDF("KeyHour", "KeyCategory", "TotalComm", "TotalValue").orderBy($"KeyHour".asc)
This should give you the correct result:
+-------+-----------+---------+----------+
|KeyHour|KeyCategory|TotalComm|TotalValue|
+-------+-----------+---------+----------+
| 0| cat26| 92.6| 330|
| 0| cat95| 19.6| 300|
| 1| cat23| 69.4| 900|
| 1| cat13| 12.6| 250|
| 1| cat4| 56.0| 1100|
| 2| cat68| 9.8| 100|
| 2| cat40| 29.7| 500|
+-------+-----------+---------+----------+
I hope the answer is helpful.
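As an alternative to the ListBuffer version, both sums can also be accumulated in a single traversal with foldLeft. A minimal sketch on a plain Iterator (sumBoth is an illustrative helper, not part of the answer's code):

```scala
// Sketch: accumulate both sums in one pass over the iterator,
// avoiding the intermediate ListBuffers. `sumBoth` is a hypothetical helper.
def sumBoth(iter: Iterator[(Double, Int)]): (Double, Int) =
  iter.foldLeft((0.0, 0)) { case ((c, v), (comm, value)) => (c + comm, v + value) }

val (comm, value) = sumBoth(Iterator((30.9, 200), (22.1, 100), (39.6, 30)))
// comm is ~92.6, value is 330
```

The same fold can be dropped into the mapGroups body in place of the for loop.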
As Ramesh Maharjan has pointed out, the issue lies in using the iterator twice, which results in the TotalValue column being 0. However, there is no need to use groupByKey and mapGroups in the first place. The same can be accomplished with groupBy and agg, which gives much cleaner and easier-to-read code. As a plus, it avoids the slow groupByKey as well.
The following will work just as well:
val resultSum = df.groupBy($"Hour", $"Category")
.agg(sum($"TotalComm").as("TotalComm"), sum($"TotalValue").as("TotalValue"))
.orderBy(asc("Hour"))
Result:
+----+--------+---------+----------+
|Hour|Category|TotalComm|TotalValue|
+----+--------+---------+----------+
| 0| cat95| 19.6| 300|
| 0| cat26| 92.6| 330|
| 1| cat23| 69.4| 900|
| 1| cat13| 12.6| 250|
| 1| cat4| 56.0| 1100|
| 2| cat68| 9.8| 100|
| 2| cat40| 29.7| 500|
+----+--------+---------+----------+
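For intuition, here is roughly what groupBy + agg computes, sketched on an in-memory Scala collection (note that each group here is a Seq which, unlike the Iterator inside mapGroups, can safely be traversed more than once):

```scala
// Sketch of the aggregation on a plain collection (no Spark needed).
val rows = Seq(
  (0, "cat26", 30.9, 200), (0, "cat26", 22.1, 100), (0, "cat95", 19.6, 300))
val sums = rows
  .groupBy(r => (r._1, r._2))                     // group by (Hour, Category)
  .map { case ((h, c), rs) =>
    (h, c, rs.map(_._3).sum, rs.map(_._4).sum) }  // a Seq can be mapped twice safely
```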
If you still want to rename the Hour and Category columns, that is easily done by changing the groupBy to
groupBy($"Hour".as("KeyHour"), $"Category".as("KeyCategory"))