Spark: mapGroups on a Dataset
I'm trying the mapGroups function on the dataset below and I'm not sure why I'm getting 0 for the "TotalValue" column. Am I missing something here? Please advise.
Spark version: 2.0, Scala version: 2.11
case class Record(Hour: Int, Category: String,TotalComm: Double, TotalValue: Int)
val ss = SparkSession.builder().getOrCreate()
import ss.implicits._
val df: DataFrame = ss.sparkContext.parallelize(Seq(
(0, "cat26", 30.9, 200), (0, "cat26", 22.1, 100), (0, "cat95", 19.6, 300), (1, "cat4", 1.3, 100),
(1, "cat23", 28.5, 100), (1, "cat4", 26.8, 400), (1, "cat13", 12.6, 250), (1, "cat23", 5.3, 300),
(0, "cat26", 39.6, 30), (2, "cat40", 29.7, 500), (1, "cat4", 27.9, 600), (2, "cat68", 9.8, 100),
(1, "cat23", 35.6, 500))).toDF("Hour", "Category","TotalComm", "TotalValue")
val resultSum = df.as[Record].map(row => ((row.Hour,row.Category),(row.TotalComm,row.TotalValue)))
.groupByKey(_._1).mapGroups{case(k,iter) => (k._1,k._2,iter.map(x => x._2._1).sum,iter.map(y => y._2._2).sum)}
.toDF("KeyHour","KeyCategory","TotalComm","TotalValue").orderBy(asc("KeyHour"))
resultSum.show()
+-------+-----------+---------+----------+
|KeyHour|KeyCategory|TotalComm|TotalValue|
+-------+-----------+---------+----------+
| 0| cat26| 92.6| 0|
| 0| cat95| 19.6| 0|
| 1| cat13| 12.6| 0|
| 1| cat23| 69.4| 0|
| 1| cat4| 56.0| 0|
| 2| cat40| 29.7| 0|
| 2| cat68| 9.8| 0|
+-------+-----------+---------+----------+
iter inside mapGroups is an Iterator, and an Iterator can be traversed only once. So when you sum with iter.map(x => x._2._1).sum, nothing is left in the iterator, and the second pass iter.map(y => y._2._2).sum therefore yields 0. You will have to find a way to compute both sums in the same traversal.
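The behaviour is easy to reproduce with a plain Scala Iterator, no Spark required:

```scala
// A plain Iterator can be traversed only once.
val iter = Iterator((30.9, 200), (22.1, 100), (39.6, 30))
val commSum  = iter.map(_._1).sum  // consumes the whole iterator (~92.6)
val valueSum = iter.map(_._2).sum  // iterator is now empty, so this is 0
```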
for loop with ListBuffers
For simplicity I have used a for loop and two ListBuffers to compute both sums in one pass:
import scala.collection.mutable.ListBuffer

val resultSum = df.as[Record].map(row => ((row.Hour, row.Category), (row.TotalComm, row.TotalValue)))
  .groupByKey(_._1).mapGroups { case (k, iter) =>
    val listBuffer1 = new ListBuffer[Double]
    val listBuffer2 = new ListBuffer[Int]
    for (a <- iter) {
      listBuffer1 += a._2._1
      listBuffer2 += a._2._2
    }
    (k._1, k._2, listBuffer1.sum, listBuffer2.sum)
  }
  .toDF("KeyHour", "KeyCategory", "TotalComm", "TotalValue").orderBy($"KeyHour".asc)
This should give you the correct result:
+-------+-----------+---------+----------+
|KeyHour|KeyCategory|TotalComm|TotalValue|
+-------+-----------+---------+----------+
| 0| cat26| 92.6| 330|
| 0| cat95| 19.6| 300|
| 1| cat23| 69.4| 900|
| 1| cat13| 12.6| 250|
| 1| cat4| 56.0| 1100|
| 2| cat68| 9.8| 100|
| 2| cat40| 29.7| 500|
+-------+-----------+---------+----------+
I hope the answer is helpful.
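As an alternative to the ListBuffer version, both sums can also be accumulated in a single traversal with foldLeft. A minimal sketch on a plain Iterator (sumBoth is an illustrative helper, not part of the answer's code):

```scala
// Sketch: accumulate both sums in one pass over the iterator,
// avoiding the intermediate ListBuffers. `sumBoth` is a hypothetical helper.
def sumBoth(iter: Iterator[(Double, Int)]): (Double, Int) =
  iter.foldLeft((0.0, 0)) { case ((c, v), (comm, value)) => (c + comm, v + value) }

val (comm, value) = sumBoth(Iterator((30.9, 200), (22.1, 100), (39.6, 30)))
// comm is ~92.6, value is 330
```

The same fold can be dropped into the mapGroups body in place of the for loop.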
As Ramesh Maharjan has pointed out, the issue lies in using the iterator twice, which results in the TotalValue column being 0. However, there is no need to use groupByKey and mapGroups in the first place. The same can be accomplished with groupBy and agg, which gives much cleaner and easier-to-read code. As a plus, it avoids the slow groupByKey as well.
The following will work just as well:
val resultSum = df.groupBy($"Hour", $"Category")
.agg(sum($"TotalComm").as("TotalComm"), sum($"TotalValue").as("TotalValue"))
.orderBy(asc("Hour"))
Result:
+----+--------+---------+----------+
|Hour|Category|TotalComm|TotalValue|
+----+--------+---------+----------+
| 0| cat95| 19.6| 300|
| 0| cat26| 92.6| 330|
| 1| cat23| 69.4| 900|
| 1| cat13| 12.6| 250|
| 1| cat4| 56.0| 1100|
| 2| cat68| 9.8| 100|
| 2| cat40| 29.7| 500|
+----+--------+---------+----------+
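For intuition, here is roughly what groupBy + agg computes, sketched on an in-memory Scala collection (note that each group here is a Seq which, unlike the Iterator inside mapGroups, can safely be traversed more than once):

```scala
// Sketch of the aggregation on a plain collection (no Spark needed).
val rows = Seq(
  (0, "cat26", 30.9, 200), (0, "cat26", 22.1, 100), (0, "cat95", 19.6, 300))
val sums = rows
  .groupBy(r => (r._1, r._2))                     // group by (Hour, Category)
  .map { case ((h, c), rs) =>
    (h, c, rs.map(_._3).sum, rs.map(_._4).sum) }  // a Seq can be mapped twice safely
```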
If you still want to rename the Hour and Category columns, that is easily done by changing the groupBy to
groupBy($"Hour".as("KeyHour"), $"Category".as("KeyCategory"))