Spark: mapGroups on a Dataset

I'm trying the mapGroups function on the dataset below and I'm not sure why I'm getting 0 for the "TotalValue" column. Am I missing something here? Please advise.

Spark version: 2.0, Scala version: 2.11

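import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.asc

case class Record(Hour: Int, Category: String, TotalComm: Double, TotalValue: Int)
// Session setup was elided in the original post; a local session is assumed here.
val ss: SparkSession = SparkSession.builder().master("local[*]").getOrCreate()
import ss.implicits._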

val df: DataFrame = ss.sparkContext.parallelize(Seq(
(0, "cat26", 30.9, 200), (0, "cat26", 22.1, 100), (0, "cat95", 19.6, 300), (1, "cat4", 1.3, 100),
(1, "cat23", 28.5, 100), (1, "cat4", 26.8, 400), (1, "cat13", 12.6, 250), (1, "cat23", 5.3, 300),
(0, "cat26", 39.6, 30), (2, "cat40", 29.7, 500), (1, "cat4", 27.9, 600), (2, "cat68", 9.8, 100),
(1, "cat23", 35.6, 500))).toDF("Hour", "Category","TotalComm", "TotalValue")

val resultSum = df.as[Record].map(row => ((row.Hour,row.Category),(row.TotalComm,row.TotalValue)))
.groupByKey(_._1).mapGroups{case(k,iter) => (k._1,k._2,iter.map(x => x._2._1).sum,iter.map(y => y._2._2).sum)}
.toDF("KeyHour","KeyCategory","TotalComm","TotalValue").orderBy(asc("KeyHour"))

resultSum.show()

+-------+-----------+---------+----------+
|KeyHour|KeyCategory|TotalComm|TotalValue|
+-------+-----------+---------+----------+
|      0|      cat26|     92.6|         0|
|      0|      cat95|     19.6|         0|
|      1|      cat13|     12.6|         0|
|      1|      cat23|     69.4|         0|
|      1|       cat4|     56.0|         0|
|      2|      cat40|     29.7|         0|
|      2|      cat68|      9.8|         0|
+-------+-----------+---------+----------+  

The iter inside mapGroups is an Iterator, and it can be traversed only once. So once you sum with iter.map(x => x._2._1).sum, nothing is left in iter, and the iter.map(y => y._2._2).sum operation therefore yields 0. You will have to find a way to calculate both sums in the same iteration.
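The single-use behaviour can be reproduced without Spark. The sketch below is plain Scala with illustrative values (not the question's data); the names IteratorOnce, twoPasses and materialized are hypothetical. It contrasts traversing one iterator twice with copying the group into a List first.

```scala
object IteratorOnce {
  // Mirrors the question's approach: two separate traversals of one iterator.
  def twoPasses(iter: Iterator[(Double, Int)]): (Double, Int) = {
    val comm = iter.map(_._1).sum   // consumes every element of iter
    val value = iter.map(_._2).sum  // iter is already empty, so this sums nothing
    (comm, value)
  }

  // Copies the group into a List first, so it can be traversed repeatedly
  // (at the cost of holding the whole group in memory).
  def materialized(iter: Iterator[(Double, Int)]): (Double, Int) = {
    val rows = iter.toList
    (rows.map(_._1).sum, rows.map(_._2).sum)
  }

  def main(args: Array[String]): Unit = {
    val group = List((1.5, 100), (2.5, 400), (4.0, 600))
    println(twoPasses(group.iterator))    // second component comes out as 0
    println(materialized(group.iterator)) // both components are summed
  }
}
```

Materializing with toList is the smallest change to the original code, but it keeps every row of a group in memory, so it is only a reasonable choice when groups are small.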

for loop with ListBuffers

For simplicity, I have used a for loop and two ListBuffers to sum both columns at once.

import scala.collection.mutable.ListBuffer

val resultSum = df.as[Record].map(row => ((row.Hour, row.Category), (row.TotalComm, row.TotalValue)))
  .groupByKey(_._1).mapGroups { case (k, iter) =>
    val listBuffer1 = new ListBuffer[Double]
    val listBuffer2 = new ListBuffer[Int]
    // Traverse iter exactly once, collecting both columns in the same pass.
    for (a <- iter) {
      listBuffer1 += a._2._1
      listBuffer2 += a._2._2
    }
    (k._1, k._2, listBuffer1.sum, listBuffer2.sum)
  }
  .toDF("KeyHour", "KeyCategory", "TotalComm", "TotalValue").orderBy($"KeyHour".asc)

This should give you the correct result:

+-------+-----------+---------+----------+
|KeyHour|KeyCategory|TotalComm|TotalValue|
+-------+-----------+---------+----------+
|      0|      cat26|     92.6|       330|
|      0|      cat95|     19.6|       300|
|      1|      cat23|     69.4|       900|
|      1|      cat13|     12.6|       250|
|      1|       cat4|     56.0|      1100|
|      2|      cat68|      9.8|       100|
|      2|      cat40|     29.7|       500|
+-------+-----------+---------+----------+
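The ListBuffers hold every value of a group before summing. As a possible alternative, a foldLeft keeps only two running totals while still traversing the iterator once. This is a minimal sketch in plain Scala; sumGroup and GroupSums are hypothetical names, and the element type is assumed to match the ((Hour, Category), (TotalComm, TotalValue)) pairs built in the question.

```scala
object GroupSums {
  type Key = (Int, String)
  type Row = (Key, (Double, Int))

  // Sums TotalComm and TotalValue in a single traversal of the group's iterator.
  def sumGroup(k: Key, iter: Iterator[Row]): (Int, String, Double, Int) = {
    val (comm, value) = iter.foldLeft((0.0, 0)) {
      case ((commAcc, valueAcc), (_, (c, v))) => (commAcc + c, valueAcc + v)
    }
    (k._1, k._2, comm, value)
  }

  def main(args: Array[String]): Unit = {
    // Illustrative rows for one key, not the question's data.
    val key = (1, "cat4")
    val rows = List((key, (1.5, 100)), (key, (2.5, 400)), (key, (4.0, 600)))
    println(sumGroup(key, rows.iterator))
  }
}
```

Under those assumptions, the body of the mapGroups call could become case (k, iter) => GroupSums.sumGroup(k, iter), with the rest of the pipeline unchanged.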

I hope the answer is helpful.

As Ramesh Maharjan has pointed out, the issue lies in using the iterator twice, which results in the TotalValue column being 0. However, there is no need to use groupByKey and mapGroups in the first place. The same can be accomplished using groupBy and agg, which results in much cleaner and easier-to-read code. As a plus, it also avoids the slow groupByKey.

The following will work just as well:

import org.apache.spark.sql.functions.{asc, sum}

val resultSum = df.groupBy($"Hour", $"Category")
  .agg(sum($"TotalComm").as("TotalComm"), sum($"TotalValue").as("TotalValue"))
  .orderBy(asc("Hour"))

Result:

+----+--------+---------+----------+
|Hour|Category|TotalComm|TotalValue|
+----+--------+---------+----------+
|   0|   cat95|     19.6|       300|
|   0|   cat26|     92.6|       330|
|   1|   cat23|     69.4|       900|
|   1|   cat13|     12.6|       250|
|   1|    cat4|     56.0|      1100|
|   2|   cat68|      9.8|       100|
|   2|   cat40|     29.7|       500|
+----+--------+---------+----------+

If you still want to change the names of the Hour and Category columns, that is easily done by changing the groupBy to:

groupBy($"Hour".as("KeyHour"), $"Category".as("KeyCategory"))
