
Sum the column of a data frame in Spark 2.2.0 and Scala

I am trying to sum columns in the following data frame in Spark/Scala, which was itself created from another data frame. I was using this answer as a guide: How to sum the values of one column of a dataframe in spark/scala

Here's my data, created from another aggregate function and assigned to a data frame:

+-------------+----+----+
|activityLabel| 1_3|4_12|
+-------------+----+----+
|           12|1075|   0|
|            1|   0|3072|
|            6|3072|   0|
|            3|   0|3072|
|            5|3072|   0|
|            9|3072|   0|
|            4|3072|   0|
|            8|3379|   0|
|            7|3072|   0|
|           10|3072|   0|
|           11|3072|   0|
|            2|   0|3072|
+-------------+----+----+

And here's my code to create the data frame:

def createRangeActivityLabels(df: DataFrame): Unit = {

  val activityRange: List[(Int, Int)] = List((1, 3), (4, 12))

  val exprs: List[Column] = activityRange.map {
    case (x, y) => {
      val newLabel = s"${x}_${y}"
      sum(when($"activityLabel".between(x, y), 0).otherwise(1)).alias(newLabel)
    }
  }

  val df3: DataFrame = df.groupBy($"activityLabel").agg(exprs.head, exprs.tail: _*)
  df3.show

And here's the code to get the sum. All I want to do is sum the columns labelled 1_3 (exprs.head) and 4_12 (exprs(1)):

  val indexedLabel0: Int = df3.agg(sum(exprs.head)).first.getAs[Int](0)
}

I get the following error: org.apache.spark.sql.AnalysisException: It is not allowed to use an aggregate function in the argument of another aggregate function. Please use the inner aggregate function in a sub-query.;;

I have tried multiple solutions to fix this, but nothing seems to work. All ideas appreciated. Thanks!

The problem is that exprs.head evaluates to sum(when($"activityLabel".between(x, y), 0).otherwise(1)).alias(newLabel). So when you call sum(exprs.head), Spark tries to evaluate a sum of a sum, which is a nested aggregate.
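To see the nested aggregate being rejected, here is a minimal sketch. It assumes a local SparkSession and a small made-up activityLabel column; the object name NestedAggregateDemo and the helper nestedSumIsRejected are hypothetical and not from the question. Spark is expected to raise the AnalysisException while analyzing the plan, so wrapping the call in Try captures it.

```scala
import org.apache.spark.sql.{AnalysisException, Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, sum, when}
import scala.util.{Failure, Try}

object NestedAggregateDemo {
  // Same shape as exprs.head in the question: an aggregate carrying an alias.
  val inner: Column =
    sum(when(col("activityLabel").between(1, 3), 0).otherwise(1)).alias("1_3")

  // Returns true when Spark refuses sum(sum(...)) with an AnalysisException.
  def nestedSumIsRejected(df: DataFrame): Boolean = {
    val df3 = df.groupBy(col("activityLabel")).agg(inner)
    Try(df3.agg(sum(inner))) match {
      case Failure(_: AnalysisException) => true
      case _                             => false
    }
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, for illustration only.
    val spark = SparkSession.builder().master("local[1]").appName("nested-agg").getOrCreate()
    import spark.implicits._

    // Hypothetical sample data; only the column name matches the question.
    val df = Seq(1, 1, 2, 5, 12).toDF("activityLabel")
    println(s"Nested sum rejected: ${nestedSumIsRejected(df)}")

    spark.stop()
  }
}
```

Passing the aggregate expression itself to sum is what triggers the error; referring to the already aggregated column by name avoids it.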

I think you only need the column name:

val columnsName: List[Column] = activityRange.map {
    case (x, y) => $"${x}_${y}"
}
val indexedLabel0 = df3.agg(sum(columnsName.head)).first.getAs[Long](0)
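The same idea can be extended to total every range column in one pass. The following is a sketch under assumptions rather than the asker's actual code: the object name SumAggregatedColumns, the helpers rangeCounts and columnTotals, the local SparkSession, and the sample data are all hypothetical. It uses col(...) instead of the $ interpolator so the helpers do not need spark.implicits._ in scope.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, sum, when}

object SumAggregatedColumns {
  val activityRange: List[(Int, Int)] = List((1, 3), (4, 12))

  // Per-label aggregation, mirroring the question's df3.
  def rangeCounts(df: DataFrame): DataFrame = {
    val exprs: List[Column] = activityRange.map { case (x, y) =>
      sum(when(col("activityLabel").between(x, y), 0).otherwise(1)).alias(s"${x}_${y}")
    }
    df.groupBy(col("activityLabel")).agg(exprs.head, exprs.tail: _*)
  }

  // Sum each aggregated column by name and return name -> total.
  // The totals are read as Long because sum over integer values yields a Long column.
  def columnTotals(df3: DataFrame): Map[String, Long] = {
    val names = activityRange.map { case (x, y) => s"${x}_${y}" }
    val sums = names.map(n => sum(col(n)))
    val row = df3.agg(sums.head, sums.tail: _*).first()
    names.zipWithIndex.map { case (n, i) => n -> row.getLong(i) }.toMap
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, for illustration only.
    val spark = SparkSession.builder().master("local[1]").appName("sum-columns").getOrCreate()
    import spark.implicits._

    // Hypothetical sample data; only the column name matches the question.
    val df = Seq(1, 1, 2, 5, 12).toDF("activityLabel")
    println(columnTotals(rangeCounts(df)))

    spark.stop()
  }
}
```

Note that getLong would fail on a null total, which can happen if df3 is empty, so a production version would probably want to guard against that case.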

@user8371915 Thanks for correcting me about the return type (Long, not Int).


 