Spark DataFrame - GroupBy aggregation
I have a DataFrame in which I need to aggregate one column, grouping by all of the remaining columns. I do not want to list every one of those columns, comma separated, in groupBy, because there are about 30 of them. Could somebody tell me how to do this in a more readable way?
Right now I am doing:
df.groupBy("c1","c2","c3","c4","c5","c6","c7","c8","c9","c10",....).agg(c11)
Is there a better way?
Thanks, John
Specifying the columns explicitly is the clean way to do it, but you have a few other options.
One of them is to use Spark SQL and build the query string programmatically.
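A minimal sketch of the Spark SQL route, assuming a local SparkSession and a temp view. The names `GroupBySqlString`, `buildQuery`, the view name `people`, and the `avg_` alias prefix are all made up for illustration and are not from the question:

```scala
import org.apache.spark.sql.SparkSession

object GroupBySqlString {
  // Builds a GROUP BY query over every column except aggCol.
  // Column names are wrapped in backticks so unusual names still parse.
  def buildQuery(allCols: Seq[String], aggCol: String, table: String): String = {
    val groupCols = allCols.filterNot(_ == aggCol).map(c => s"`$c`").mkString(", ")
    s"SELECT $groupCols, AVG(`$aggCol`) AS avg_${aggCol} FROM $table GROUP BY $groupCols"
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("groupby-sql").getOrCreate()
    import spark.implicits._

    // Hypothetical sample data; the real DataFrame would have about 30 columns.
    val df = Seq((101, "abc", 24), (101, "abc", 26), (102, "cde", 24)).toDF("id", "name", "age")
    df.createOrReplaceTempView("people")

    val query = buildQuery(df.columns.toSeq, "age", "people")
    println(query)
    spark.sql(query).show()

    spark.stop()
  }
}
```

The query text is derived from df.columns, so the 30 names never have to be typed out. The trade-off is that mistakes in the generated string only show up at runtime.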
Another option is to use the varargs expansion ( : _* ) on a list of columns, like this:
// cols must be a Seq[Column] here; for a Seq[String], use groupBy(cols.head, cols.tail: _*)
val cols = ...
df.groupBy(cols: _*).agg(...)
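To make the varargs option concrete, here is a sketch showing both groupBy overloads side by side. groupBy has one overload taking Column* and another taking a first String followed by String*, which is why a plain list of names needs head and tail. The object name `GroupByVarargs`, the helper names, and the sample data are assumptions for illustration:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{avg, col}

object GroupByVarargs {
  // Column overload: convert the names to Columns, then expand with : _*
  def byColumns(df: DataFrame, names: Seq[String], aggCol: String): DataFrame =
    df.groupBy(names.map(col): _*).agg(avg(aggCol))

  // String overload: the first name is a separate parameter, the rest are varargs
  def byNames(df: DataFrame, names: Seq[String], aggCol: String): DataFrame =
    df.groupBy(names.head, names.tail: _*).agg(avg(aggCol))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("groupby-varargs").getOrCreate()
    import spark.implicits._

    // Hypothetical sample data with two grouping columns and one value column.
    val df = Seq((1, "a", 10), (1, "a", 20), (2, "b", 30)).toDF("c1", "c2", "c3")
    val names = Seq("c1", "c2")

    byColumns(df, names, "c3").show()
    byNames(df, names, "c3").show()

    spark.stop()
  }
}
```

Both calls produce the same grouping. The String overload requires a non-empty list, since names.head throws on an empty Seq.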
Use the steps below:
1. Get the columns as a list.
2. Remove the column that needs to be aggregated from the list.
3. Apply groupBy and agg.
**Ex**:
val seq = Seq((101, "abc", 24), (102, "cde", 24), (103, "efg", 22), (104, "ghi", 21), (105, "ijk", 20), (106, "klm", 19), (107, "mno", 18), (108, "pqr", 18), (109, "rst", 26), (110, "tuv", 27), (111, "pqr", 18), (112, "rst", 28), (113, "tuv", 29))
val df = sc.parallelize(seq).toDF("id", "name", "age")
val colsList = df.columns.toList
// colsList: List[String] = List(id, name, age)
val groupByColumns = colsList.slice(0, colsList.size-1)
// groupByColumns: List[String] = List(id, name)
val aggColumn = colsList.last
// aggColumn: String = age
df.groupBy(groupByColumns.head, groupByColumns.tail:_*).agg(avg(aggColumn)).show
+---+----+--------+
| id|name|avg(age)|
+---+----+--------+
|105| ijk|    20.0|
|108| pqr|    18.0|
|112| rst|    28.0|
|104| ghi|    21.0|
|111| pqr|    18.0|
|113| tuv|    29.0|
|106| klm|    19.0|
|102| cde|    24.0|
|107| mno|    18.0|
|101| abc|    24.0|
|103| efg|    22.0|
|110| tuv|    27.0|
|109| rst|    26.0|
+---+----+--------+
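The example above relies on the aggregated column being the last one, and every id in it is unique, so each group holds a single row. As a variation, here is a sketch that drops the aggregated column by name instead of by position. The names `GroupByAllExcept` and `avgByAllExcept`, the `avg_` alias, and the smaller sample data are assumptions made for illustration:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{avg, col}

object GroupByAllExcept {
  // Groups by every column except aggCol and averages aggCol.
  def avgByAllExcept(df: DataFrame, aggCol: String): DataFrame = {
    val groupCols = df.columns.toSeq.filterNot(_ == aggCol).map(col)
    df.groupBy(groupCols: _*).agg(avg(col(aggCol)).as(s"avg_$aggCol"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("groupby-all-except").getOrCreate()
    import spark.implicits._

    // Two rows share (101, "abc"), so the grouping actually merges rows here.
    val df = Seq((101, "abc", 24), (101, "abc", 26), (102, "cde", 24)).toDF("id", "name", "age")

    avgByAllExcept(df, "age").show()

    spark.stop()
  }
}
```

Filtering by name keeps the call correct if the column order changes, which matters more with about 30 columns than with three.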