
Spark (JAVA) - dataframe groupBy with multiple aggregations?

I'm trying to write a groupBy on Spark with JAVA. In SQL this would look like:

SELECT id, count(id) as count, max(date) maxdate
FROM table
GROUP BY id;

But what is the Spark/JAVA style equivalent of this query? Let's say the variable table is a dataframe, to see the relation to the SQL query. I'm thinking something like:

table = table.select(table.col("id"), (table.col("id").count()).as("count"), (table.col("date").max()).as("maxdate")).groupby("id")

Which is obviously incorrect, since you can't use aggregate functions like .count or .max on columns, only dataframes. So how is this done in Spark JAVA?

Thank you!

You could do this with org.apache.spark.sql.functions:

import org.apache.spark.sql.functions;

table.groupBy("id").agg(
    functions.count("id").as("count"),
    functions.max("date").as("maxdate")
).show();
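For context, the snippet above can be expanded into a complete, runnable sketch. The class name GroupByExample, the Record bean, and the sample rows are hypothetical, added only for illustration; local[*] mode assumes Spark is available on the classpath.

```java
import java.util.Arrays;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.count;
import static org.apache.spark.sql.functions.max;

public class GroupByExample {

    // Simple bean standing in for a row of "table" (hypothetical schema: id, date)
    public static class Record implements java.io.Serializable {
        private int id;
        private String date;
        public Record() {}
        public Record(int id, String date) { this.id = id; this.date = date; }
        public int getId() { return id; }
        public String getDate() { return date; }
        public void setId(int id) { this.id = id; }
        public void setDate(String date) { this.date = date; }
    }

    // The aggregation from the answer: one group per id, with a row count
    // and the latest date, mirroring the SQL query in the question.
    public static Dataset<Row> aggregate(Dataset<Row> table) {
        return table.groupBy("id").agg(
                count("id").as("count"),
                max("date").as("maxdate"));
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("GroupByExample")
                .master("local[*]")   // local mode, for trying this out only
                .getOrCreate();

        // Hypothetical sample data standing in for "table"
        Dataset<Row> table = spark.createDataFrame(Arrays.asList(
                new Record(1, "2020-01-01"),
                new Record(1, "2020-03-15"),
                new Record(2, "2020-02-10")), Record.class);

        aggregate(table).show();
        spark.stop();
    }
}
```

Note that count("id") counts non-null values of the column within each group, which matches the SQL count(id); max on a string date column compares lexicographically, so it only agrees with SQL max(date) when the dates are stored in a sortable format such as yyyy-MM-dd or as a proper date type.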
