
Spark (JAVA) - dataframe groupBy with multiple aggregations?

I'm trying to write a groupBy on Spark with JAVA. In SQL this would look like:

SELECT id, count(id) as count, max(date) maxdate
FROM table
GROUP BY id;

But what is the Spark/JAVA style equivalent of this query? Let's say the variable table is a dataframe, to see the relation to the SQL query. I'm thinking something like:

table = table.select(table.col("id"), (table.col("id").count()).as("count"), (table.col("date").max()).as("maxdate")).groupby("id")

Which is obviously incorrect, since you can't use aggregate functions like .count or .max on columns, only dataframes. So how is this done in Spark JAVA?

Thank you!

You could do this with org.apache.spark.sql.functions:

import org.apache.spark.sql.functions;

table.groupBy("id").agg(
    functions.count("id").as("count"),
    functions.max("date").as("maxdate")
).show();
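For context, the snippet above can be expanded into a complete, runnable sketch. The class name GroupByExample, the Record bean, and the sample rows are hypothetical, added only for illustration; local[*] mode assumes Spark is available on the classpath.

```java
import java.util.Arrays;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.count;
import static org.apache.spark.sql.functions.max;

public class GroupByExample {

    // Simple bean standing in for a row of "table" (hypothetical schema: id, date)
    public static class Record implements java.io.Serializable {
        private int id;
        private String date;
        public Record() {}
        public Record(int id, String date) { this.id = id; this.date = date; }
        public int getId() { return id; }
        public String getDate() { return date; }
        public void setId(int id) { this.id = id; }
        public void setDate(String date) { this.date = date; }
    }

    // The aggregation from the answer: one group per id, with a row count
    // and the latest date, mirroring the SQL query in the question.
    public static Dataset<Row> aggregate(Dataset<Row> table) {
        return table.groupBy("id").agg(
                count("id").as("count"),
                max("date").as("maxdate"));
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("GroupByExample")
                .master("local[*]")   // local mode, for trying this out only
                .getOrCreate();

        // Hypothetical sample data standing in for "table"
        Dataset<Row> table = spark.createDataFrame(Arrays.asList(
                new Record(1, "2020-01-01"),
                new Record(1, "2020-03-15"),
                new Record(2, "2020-02-10")), Record.class);

        aggregate(table).show();
        spark.stop();
    }
}
```

Note that count("id") counts non-null values of the column within each group, which matches the SQL count(id); max on a string date column compares lexicographically, so it only agrees with SQL max(date) when the dates are stored in a sortable format such as yyyy-MM-dd or as a proper date type.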
