Spark Scala window count max
I have the following df:

result | state | clubName
---|---|---
win | XYZ | club1
win | XYZ | club2
win | XYZ | club1
win | PQR | club3
I need, for each state, the clubName with the maximum number of wins.
val byState =Window.partitionBy("state").orderBy('state)
I tried creating a window, but it didn't help.
Expected result:

Something like this in SQL:
select temp.res
(select count(result) as res
from table
group by clubName) temp
group by state
e.g.

state | max_count_of_wins | clubName
---|---|---
XYZ | 2 | club1
You can count the wins for each club, then assign a rank to each club ordered by its win count within each state, and keep only the rows where the rank = 1.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, count, desc, row_number, when}

val df2 = df.withColumn(
  "wins",
  // count() ignores nulls, so only rows where result = "win" are counted
  count(when(col("result") === "win", 1))
    .over(Window.partitionBy("state", "clubName"))
).withColumn(
  "rn",
  row_number().over(Window.partitionBy("state").orderBy(desc("wins")))
).filter("rn = 1").selectExpr("state", "wins as max_count_of_wins", "clubName")
df2.show
+-----+-----------------+--------+
|state|max_count_of_wins|clubName|
+-----+-----------------+--------+
| PQR| 1| club3|
| XYZ| 2| club1|
+-----+-----------------+--------+
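To make the intended aggregation concrete, the same logic can be sketched on plain Scala collections, without Spark: count wins per (state, club), then pick the club with the highest count per state. `WinRow` is a hypothetical case class invented for this illustration, not Spark's `Row`.

```scala
// Minimal plain-Scala sketch of the aggregation (no Spark needed).
case class WinRow(result: String, state: String, clubName: String)

val rows = List(
  WinRow("win", "XYZ", "club1"),
  WinRow("win", "XYZ", "club2"),
  WinRow("win", "XYZ", "club1"),
  WinRow("win", "PQR", "club3")
)

// Win count per (state, clubName) pair.
val wins: Map[(String, String), Int] =
  rows.filter(_.result == "win")
      .groupBy(r => (r.state, r.clubName))
      .map { case (key, rs) => key -> rs.size }

// For each state, keep the club with the maximum win count.
val maxPerState: Map[String, (String, Int)] =
  wins.groupBy { case ((state, _), _) => state }
      .map { case (state, perClub) =>
        val ((_, club), n) = perClub.maxBy(_._2)
        state -> (club, n)
      }

println(maxPerState) // contains XYZ -> (club1,2) and PQR -> (club3,1)
```

This mirrors what the window query does: the inner `groupBy` corresponds to the count over `(state, clubName)`, and `maxBy` plays the role of `row_number ... filter rn = 1`.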
You can also use the SQL dialect in Spark SQL (see the Spark SQL documentation):
df.createOrReplaceTempView("Table1")

val result = spark.sql("""
SELECT state, nWins AS max_count_of_wins, clubName
FROM (
  SELECT state, clubName, COUNT(1) AS nWins,
         ROW_NUMBER() OVER (PARTITION BY state ORDER BY COUNT(1) DESC) AS rn
  FROM Table1
  WHERE result = 'win'
  GROUP BY state, clubName
) t
WHERE rn = 1
""")
where the temp view registered from your dataframe df is named Table1.
P.S. If you want to try it yourself (e.g. in a SQL fiddle), initialize with:
CREATE TABLE Table1
(`result` varchar(3), `state` varchar(3), `clubName` varchar(5))
;
INSERT INTO Table1
(`result`, `state`, `clubName`)
VALUES
('win', 'XYZ', 'club1'),
('win', 'XYZ', 'club2'),
('win', 'XYZ', 'club1'),
('win', 'PQR', 'club3')
;