How to aggregate on the distinct count of a column in Spark SQL and add it as a new column?
spark.sql(f"""
INSERT INTO {databaseName}.{tableName}
SELECT
'{runDate}'
, client_id
, COUNT(DISTINCT client_id) AS distinct_count_client_id
FROM df """)
So say I have a client_id column with duplicate values, and I want a column holding the aggregated distinct count of client IDs. For example, if client_id contains a, a, b, c, the distinct count is 3 and every row should carry that value. How would I achieve this in PySpark? The code above doesn't work.
You can use HAVING. Try this code:
spark.sql(f"""
INSERT INTO {databaseName}.{tableName}
SELECT
'{runDate}'
, client_id
, COUNT(*) AS client_id
group by client_id HAVING COUNT(client_id) > 1
FROM df """)
You can use the size and collect_set functions to implement the distinct count:
spark.sql(f"""
insert into {databaseName}.{tableName}
select
'{runDate}'
,client_id
,size(collect_set(client_id) over (partition by null)) as distinct_count_client_id
from df
"""
)
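For reference, a minimal sketch of the same idea with the PySpark DataFrame API instead of SQL. The sample data and the unpartitioned window are assumptions for illustration; df follows the question.

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-in for the question's df
df = spark.createDataFrame([("a",), ("a",), ("b",), ("c",)], ["client_id"])

# An unpartitioned window spans all rows, mirroring "partition by null" above
w = Window.partitionBy()

result = df.withColumn(
    "distinct_count_client_id",
    F.size(F.collect_set("client_id").over(w)),
)
result.show()
# every row carries the same distinct count, 3 for this sample

Note that Spark warns when a window has no partition key, since all rows are moved to a single partition; that is fine for small tables but can be costly at scale.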