[英]count of distinct columns using group by and calculating percentage
Trying to write a SQL query:
select indicator, count(distinct tid) as tidcount
from coa
group by indicator
Below is the output, which is correct:
indicator tidcount
M 6219
Z 411424
S 1
I 1
For the tidcounts, I need a per-row percentage in the output.
The query I am trying is below:
spark.sql("""
select
    indicator,
    count(tid) as tidcount,
    round(round(count(indicator) / sum(count(indicator)) over (), 4) * 100, 4) as PERCENTAGE_TOTALS
from coa
group by indicator
""")
indicator tidcount Percentage_total
M 6219 0.72
Z 411424 98.78
S 1 .49
I 1 .02
The expected output is:
indicator tidcount Percentage_total
M 6219 1.4
Z 411424 98.5
S 1 .0002
I 1 .0002
Please suggest what I am missing; the solution should be in spark-sql or pyspark.
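As a sanity check on the expected output, the row percentages can be derived with plain arithmetic from the distinct counts in the first table (a minimal sketch, not Spark code):

```python
# Distinct-tid counts per indicator, taken from the question's first output table.
counts = {"M": 6219, "Z": 411424, "S": 1, "I": 1}

total = sum(counts.values())  # 417645 distinct tids overall

# Each indicator's share of the grand total, as a percentage.
percentages = {k: v / total * 100 for k, v in counts.items()}

for indicator, pct in percentages.items():
    print(f"{indicator}: {pct:.4f}%")
# M ≈ 1.4891%, Z ≈ 98.5104%, S and I ≈ 0.0002%
```

These values line up with the expected output above (1.4 / 98.5 / .0002 after rounding), which confirms the percentages should be computed over distinct tids, not over raw row counts.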
spark.sql solution:
spark.sql(
"""select
indicator,
COUNT(DISTINCT tid) AS tidcount,
COUNT(DISTINCT tid) / sum(COUNT(DISTINCT tid)) over () * 100 AS PCT
from coa
group by indicator"""
)
pyspark solution (note that `F` and `Window` need to be imported):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy()
(
df
.groupby('indicator')
.agg(F.count_distinct('tid').alias('tidcount'))
.withColumn('PCT', F.col('tidcount') / F.sum('tidcount').over(w) * 100)
)
df.show()
+---------+---+
|indicator|tid|
+---------+---+
| a| 10|
| a| 25|
| a| 7|
| b| 10|
| b| 10|
| c| 25|
| c| 7|
| d| 1|
| a| 2|
| a| 3|
+---------+---+
+---------+--------+-----------------+
|indicator|tidcount| PCT|
+---------+--------+-----------------+
| d| 1|11.11111111111111|
| c| 2|22.22222222222222|
| b| 1|11.11111111111111|
| a| 5|55.55555555555556|
+---------+--------+-----------------+
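The same distinct-count-then-percentage logic can be verified without a Spark session; a plain-Python sketch over the sample data above:

```python
from collections import defaultdict

# Sample (indicator, tid) rows, copied from df.show() above.
rows = [("a", 10), ("a", 25), ("a", 7), ("b", 10), ("b", 10),
        ("c", 25), ("c", 7), ("d", 1), ("a", 2), ("a", 3)]

# Mirrors groupby('indicator') + count_distinct('tid'):
# collect the distinct tids seen for each indicator.
distinct_tids = defaultdict(set)
for indicator, tid in rows:
    distinct_tids[indicator].add(tid)
tidcount = {k: len(v) for k, v in distinct_tids.items()}

# Mirrors the empty window sum: divide each group's count by the
# sum of tidcount over all groups to get its percentage share.
total = sum(tidcount.values())
pct = {k: v / total * 100 for k, v in tidcount.items()}

print(tidcount)  # {'a': 5, 'b': 1, 'c': 2, 'd': 1}
print(pct)       # a ≈ 55.56, c ≈ 22.22, b and d ≈ 11.11
```

This reproduces the PCT column in the result table, including the duplicate rows (b has two identical tids, so its distinct count is 1).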