[英]count of distinct columns using group by and calculating percentage
Trying to write a SQL query:
select indicator, count(distinct tid) as tidcount
from coa
group by indicator
Below is the output, which is correct:
indicator tidcount
M 6219
Z 411424
S 1
I 1
For the tidcounts, I need a per-row percentage in the output.
The query I am trying is below:
spark.sql("""
select
    indicator,
    count(tid) as tidcount,
    round(round(count(indicator) / sum(count(indicator)) over (), 4) * 100, 4) as PERCENTAGE_TOTALS
from coa
group by indicator
""")
indicator tidcount Percentage_total
M 6219 0.72
Z 411424 98.78
S 1 .49
I 1 .02
The expected output is:
indicator tidcount Percentage_total
M 6219 1.4
Z 411424 98.5
S 1 .0002
I 1 .0002
Please suggest what I am missing; the solution should be in spark-sql or pyspark.
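As a sanity check on the expected output, the row percentages can be derived with plain arithmetic from the distinct counts in the first table (a minimal sketch, not Spark code):

```python
# Distinct-tid counts per indicator, taken from the question's first output table.
counts = {"M": 6219, "Z": 411424, "S": 1, "I": 1}

total = sum(counts.values())  # 417645 distinct tids overall

# Each indicator's share of the grand total, as a percentage.
percentages = {k: v / total * 100 for k, v in counts.items()}

for indicator, pct in percentages.items():
    print(f"{indicator}: {pct:.4f}%")
# M ≈ 1.4891%, Z ≈ 98.5104%, S and I ≈ 0.0002%
```

These values line up with the expected output above (1.4 / 98.5 / .0002 after rounding), which confirms the percentages should be computed over distinct tids, not over raw row counts.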
spark.sql solution:
spark.sql(
"""select
indicator,
COUNT(DISTINCT tid) AS tidcount,
COUNT(DISTINCT tid) / sum(COUNT(DISTINCT tid)) over () * 100 AS PCT
from coa
group by indicator"""
)
pyspark solution (note that `F` and `Window` need to be imported):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy()
(
df
.groupby('indicator')
.agg(F.count_distinct('tid').alias('tidcount'))
.withColumn('PCT', F.col('tidcount') / F.sum('tidcount').over(w) * 100)
)
df.show()
+---------+---+
|indicator|tid|
+---------+---+
| a| 10|
| a| 25|
| a| 7|
| b| 10|
| b| 10|
| c| 25|
| c| 7|
| d| 1|
| a| 2|
| a| 3|
+---------+---+
+---------+--------+-----------------+
|indicator|tidcount| PCT|
+---------+--------+-----------------+
| d| 1|11.11111111111111|
| c| 2|22.22222222222222|
| b| 1|11.11111111111111|
| a| 5|55.55555555555556|
+---------+--------+-----------------+
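The same distinct-count-then-percentage logic can be verified without a Spark session; a plain-Python sketch over the sample data above:

```python
from collections import defaultdict

# Sample (indicator, tid) rows, copied from df.show() above.
rows = [("a", 10), ("a", 25), ("a", 7), ("b", 10), ("b", 10),
        ("c", 25), ("c", 7), ("d", 1), ("a", 2), ("a", 3)]

# Mirrors groupby('indicator') + count_distinct('tid'):
# collect the distinct tids seen for each indicator.
distinct_tids = defaultdict(set)
for indicator, tid in rows:
    distinct_tids[indicator].add(tid)
tidcount = {k: len(v) for k, v in distinct_tids.items()}

# Mirrors the empty window sum: divide each group's count by the
# sum of tidcount over all groups to get its percentage share.
total = sum(tidcount.values())
pct = {k: v / total * 100 for k, v in tidcount.items()}

print(tidcount)  # {'a': 5, 'b': 1, 'c': 2, 'd': 1}
print(pct)       # a ≈ 55.56, c ≈ 22.22, b and d ≈ 11.11
```

This reproduces the PCT column in the result table, including the duplicate rows (b has two identical tids, so its distinct count is 1).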