
How to aggregate on the distinct count of a column in Spark SQL and add it as a new column?

spark.sql(f""" 
          INSERT INTO {databaseName}.{tableName} 
          SELECT 
              '{runDate}'
            , client_id
            , COUNT(DISTINCT client_id) AS distinct_count_client_id
          FROM df """) 

Say I have a client_id column with duplicate values, and I want an aggregated distinct-count column for the client IDs. How would I achieve this in PySpark? The code above doesn't work.

You can use HAVING. Try this code:

spark.sql(f""" 
      INSERT INTO {databaseName}.{tableName} 
      SELECT 
          '{runDate}'
        , client_id
        , COUNT(*) AS client_id
        group by client_id HAVING COUNT(client_id) > 1
      FROM df """) 
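
Note that all of the SQL snippets here assume df has been registered as a temporary view the SQL engine can see. A minimal setup sketch for trying them out (the database, table, and date values are assumed placeholders, not from the original post):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data mirroring the question: client_id contains duplicates.
df = spark.createDataFrame(
    [("a",), ("a",), ("b",), ("c",), ("c",)],
    ["client_id"],
)

# The SQL snippets refer to a view named `df`.
df.createOrReplaceTempView("df")

# Assumed placeholder values for the f-string variables.
databaseName, tableName, runDate = "mydb", "mytable", "2024-01-01"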

You can use the size and collect_set functions to implement a distinct count:

spark.sql(f""" 
          insert into {databaseName}.{tableName} 
          select 
              '{runDate}'
              ,client_id
              ,size(collect_set(client_id) over (partition by null)) as distinct_count_client_id
          from df
          """
)
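
For comparison, the same size(collect_set(...)) idea can be written with the PySpark DataFrame API using an unpartitioned window. This is a sketch built on the df and runDate defined in the setup above, not code from the original answers:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# An empty partitionBy() spans the whole DataFrame, like PARTITION BY NULL in the SQL.
whole_frame = Window.partitionBy()

result = df.select(
    F.lit(runDate).alias("run_date"),
    "client_id",
    # collect_set gathers unique client_ids across the window; size counts them.
    F.size(F.collect_set("client_id").over(whole_frame)).alias("distinct_count_client_id"),
)
result.show()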
