Count distinct values in one column based on other columns

Question

I have a table like the following:

app_id  supplier_reached    creation_date   platform
10001       1            9/11/2018         iOS
10001       2            9/18/2018         iOS
10002       1            5/16/2018       android
10003       1            5/6/2018        android
10004       1            10/1/2018       android
10004       1            2/3/2018        android
10004       2            2/2/2018           web
10005       4            1/5/2018           web
10005       2            5/1/2018        android
10006       3            10/1/2018         iOS
10005       4            1/1/2018          iOS

The goal is to find the number of unique app_id values submitted per month.

If I simply do a count(distinct app_id), I get the following result:

Group by month  count(app number)
     Jan              1
     Feb              1
     may              3
  september           1
   october            2

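However, an app is also considered unique based on a combination of other fields. For example, in January the app_id is the same, but the combination of app_id, supplier_reached, and platform shows different values, so the app_id should be counted twice. Following the same pattern, the desired result is: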

Group by month  Desired answer
     Jan              2
     Feb              2
     may              3
   september          2
    october           2
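
To make the two counting rules concrete, here is a minimal Python sketch that recomputes both tables from the sample rows. The names `rows` and `count_per_month` are made up for illustration, and the month is taken from creation_date on the assumption that the dates are month/day/year.

```python
from collections import defaultdict
from datetime import datetime

# Sample rows from the question: (app_id, supplier_reached, creation_date, platform)
rows = [
    (10001, 1, "9/11/2018", "iOS"),
    (10001, 2, "9/18/2018", "iOS"),
    (10002, 1, "5/16/2018", "android"),
    (10003, 1, "5/6/2018", "android"),
    (10004, 1, "10/1/2018", "android"),
    (10004, 1, "2/3/2018", "android"),
    (10004, 2, "2/2/2018", "web"),
    (10005, 4, "1/5/2018", "web"),
    (10005, 2, "5/1/2018", "android"),
    (10006, 3, "10/1/2018", "iOS"),
    (10005, 4, "1/1/2018", "iOS"),
]


def count_per_month(rows, key):
    """Count distinct key(row) values per calendar month of creation_date."""
    seen = defaultdict(set)
    for row in rows:
        # Assumption: dates are month/day/year, as in the sample table.
        month = datetime.strptime(row[2], "%m/%d/%Y").month
        seen[month].add(key(row))
    return {month: len(values) for month, values in sorted(seen.items())}


# Distinct app_id only (what count(distinct app_id) gives).
by_app_id = count_per_month(rows, key=lambda r: r[0])
# Distinct (app_id, supplier_reached, platform) combinations (the desired answer).
by_combination = count_per_month(rows, key=lambda r: (r[0], r[1], r[3]))

print(by_app_id)       # {1: 1, 2: 1, 5: 3, 9: 1, 10: 2}
print(by_combination)  # {1: 2, 2: 2, 5: 3, 9: 2, 10: 2}
```

The only difference between the two results is the key used for deduplication, which is exactly what the SQL needs to express.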

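Finally, the table can have many other columns, which may or may not contribute to the uniqueness of an app.

Is there a way to do this type of count in SQL?

I am using Redshift.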

Answer 1

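As mentioned, count(distinct ...) does not work with multiple fields in Redshift.

You can first group by the columns that should be unique, and then count the records like this: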

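select month,count(1) as app_number
from (
    -- one row per distinct (month, app_id, supplier_reached, platform);
    -- month is assumed to be already derived from creation_date
    select month,app_id,supplier_reached,platform
    from your_table
    group by 1,2,3,4
) t
group by 1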
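
As a rough check of this approach, the sketch below runs the same two-level grouping against the question's sample rows. It uses Python's built-in sqlite3 module as a stand-in for Redshift, which is an assumption made only so the example is runnable anywhere. The month column is precomputed from creation_date (assumed to be month/day/year), and column names are spelled out in place of the positional group by.

```python
import sqlite3
from datetime import datetime

# Sample rows from the question: (app_id, supplier_reached, creation_date, platform)
rows = [
    (10001, 1, "9/11/2018", "iOS"),
    (10001, 2, "9/18/2018", "iOS"),
    (10002, 1, "5/16/2018", "android"),
    (10003, 1, "5/6/2018", "android"),
    (10004, 1, "10/1/2018", "android"),
    (10004, 1, "2/3/2018", "android"),
    (10004, 2, "2/2/2018", "web"),
    (10005, 4, "1/5/2018", "web"),
    (10005, 2, "5/1/2018", "android"),
    (10006, 3, "10/1/2018", "iOS"),
    (10005, 4, "1/1/2018", "iOS"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table your_table "
    "(app_id integer, supplier_reached integer, month integer, platform text)"
)
# Assumption: month is derived up front from a month/day/year date string.
conn.executemany(
    "insert into your_table values (?, ?, ?, ?)",
    [(a, s, datetime.strptime(d, "%m/%d/%Y").month, p) for a, s, d, p in rows],
)

query = """
select month, count(1) as app_number
from (
    -- inner query: one row per distinct combination
    select month, app_id, supplier_reached, platform
    from your_table
    group by month, app_id, supplier_reached, platform
) t
group by month
order by month
"""
result = conn.execute(query).fetchall()
print(result)  # [(1, 2), (2, 2), (5, 3), (9, 2), (10, 2)]
```

The counts match the desired answer in the question for every month.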

Answer 2

I don't think Postgres or Redshift supports COUNT(DISTINCT) with multiple arguments. One workaround is to use concatenation:

count(distinct app_id || ':' || supplier_reached || ':' || platform)

Answer 3

Your goal is stated incorrectly.

You don't want

to find the unique number of app_id submitted per month

You want

to find the unique number of app_id + supplier_reached + platform submitted per month.

So you need to use either a) a combination of columns, such as count(distinct col1||col2||col3), or b)

select t1.month, count(*)
from
  (select distinct
         app_id,
         supplier_reached,
         platform,
         month
   from sometable) t1
group by t1.month

Answer 4

Actually, you can conveniently count distinct ROW values in Postgres:

SELECT month, count(DISTINCT (app_id, supplier_reached, platform)) AS dist_apps
FROM   tbl
GROUP  BY 1;

The ROW keyword is just noise here:

count(DISTINCT ROW(app_id, supplier_reached, platform))

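I would discourage concatenating columns for this purpose. It is relatively expensive, error-prone (consider different data types and locale-dependent text representations), and introduces corner-case errors if the separator used can be contained in column values.

Alas, Redshift does not support this: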
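
The separator problem can be reproduced with a tiny example. The sketch below again uses Python's sqlite3 module purely as a convenient SQL engine (an assumption, not Redshift or Postgres), with a hypothetical table `t` and made-up values chosen so that the column values contain the separator.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table t (a text, b text)")
# Hypothetical values: two different rows whose parts contain the separator ':'.
conn.executemany("insert into t values (?, ?)", [("a:b", "c"), ("a", "b:c")])

# Concatenation: both rows become the string 'a:b:c', so they count as one.
concat_count = conn.execute(
    "select count(distinct a || ':' || b) from t"
).fetchone()[0]

# Deduplicating the columns themselves keeps the two rows apart.
row_count = conn.execute(
    "select count(*) from (select distinct a, b from t) d"
).fetchone()[0]

print(concat_count, row_count)  # 1 2
```

Whether this matters in practice depends on whether the chosen separator can ever appear in the data.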

 ... Value expressions Subscripted expressions Array constructors Row constructors ...

Answer 1
1 2018-10-03 22:20:03

Answer 2
0 2018-10-03 21:09:31

Answer 3
0 2018-10-03 21:49:55

Answer 4
0 2018-10-03 22:11:08
