Cumulative distinct count with Spark SQL

Using Spark 1.6.2.

Here is the data:

day | visitorID
-------------
1   | A
1   | B
2   | A
2   | C
3   | A
4   | A

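For each day, I want to count how many distinct visitors have been seen up to and including that day (all previous days plus the current one, cumulatively). Sorry, I don't know the exact term for this.

This should give: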

day | visitors
--------------
 1  | 2 (A+B)
 2  | 3 (A+B+C)
 3  | 3 
 4  | 3
  • I tried a self join, but it was far too slow.
  • I'm sure window functions are what I'm looking for, but I haven't managed to work it out :/

You should be able to do this, provided your engine accepts count(distinct ...) as a window function:

select day, max(visitors) as visitors
from (select day,
             count(distinct visitorId) over (order by day) as visitors
      from t
     ) d
group by day;

Actually, I think a better approach is to record each visitor only on the first day they appear:

select startday, sum(count(*)) over (order by startday) as visitors
from (select visitorId, min(day) as startday
      from t
      group by visitorId
     ) t
group by startday
order by startday;
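
To make the logic of that query easier to follow, here is a minimal plain-Python sketch of the same idea, run on the sample data from the question. The names rows, first_day, new_per_day and cumulative are made up for illustration; this is not Spark code, only a model of what the inner and outer queries compute.

```python
from collections import Counter
from itertools import accumulate

# Sample rows from the question: (day, visitorID)
rows = [(1, "A"), (1, "B"), (2, "A"), (2, "C"), (3, "A"), (4, "A")]

# Inner query: min(day) per visitorId, i.e. the first day each visitor appears
first_day = {}
for day, visitor in rows:
    first_day[visitor] = min(day, first_day.get(visitor, day))

# Outer query: count(*) per startday, then a running sum ordered by startday
new_per_day = Counter(first_day.values())
start_days = sorted(new_per_day)
running = accumulate(new_per_day[d] for d in start_days)
cumulative = dict(zip(start_days, running))

print(cumulative)  # {1: 2, 2: 3}
```

Note that only days on which at least one new visitor shows up appear in the result, so days 3 and 4 from the sample are missing. If every day is needed, the result has to be joined back to the full list of days.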

In SQL, you can do it like this:

select t1.day,sum(max(t.cnt)) over(order by t1.day) as visitors
from tbl t1
left join (select minday,count(*) as cnt 
           from (select visitorID,min(day) as minday 
                 from tbl 
                 group by visitorID
                ) t 
           group by minday
          ) t 
on t1.day=t.minday
group by t1.day
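  • Get the first day each visitorID appears, using min.
  • Count the rows for each such minday found above.
  • Left join that to the original table and take the cumulative sum.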
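
The three steps above can be sketched in plain Python as follows. This is only an illustration on the question's sample data, with made-up names (rows, cnt_by_minday, visitors_by_day); the dictionary lookup with a default of 0 stands in for the left join, where days without a matching minday contribute nothing to the running total.

```python
from collections import Counter

# Sample rows from the question: (day, visitorID)
rows = [(1, "A"), (1, "B"), (2, "A"), (2, "C"), (3, "A"), (4, "A")]

# Step 1: first day each visitorID appears (min(day) group by visitorID)
first_day = {}
for day, visitor in rows:
    first_day[visitor] = min(day, first_day.get(visitor, day))

# Step 2: how many visitors have each day as their minday
cnt_by_minday = Counter(first_day.values())

# Step 3: walk every distinct day of the original table in order;
# a day with no new visitors adds 0, like an unmatched left-join row
visitors_by_day = {}
total = 0
for day in sorted({d for d, _ in rows}):
    total += cnt_by_minday.get(day, 0)
    visitors_by_day[day] = total

print(visitors_by_day)  # {1: 2, 2: 3, 3: 3, 4: 3}
```

Because the running total is driven by the days of the original table rather than by the minday values, days 3 and 4 keep the count of 3 even though nobody new arrives.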

Another approach is:

select t1.day,sum(count(t.visitorid)) over(order by t1.day) as cnt 
from tbl t1
left join (select visitorID,min(day) as minday 
           from tbl 
           group by visitorID
          ) t 
on t1.day=t.minday and t.visitorid=t1.visitorid
group by t1.day

Try this:

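select
    a.day,
    count(*) as visits,
    (
    -- count distinct visitors (not rows) seen up to and including this day
    select count(distinct b.visitorID) from your_table b
    where a.day >= b.day
    ) as cumulative
from your_table as a
group by a.day
order by 1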

