
Cumulative distinct count with Spark SQL

Using Spark 1.6.2.

Here is the data:

day | visitorID
-------------
1   | A
1   | B
2   | A
2   | C
3   | A
4   | A

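For each day, I would like to count how many distinct visitors there have been up to and including that day, i.e. a cumulative distinct count (sorry, I don't know the exact term).

This should give: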

day | visitors
--------------
 1  | 2 (A+B)
 2  | 3 (A+B+C)
 3  | 3 
 4  | 3
  • I tried a self-join, but it is far too slow
  • I'm sure a window function is what I'm looking for, but I haven't managed to work it out :/

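You should be able to do the following, although some engines (Spark SQL among them) do not accept count(distinct ...) as a window function: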

select day, max(visitors) as visitors
from (select day,
             count(distinct visitorId) over (order by day) as visitors
      from t
     ) d
group by day;

Actually, I think a better approach is to record each visitor only on the first day they appear:

select startday, sum(count(*)) over (order by startday) as visitors
from (select visitorId, min(day) as startday
      from t
      group by visitorId
     ) t
group by startday
order by startday;
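
The idea behind this query can be traced in plain Python. This is only a rough sketch of the logic, not Spark code: keep each visitor's first day, count first appearances per day, then take a running sum. The `rows` list and the helper name `cumulative_new_visitors` are assumptions made for illustration.

```python
from collections import Counter
from itertools import accumulate

# Sample data from the question: (day, visitorID) pairs.
rows = [(1, "A"), (1, "B"), (2, "A"), (2, "C"), (3, "A"), (4, "A")]

def cumulative_new_visitors(rows):
    """Running total of first-time visitors, keyed by their first day."""
    first_day = {}
    for day, visitor in rows:
        # Mirrors: min(day) as startday ... group by visitorId
        first_day[visitor] = min(day, first_day.get(visitor, day))
    # Mirrors: count(*) ... group by startday
    new_per_day = Counter(first_day.values())
    start_days = sorted(new_per_day)
    # Mirrors: sum(...) over (order by startday)
    totals = accumulate(new_per_day[d] for d in start_days)
    return list(zip(start_days, totals))

result = cumulative_new_visitors(rows)
print(result)  # [(1, 2), (2, 3)]
```

Only days on which some visitor first appears show up in the output, so days 3 and 4 are absent here. The query behaves the same way on this data; if every day must be listed, those days have to be joined back in.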

In SQL, you can do it like this.

select t1.day,sum(max(t.cnt)) over(order by t1.day) as visitors
from tbl t1
left join (select minday,count(*) as cnt 
           from (select visitorID,min(day) as minday 
                 from tbl 
                 group by visitorID
                ) t 
           group by minday
          ) t 
on t1.day=t.minday
group by t1.day
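  • Use min to get the first day on which each visitorID appears.
  • Count the rows for each such minday found above.
  • Left join that result to the original table and take the cumulative sum.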
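
The three steps above can be sketched in plain Python to see why the left join matters. This is an illustration of the logic only, not Spark code; `rows` and the function name `cumulative_distinct_by_day` are hypothetical.

```python
from collections import Counter

# Sample data from the question: (day, visitorID) pairs.
rows = [(1, "A"), (1, "B"), (2, "A"), (2, "C"), (3, "A"), (4, "A")]

def cumulative_distinct_by_day(rows):
    # Step 1: first day on which each visitor appears (min(day) per visitorID).
    minday = {}
    for day, visitor in rows:
        minday[visitor] = min(day, minday.get(visitor, day))

    # Step 2: number of visitors whose first day is each minday.
    cnt = Counter(minday.values())

    # Step 3: "left join" onto every day present in the data, treating a
    # missing match as zero new visitors, and keep a running sum.
    result = []
    running = 0
    for day in sorted({d for d, _ in rows}):
        running += cnt.get(day, 0)
        result.append((day, running))
    return result

result = cumulative_distinct_by_day(rows)
print(result)  # [(1, 2), (2, 3), (3, 3), (4, 3)]
```

Because the running sum is taken over every day in the original table, days with no new visitors (3 and 4 in the sample) still appear and carry the previous total forward.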

Another approach is

select t1.day,sum(count(t.visitorid)) over(order by t1.day) as cnt 
from tbl t1
left join (select visitorID,min(day) as minday 
           from tbl 
           group by visitorID
          ) t 
on t1.day=t.minday and t.visitorid=t1.visitorid
group by t1.day

Try this

select
    a.day,
    count(*),
    (
    -- count distinct visitors, not rows, up to and including this day
    select count(distinct b.visitorID) from your_table b
    where a.day >= b.day
    ) cumulative
from your_table as a
group by a.day
order by 1
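
For the sample data, a correlated subquery of this shape yields the expected totals when the inner count is taken over distinct visitor IDs. The sketch below runs it against an in-memory SQLite database through Python's `sqlite3` module, purely as a quick check. SQLite stands in for Spark here, which is an assumption of this example, and the table name `your_table` follows the query above.

```python
import sqlite3

# In-memory database loaded with the sample data from the question.
conn = sqlite3.connect(":memory:")
conn.execute("create table your_table (day integer, visitorID text)")
conn.executemany(
    "insert into your_table values (?, ?)",
    [(1, "A"), (1, "B"), (2, "A"), (2, "C"), (3, "A"), (4, "A")],
)

# For each day, count the distinct visitors seen on that day or earlier.
query = """
select a.day,
       (select count(distinct b.visitorID)
        from your_table b
        where b.day <= a.day) as cumulative
from your_table a
group by a.day
order by a.day
"""
result = conn.execute(query).fetchall()
conn.close()

print(result)  # [(1, 2), (2, 3), (3, 3), (4, 3)]
```

The subquery is evaluated once per day and rescans the table each time, so on large inputs it is likely to suffer from the same slowness as the self-join mentioned in the question.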
