
Cumulative distinct count with Spark SQL

Using Spark 1.6.2.

Here is the data:

day | visitorID
-------------
1   | A
1   | B
2   | A
2   | C
3   | A
4   | A

I want to count, for each day, the number of distinct visitors seen on that day or on any earlier day, i.e. a cumulative distinct count (I don't know the exact term for that, sorry).

This should give:

day | visitors
--------------
 1  | 2 (A+B)
 2  | 3 (A+B+C)
 3  | 3 
 4  | 3

  • I tried a self-join, but it is really too slow.
  • I am sure a window function is what I am looking for, but I didn't manage to find the right one.
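
For reference, the expected output can be reproduced outside SQL with a running set of visitors. This is only a minimal sketch in plain Python to pin down the semantics; the function name `cumulative_distinct` and the in-memory `rows` list are assumptions for illustration, not part of the original question.

```python
def cumulative_distinct(rows):
    """Return [(day, number of distinct visitors seen up to and including day)]."""
    # Group visitors by day first, so duplicate rows within a day do not matter.
    by_day = {}
    for day, visitor in rows:
        by_day.setdefault(day, set()).add(visitor)

    seen = set()   # every visitor seen so far
    result = []
    for day in sorted(by_day):
        seen |= by_day[day]
        result.append((day, len(seen)))
    return result


# Sample data from the question (hypothetical in-memory copy of the table).
rows = [(1, "A"), (1, "B"), (2, "A"), (2, "C"), (3, "A"), (4, "A")]
print(cumulative_distinct(rows))  # [(1, 2), (2, 3), (3, 3), (4, 3)]
```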

You should be able to do:

select day, max(visitors) as visitors
from (select day,
             count(distinct visitorId) over (order by day) as visitors
      from t
     ) d
group by day;

Actually, I think a better approach is to record a visitor only on the first day s/he appears:

select startday, sum(count(*)) over (order by startday) as visitors
from (select visitorId, min(day) as startday
      from t
      group by visitorId
     ) t
group by startday
order by startday;
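
To make the logic of the query above concrete, here is a rough Python sketch of the same steps: take each visitor's first day, count new visitors per first day, then take a running sum. The function name `first_day_running_total` and the `rows` list are made up for illustration. Note that, as far as I can tell, this only yields rows for days on which at least one new visitor appears, so days 3 and 4 from the sample data are absent.

```python
from collections import Counter
from itertools import accumulate


def first_day_running_total(rows):
    """Mimic: min(day) per visitor, count per startday, running sum."""
    # Inner query: first day on which each visitor appears.
    first_day = {}
    for day, visitor in rows:
        if visitor not in first_day or day < first_day[visitor]:
            first_day[visitor] = day

    # group by startday: number of new visitors per first day.
    new_per_day = Counter(first_day.values())

    # sum(count(*)) over (order by startday): running total of new visitors.
    days = sorted(new_per_day)
    totals = accumulate(new_per_day[d] for d in days)
    return list(zip(days, totals))


# Sample data from the question (hypothetical in-memory copy of the table).
rows = [(1, "A"), (1, "B"), (2, "A"), (2, "C"), (3, "A"), (4, "A")]
print(first_day_running_total(rows))  # [(1, 2), (2, 3)]
```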

In SQL, you could do this:

select t1.day,sum(max(t.cnt)) over(order by t1.day) as visitors
from tbl t1
left join (select minday,count(*) as cnt 
           from (select visitorID,min(day) as minday 
                 from tbl 
                 group by visitorID
                ) t 
           group by minday
          ) t 
on t1.day=t.minday
group by t1.day

  • Get the first day each visitorID appears using min.
  • Count the visitors per minday found above.
  • Left join this to your original table and take the cumulative sum, so that days without new visitors still get a row.
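
The three steps above can be sketched in plain Python as follows. This is an illustration of the idea only, under the assumption that the table fits in memory; `cumulative_with_all_days` and `rows` are hypothetical names, not anything from Spark.

```python
from collections import Counter


def cumulative_with_all_days(rows):
    """First day per visitor, count per first day, then join back to every day."""
    # Step 1: first day each visitor appears (min(day) group by visitorID).
    first_day = {}
    for day, visitor in rows:
        first_day[visitor] = min(day, first_day.get(visitor, day))

    # Step 2: number of new visitors per first day.
    new_per_day = Counter(first_day.values())

    # Step 3: "left join" onto all days in the table; days with no new
    # visitors contribute 0 to the running sum but still get a row.
    running = 0
    result = []
    for day in sorted({day for day, _ in rows}):
        running += new_per_day.get(day, 0)
        result.append((day, running))
    return result


# Sample data from the question (hypothetical in-memory copy of the table).
rows = [(1, "A"), (1, "B"), (2, "A"), (2, "C"), (3, "A"), (4, "A")]
print(cumulative_with_all_days(rows))  # [(1, 2), (2, 3), (3, 3), (4, 3)]
```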

Another approach would be:

select t1.day,sum(count(t.visitorid)) over(order by t1.day) as cnt 
from tbl t1
left join (select visitorID,min(day) as minday 
           from tbl 
           group by visitorID
          ) t 
on t1.day=t.minday and t.visitorid=t1.visitorid
group by t1.day

Try this:

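select
    a.day,
    count(distinct a.visitorID) as visitors,
    (
    -- distinct visitors seen on this day or any earlier day
    select count(distinct b.visitorID) from your_table b
    where b.day <= a.day
    ) as cumulative
from your_table as a
group by a.day
order by 1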


 