简体   繁体   中英

Sub Grouping the results using Apache Spark SQL

I have the following events table, I would like to group them smaller time buckets as specified below.

The table have to be divided into smaller sets where start and end row of the set is determined by geohash if the geohash is same then set keep including the rows until it found next geohash as different.

key time_stamp  geohash
k1  1           abcdfg
k1  5           abcdfg
k1  7           abcdf1
k1  9           abcdfg
k1  10          abcdf2
k1  12          abcdf2
k1  21          abcdf2

How can I produce the following output using Apache Spark SQL syntax

key geohash first_time  last_time   duration    num_events
k1  abcdfg  1           5           4           2
k1  abcdf1  7           7           0           1
k1  abcdfg  9           9           0           1
k1  abcdf2  10          21          11          3

Can someone help me in achieving this.

This is a kind of gaps-and-island problem. Here is one way to solve it using row_number() and aggregation:

select
    key, 
    geohash, 
    min(timestamp) first_time,
    max(timestamp) last_time,
    max(timestamp) - min(timestamp) duration,
    count(*) num_events
from (
    select
        t.*,
        row_number() over(partition by key order by timestamp) rn1,
        row_number() over(partition by key, geohash order by timestamp) rn2
    from mytable t
) t
group by 
    key,
    geohash,
    rn1 - rn2

And, just for the fun of it: you could also do this with a conditional window sum:

select
    key, 
    geohash, 
    min(timestamp) first_time,
    max(timestamp) last_time,
    max(timestamp) - min(timestamp) duration,
    count(*) num_events
from (
    select
        t.*,
        sum(case when lag_geohash = geohash then 0 else 1 end) 
            over(partition by key order by timestamp) grp
    from (
        select
            t.*,
            lag(geohash) over(partition by key order by timestamp) lag_geohash
        from mytable t
    ) t 
) t
group by 
    key,
    geohash,
    grp

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM