Sub-grouping results with Apache Spark SQL

Question

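I have the following events table, and I want to group the events into smaller time periods as described below.

The table must be split into smaller sets whose start and end rows are determined by the geohash: as long as the geohash stays the same, the set keeps absorbing rows, and it ends when the next row has a different geohash.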

key time_stamp  geohash
k1  1           abcdfg
k1  5           abcdfg
k1  7           abcdf1
k1  9           abcdfg
k1  10          abcdf2
k1  12          abcdf2
k1  21          abcdf2

How can I produce the following output using Apache Spark SQL syntax?

key geohash first_time  last_time   duration    num_events
k1  abcdfg  1           5           4           2
k1  abcdf1  7           7           0           1
k1  abcdfg  9           9           0           1
k1  abcdf2  10          21          11          3

Can someone help me achieve this?

Answer 1

This is a gaps-and-islands problem. Here is one way to solve it using row_number() and aggregation:

select
    key,
    geohash,
    min(time_stamp) first_time,
    max(time_stamp) last_time,
    max(time_stamp) - min(time_stamp) duration,
    count(*) num_events
from (
    select
        t.*,
        -- rn1 numbers all rows of a key; rn2 numbers rows within each (key, geohash)
        row_number() over(partition by key order by time_stamp) rn1,
        row_number() over(partition by key, geohash order by time_stamp) rn2
    from mytable t
) t
group by
    key,
    geohash,
    rn1 - rn2  -- constant across consecutive rows that share a geohash
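To see why grouping by rn1 - rn2 isolates each run, here is a small pure-Python trace of the same logic on the sample rows from the question. It is only an illustrative sketch, not Spark code: the function name `islands_by_row_number` and the tuple layout are assumptions made for this example.

```python
from collections import defaultdict

# Sample rows from the question: (key, time_stamp, geohash)
rows = [
    ("k1", 1, "abcdfg"),
    ("k1", 5, "abcdfg"),
    ("k1", 7, "abcdf1"),
    ("k1", 9, "abcdfg"),
    ("k1", 10, "abcdf2"),
    ("k1", 12, "abcdf2"),
    ("k1", 21, "abcdf2"),
]


def islands_by_row_number(rows):
    """Emulate the rn1 - rn2 grouping (hypothetical helper, for illustration)."""
    rn1 = defaultdict(int)  # row number per key
    rn2 = defaultdict(int)  # row number per (key, geohash)
    groups = defaultdict(list)
    # Sorting by (key, time_stamp) mirrors "partition by key order by time_stamp".
    for key, ts, geohash in sorted(rows, key=lambda r: (r[0], r[1])):
        rn1[key] += 1
        rn2[(key, geohash)] += 1
        diff = rn1[key] - rn2[(key, geohash)]
        # The difference only changes when another geohash interrupts the run.
        groups[(key, geohash, diff)].append(ts)
    result = [
        (key, geohash, min(ts), max(ts), max(ts) - min(ts), len(ts))
        for (key, geohash, _diff), ts in groups.items()
    ]
    # Explicit ordering by key and first_time, for readable output.
    return sorted(result, key=lambda r: (r[0], r[2]))


result = islands_by_row_number(rows)
for row in result:
    print(row)
```

On the sample data the differences are 0, 0, 2, 1, 4, 4, 4, so the two separate abcdfg runs land in different groups (0 and 1). The difference by itself is not guaranteed to be unique across geohashes, which is why geohash also has to appear in the group by.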

And, just for fun, you can also do this with a conditional window sum:

select
    key,
    geohash,
    min(time_stamp) first_time,
    max(time_stamp) last_time,
    max(time_stamp) - min(time_stamp) duration,
    count(*) num_events
from (
    select
        t.*,
        -- running count of geohash changes; the first row (null lag) also counts as a change
        sum(case when lag_geohash = geohash then 0 else 1 end)
            over(partition by key order by time_stamp) grp
    from (
        select
            t.*,
            -- geohash of the previous row for the same key
            lag(geohash) over(partition by key order by time_stamp) lag_geohash
        from mytable t
    ) t
) t
group by
    key,
    geohash,
    grp
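The conditional-sum variant can be traced the same way. The sketch below, again plain Python rather than Spark, marks each row whose geohash differs from the previous row of the same key and keeps a running total of those marks as the group id. The function name `islands_by_change_flag` is an assumption made for this example.

```python
from itertools import groupby

# Sample rows from the question: (key, time_stamp, geohash)
rows = [
    ("k1", 1, "abcdfg"),
    ("k1", 5, "abcdfg"),
    ("k1", 7, "abcdf1"),
    ("k1", 9, "abcdfg"),
    ("k1", 10, "abcdf2"),
    ("k1", 12, "abcdf2"),
    ("k1", 21, "abcdf2"),
]


def islands_by_change_flag(rows):
    """Emulate lag() plus a running sum of change flags (hypothetical helper)."""
    labelled = []
    prev_key, prev_geohash, grp = None, None, 0
    for key, ts, geohash in sorted(rows, key=lambda r: (r[0], r[1])):
        if key != prev_key:
            # New partition: restart the running sum; there is no previous row.
            grp, prev_geohash = 0, None
        if geohash != prev_geohash:
            # Same role as "case when lag_geohash = geohash then 0 else 1 end".
            grp += 1
        labelled.append((key, geohash, grp, ts))
        prev_key, prev_geohash = key, geohash
    result = []
    # Rows with the same (key, geohash, grp) are already adjacent and time-ordered.
    for (key, geohash, _grp), members in groupby(labelled, key=lambda r: r[:3]):
        ts = [m[3] for m in members]
        result.append((key, geohash, ts[0], ts[-1], ts[-1] - ts[0], len(ts)))
    return result


result = islands_by_change_flag(rows)
for row in result:
    print(row)
```

Here the group ids for the sample come out as 1, 1, 2, 3, 4, 4, 4. Note that neither SQL query contains an order by, so the order of the output rows is not guaranteed; adding order by key, first_time to the outer query would match the layout shown in the question.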

Answer 1 (4, accepted) 2019-12-18 14:41:12