
SQL/BigQuery: how to avoid grouping multiple, non-consecutive members of a group?

I'm running into a problem I can't seem to solve myself. I am grouping rows by location and timestamp to find the first and last timestamps of each period in which an entity remained stationary. The issue is that my current query also groups rows together when the entity returns to a location it has visited before.

In my example, an entity is at location -66.89 10.5002 at 2020-05-24 05:22:00 and then returns to that location at 2020-05-24 11:13:00. The result of the current query makes it look as if the entity stayed at that location the entire time, although the rows in between clearly show that it moved. This is a conceptual problem I really don't know how to solve in SQL. I'm doing this in BigQuery, but I remember hitting a similar wall in SQL Server.

Code:

with selection as (
select 1 as id,TIMESTAMP '2020-05-24 11:13:00' as timestamp_, 'POINT(-66.89 10.5002)' as geom
union all select
1,TIMESTAMP '2020-05-24 05:22:00','POINT(-66.89 10.5002)'
union all select
1,TIMESTAMP '2020-05-24 05:25:00','POINT(-66.8881 10.4994)'
union all select
1,TIMESTAMP '2020-05-24 09:14:00','POINT(-66.8888 10.4958)'
union all select
1,TIMESTAMP '2020-05-24 07:37:00 UTC','POINT(-66.8873 10.5)'
union all select
1, TIMESTAMP'2020-05-24 07:52:00 UTC','POINT(-66.8873 10.5)'
)

select id,timestamp_,geom,
first_value(timestamp_)
    OVER (PARTITION BY id,geom ORDER BY timestamp_ ASC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS interval_start,
last_value(timestamp_)
    OVER (PARTITION BY id,geom ORDER BY timestamp_ ASC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS interval_end,
FROM
selection order by id,timestamp_

Result. Note the interval_start and interval_end for the first and last rows:

| id | timestamp_ | geom | interval_start | interval_end |
|----|------------|------|----------------|--------------|
| 1 | 2020-05-24 05:22:00 UTC | POINT(-66.89 10.5002) | 2020-05-24 05:22:00 UTC | 2020-05-24 11:13:00 UTC |
| 1 | 2020-05-24 05:25:00 UTC | POINT(-66.8881 10.4994) | 2020-05-24 05:25:00 UTC | 2020-05-24 05:25:00 UTC |
| 1 | 2020-05-24 07:37:00 UTC | POINT(-66.8873 10.5) | 2020-05-24 07:37:00 UTC | 2020-05-24 07:52:00 UTC |
| 1 | 2020-05-24 07:52:00 UTC | POINT(-66.8873 10.5) | 2020-05-24 07:37:00 UTC | 2020-05-24 07:52:00 UTC |
| 1 | 2020-05-24 09:14:00 UTC | POINT(-66.8888 10.4958) | 2020-05-24 09:14:00 UTC | 2020-05-24 09:14:00 UTC |
| 1 | 2020-05-24 11:13:00 UTC | POINT(-66.89 10.5002) | 2020-05-24 05:22:00 UTC | 2020-05-24 11:13:00 UTC |

Desired result:

| id | timestamp_ | geom | interval_start | interval_end |
|----|------------|------|----------------|--------------|
| 1 | 2020-05-24 05:22:00 UTC | POINT(-66.89 10.5002) | 2020-05-24 05:22:00 UTC | 2020-05-24 05:22:00 UTC |
| 1 | 2020-05-24 05:25:00 UTC | POINT(-66.8881 10.4994) | 2020-05-24 05:25:00 UTC | 2020-05-24 05:25:00 UTC |
| 1 | 2020-05-24 07:37:00 UTC | POINT(-66.8873 10.5) | 2020-05-24 07:37:00 UTC | 2020-05-24 07:52:00 UTC |
| 1 | 2020-05-24 07:52:00 UTC | POINT(-66.8873 10.5) | 2020-05-24 07:37:00 UTC | 2020-05-24 07:52:00 UTC |
| 1 | 2020-05-24 09:14:00 UTC | POINT(-66.8888 10.4958) | 2020-05-24 09:14:00 UTC | 2020-05-24 09:14:00 UTC |
| 1 | 2020-05-24 11:13:00 UTC | POINT(-66.89 10.5002) | 2020-05-24 11:13:00 UTC | 2020-05-24 11:13:00 UTC |

Consider the query below:

with selection as (
  select 1 as id,TIMESTAMP '2020-05-24 11:13:00' as timestamp_, 'POINT(-66.89 10.5002)' as geom union all select
  1,TIMESTAMP '2020-05-24 05:22:00','POINT(-66.89 10.5002)' union all select
  1,TIMESTAMP '2020-05-24 05:25:00','POINT(-66.8881 10.4994)' union all select
  1,TIMESTAMP '2020-05-24 09:14:00','POINT(-66.8888 10.4958)' union all select
  1,TIMESTAMP '2020-05-24 07:37:00 UTC','POINT(-66.8873 10.5)' union all select
  1, TIMESTAMP'2020-05-24 07:52:00 UTC','POINT(-66.8873 10.5)'
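), pregrouped_selection as (
  -- grp is a running count of location changes, so each uninterrupted stay gets its own number
  select id, timestamp_, geom,
    countif(flag) over(partition by id order by timestamp_) grp
  from (
    select id, timestamp_, geom,
      -- flag is true when geom differs from the previous row's geom (false on the first row)
      geom != ifnull(lag(geom) over(partition by id order by timestamp_), geom) flag,
    from selection
  )
  order by id, timestamp_
)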
select id,timestamp_,geom,
first_value(timestamp_)
    OVER (PARTITION BY id,grp ORDER BY timestamp_ ASC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS interval_start,
last_value(timestamp_)
    OVER (PARTITION BY id,grp ORDER BY timestamp_ ASC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS interval_end,
FROM
pregrouped_selection order by id,timestamp_    

with output:

[image: query result]

As you can see, I left your original query almost 100% as is. I only replaced geom with grp inside the over() clauses and changed the FROM to read from pregrouped_selection, which calculates the group number grp.
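
If one row per stay is more useful than one row per observation, the same grp value can drive a plain GROUP BY instead of the first_value/last_value window functions. The following is only a sketch under that assumption and is not part of the original answer; the CTE names flagged and grouped and the column name moved are hypothetical, chosen for illustration.

```sql
with selection as (
  select 1 as id, TIMESTAMP '2020-05-24 11:13:00' as timestamp_, 'POINT(-66.89 10.5002)' as geom union all
  select 1, TIMESTAMP '2020-05-24 05:22:00', 'POINT(-66.89 10.5002)' union all
  select 1, TIMESTAMP '2020-05-24 05:25:00', 'POINT(-66.8881 10.4994)' union all
  select 1, TIMESTAMP '2020-05-24 09:14:00', 'POINT(-66.8888 10.4958)' union all
  select 1, TIMESTAMP '2020-05-24 07:37:00', 'POINT(-66.8873 10.5)' union all
  select 1, TIMESTAMP '2020-05-24 07:52:00', 'POINT(-66.8873 10.5)'
), flagged as (
  select id, timestamp_, geom,
    -- true when this row's location differs from the previous row's location;
    -- the first row per id has no predecessor, so it compares geom with itself
    geom != ifnull(lag(geom) over (partition by id order by timestamp_), geom) as moved
  from selection
), grouped as (
  select id, timestamp_, geom,
    -- running count of moves: rows of one uninterrupted stay share a grp value
    countif(moved) over (partition by id order by timestamp_) as grp
  from flagged
)
select
  id,
  grp,
  any_value(geom) as geom,           -- geom is constant within a grp
  min(timestamp_) as interval_start,
  max(timestamp_) as interval_end
from grouped
group by id, grp
order by id, interval_start
```

On the sample data this should return five rows, with the two visits to POINT(-66.89 10.5002) reported as separate intervals rather than merged into one.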

You can check whether there are at least two distinct values using window functions:

min(geom) over (partition by id) <> max(geom) over (partition by id) as has_moved,
