SQL/BigQuery: how to avoid grouping multiple, non-consecutive members of a group?
I'm encountering an issue I can't seem to solve myself. I am grouping rows by location and timestamp, and finding the first and last timestamps for instances where an entity remained stationary. The issue is that with my current code, SQL groups rows together when the entity returns to a location it has been at before.
In my example, an entity is at location -66.89 10.5002 at 2020-05-24 05:22:00 and then returns to that location at 2020-05-24 11:13:00. The result of the current query makes it look like the entity was in that location the entire time, although the rows in between clearly show it moved. This is a conceptual problem I really don't know how to solve in SQL. I'm doing this in BigQuery, but I remember hitting a similar wall in SQL Server.
Code:
with selection as (
select 1 as id,TIMESTAMP '2020-05-24 11:13:00' as timestamp_, 'POINT(-66.89 10.5002)' as geom
union all select
1,TIMESTAMP '2020-05-24 05:22:00','POINT(-66.89 10.5002)'
union all select
1,TIMESTAMP '2020-05-24 05:25:00','POINT(-66.8881 10.4994)'
union all select
1,TIMESTAMP '2020-05-24 09:14:00','POINT(-66.8888 10.4958)'
union all select
1,TIMESTAMP '2020-05-24 07:37:00 UTC','POINT(-66.8873 10.5)'
union all select
1, TIMESTAMP'2020-05-24 07:52:00 UTC','POINT(-66.8873 10.5)'
)
select id,timestamp_,geom,
first_value(timestamp_)
OVER (PARTITION BY id,geom ORDER BY timestamp_ ASC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS interval_start,
last_value(timestamp_)
OVER (PARTITION BY id,geom ORDER BY timestamp_ ASC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS interval_end,
FROM
selection order by id,timestamp_
Result. Note the interval_start and interval_end for the first and last rows:
id | timestamp_ | geom | interval_start | interval_end |
---|---|---|---|---|
1 | 2020-05-24 05:22:00 UTC | POINT(-66.89 10.5002) | 2020-05-24 05:22:00 UTC | 2020-05-24 11:13:00 UTC |
1 | 2020-05-24 05:25:00 UTC | POINT(-66.8881 10.4994) | 2020-05-24 05:25:00 UTC | 2020-05-24 05:25:00 UTC |
1 | 2020-05-24 07:37:00 UTC | POINT(-66.8873 10.5) | 2020-05-24 07:37:00 UTC | 2020-05-24 07:52:00 UTC |
1 | 2020-05-24 07:52:00 UTC | POINT(-66.8873 10.5) | 2020-05-24 07:37:00 UTC | 2020-05-24 07:52:00 UTC |
1 | 2020-05-24 09:14:00 UTC | POINT(-66.8888 10.4958) | 2020-05-24 09:14:00 UTC | 2020-05-24 09:14:00 UTC |
1 | 2020-05-24 11:13:00 UTC | POINT(-66.89 10.5002) | 2020-05-24 05:22:00 UTC | 2020-05-24 11:13:00 UTC |
Desired result:
id | timestamp_ | geom | interval_start | interval_end |
---|---|---|---|---|
1 | 2020-05-24 05:22:00 UTC | POINT(-66.89 10.5002) | 2020-05-24 05:22:00 UTC | 2020-05-24 05:22:00 UTC |
1 | 2020-05-24 05:25:00 UTC | POINT(-66.8881 10.4994) | 2020-05-24 05:25:00 UTC | 2020-05-24 05:25:00 UTC |
1 | 2020-05-24 07:37:00 UTC | POINT(-66.8873 10.5) | 2020-05-24 07:37:00 UTC | 2020-05-24 07:52:00 UTC |
1 | 2020-05-24 07:52:00 UTC | POINT(-66.8873 10.5) | 2020-05-24 07:37:00 UTC | 2020-05-24 07:52:00 UTC |
1 | 2020-05-24 09:14:00 UTC | POINT(-66.8888 10.4958) | 2020-05-24 09:14:00 UTC | 2020-05-24 09:14:00 UTC |
1 | 2020-05-24 11:13:00 UTC | POINT(-66.89 10.5002) | 2020-05-24 11:13:00 UTC | 2020-05-24 11:13:00 UTC |
Consider the approach below:
with selection as (
select 1 as id,TIMESTAMP '2020-05-24 11:13:00' as timestamp_, 'POINT(-66.89 10.5002)' as geom union all select
1,TIMESTAMP '2020-05-24 05:22:00','POINT(-66.89 10.5002)' union all select
1,TIMESTAMP '2020-05-24 05:25:00','POINT(-66.8881 10.4994)' union all select
1,TIMESTAMP '2020-05-24 09:14:00','POINT(-66.8888 10.4958)' union all select
1,TIMESTAMP '2020-05-24 07:37:00 UTC','POINT(-66.8873 10.5)' union all select
1, TIMESTAMP'2020-05-24 07:52:00 UTC','POINT(-66.8873 10.5)'
), pregrouped_selection as (
select id, timestamp_, geom,
countif(flag) over(partition by id order by timestamp_) grp
from (
select id, timestamp_, geom,
geom != ifnull(lag(geom) over(partition by id order by timestamp_), geom) flag,
from selection
)
order by id, timestamp_
)
select id,timestamp_,geom,
first_value(timestamp_)
OVER (PARTITION BY id,grp ORDER BY timestamp_ ASC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS interval_start,
last_value(timestamp_)
OVER (PARTITION BY id,grp ORDER BY timestamp_ ASC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS interval_end,
FROM
pregrouped_selection order by id,timestamp_
The output matches the desired result above.
As you can see, I left your original query almost 100% as is. I only replaced geom with grp inside the over() clauses and changed the FROM to read from pregrouped_selection, which calculates the group number grp.
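To make the grouping step concrete, here is a minimal sketch that exposes the intermediate flag and grp columns for the same sample data. The CTE name flagged is only illustrative (it does not appear in the query above); everything else reuses the identifiers from the answer.

```sql
-- Sketch: show how grp is derived, row by row, for the sample data.
with selection as (
  select 1 as id, timestamp '2020-05-24 11:13:00' as timestamp_, 'POINT(-66.89 10.5002)' as geom union all
  select 1, timestamp '2020-05-24 05:22:00', 'POINT(-66.89 10.5002)' union all
  select 1, timestamp '2020-05-24 05:25:00', 'POINT(-66.8881 10.4994)' union all
  select 1, timestamp '2020-05-24 09:14:00', 'POINT(-66.8888 10.4958)' union all
  select 1, timestamp '2020-05-24 07:37:00', 'POINT(-66.8873 10.5)' union all
  select 1, timestamp '2020-05-24 07:52:00', 'POINT(-66.8873 10.5)'
), flagged as (  -- "flagged" is a hypothetical name used only in this sketch
  select id, timestamp_, geom,
    -- true when the location differs from the previous row for the same id;
    -- the first row has no previous row, so ifnull() compares geom to itself (false)
    geom != ifnull(lag(geom) over (partition by id order by timestamp_), geom) as flag
  from selection
)
select id, timestamp_, geom, flag,
  -- running count of location changes: rows between two changes share one grp
  countif(flag) over (partition by id order by timestamp_) as grp
from flagged
order by id, timestamp_
-- flag in timestamp order: false, true, true, false, true, true
-- grp  in timestamp order: 0, 1, 2, 2, 3, 4
```

The second visit to POINT(-66.89 10.5002) gets grp 4 rather than 0, so partitioning by id, grp keeps the two visits apart, while the two consecutive rows at POINT(-66.8873 10.5) share grp 2.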
You can check whether there are at least two distinct values using window functions:
min(geom) over (partition by id) <> max(geom) over (partition by id) as has_moved,
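For context, this is roughly how that expression might sit in a complete query against the sample data. It is only a sketch: it assumes geom is a STRING, as in the question, since min() and max() need an orderable type. If geom were a GEOGRAPHY column, a text form such as ST_ASTEXT(geom) would probably be needed instead.

```sql
-- Sketch: flag ids that were seen at more than one distinct location.
with selection as (
  select 1 as id, timestamp '2020-05-24 11:13:00' as timestamp_, 'POINT(-66.89 10.5002)' as geom union all
  select 1, timestamp '2020-05-24 05:22:00', 'POINT(-66.89 10.5002)' union all
  select 1, timestamp '2020-05-24 05:25:00', 'POINT(-66.8881 10.4994)' union all
  select 1, timestamp '2020-05-24 09:14:00', 'POINT(-66.8888 10.4958)' union all
  select 1, timestamp '2020-05-24 07:37:00', 'POINT(-66.8873 10.5)' union all
  select 1, timestamp '2020-05-24 07:52:00', 'POINT(-66.8873 10.5)'
)
select id, timestamp_, geom,
  -- min and max differ only if the id has at least two distinct geom values
  min(geom) over (partition by id) <> max(geom) over (partition by id) as has_moved
from selection
order by id, timestamp_
```

Note that has_moved is computed per id, so it is true for every row of the sample. It tells you whether the entity moved at all, but it does not separate repeated visits to the same location into their own intervals.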