
Postgres SQL - How do I aggregate groups of multiple consecutive rows?

I have data representing tagged continuous spans in a single table with the columns <tag, start, stop>.

Example below.

I'm trying to combine multiple rows into a single row, where the condition is that they form a "continuous span". In the query below, I would like LEFT_MOST_CONTINUOUS to return the minimum v_start of a continuous span (and RIGHT_MOST_CONTINUOUS to return the maximum v_stop). Note that a tag might have more than one continuous span (each with its own v_start and v_stop values).

Input:

WITH data AS (
    SELECT *
    FROM (VALUES 
        ('a', 2, 3),
        ('a', 3, 5),
        ('a', 5, 7),
        ('a', 8, 10),
        ('a', 10, 12),
        ('a', 12, 14),
        ('b', 7, 8),
        ('b', 8, 10),
        ('b', 12, 15),
        ('c', 10, 11)
    ) AS T(tag, v_start, v_stop)
    ORDER BY tag, v_start, v_stop
)
SELECT tag,
       LEFT_MOST_CONTINUOUS(v_start) OVER (PARTITION BY tag),
       RIGHT_MOST_CONTINUOUS(v_stop) OVER (PARTITION BY tag)
FROM data
ORDER BY 1, 2, 3

I expect to get the following output:

"a" 2   7
"a" 8   14
"b" 7   10
"b" 12  15
"c" 10  11

I want to merge the first 3 tuples for tag "a", which are consecutive, into a single row representing the entire span, and the same for the next 3 tuples (again for "a"). Then for "b" we can merge the first 2 but must leave out the 3rd (its v_start != the previous row's v_stop). For "c" there is nothing to merge with.

Help appreciated,

Tal

You can use a gaps-and-islands approach: mark the first record of each group, which is a record that either has no previous record for its tag or has a v_start greater than the v_stop of the previous record:

select tag, v_start, v_stop, 
         coalesce(lag(v_stop) over w < v_start, true) as is_end_grp
    from data
  window w as (partition by tag order by v_start)

Use a windowed sum() of the boolean is_end_grp cast to int (1 if true, 0 if false) to number the groups. Here mark_gaps is the result of the previous query, for example wrapped in a CTE:

  select tag, sum(is_end_grp::int) over (partition by tag 
                                             order by v_start) as grp_num,
         v_start, v_stop
    from mark_gaps

Aggregating over (tag, grp_num), with numbered_groups being the result of the previous query, produces your desired result:

select tag, min(v_start) as v_start, max(v_stop) as v_stop
  from numbered_groups
 group by tag, grp_num
 order by tag, v_start
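
The three fragments above can be assembled into one statement. What follows is a minimal sketch, assuming the steps are wrapped as CTEs: the names mark_gaps and numbered_groups come from the FROM clauses of the fragments, while the CTE wrapping and the inline sample data (copied from the question) are assumptions.

```sql
WITH data(tag, v_start, v_stop) AS (
    -- Sample rows from the question
    VALUES ('a', 2, 3), ('a', 3, 5), ('a', 5, 7),
           ('a', 8, 10), ('a', 10, 12), ('a', 12, 14),
           ('b', 7, 8), ('b', 8, 10), ('b', 12, 15),
           ('c', 10, 11)
),
mark_gaps AS (
    -- is_end_grp is true on the first row of each span:
    -- no previous row for the tag, or a gap before this row
    SELECT tag, v_start, v_stop,
           coalesce(lag(v_stop) OVER w < v_start, true) AS is_end_grp
    FROM data
    WINDOW w AS (PARTITION BY tag ORDER BY v_start)
),
numbered_groups AS (
    -- Running count of span starts gives each span a number within its tag
    SELECT tag, v_start, v_stop,
           sum(is_end_grp::int) OVER (PARTITION BY tag ORDER BY v_start) AS grp_num
    FROM mark_gaps
)
SELECT tag, min(v_start) AS v_start, max(v_stop) AS v_stop
FROM numbered_groups
GROUP BY tag, grp_num
ORDER BY tag, v_start;
```

On the sample data this should return the five rows listed in the question: (a, 2, 7), (a, 8, 14), (b, 7, 10), (b, 12, 15), (c, 10, 11).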

Working DB<>Fiddle

This uses the numbered_groups logic from @Mike Organek's answer; I just started from a different place.

WITH data AS (
    SELECT *
    , case when lead(v_start) over(partition by tag order by v_start) = v_stop then 0 else 1 end stopcheck
    , case when lag(v_stop) over(partition by tag order by v_stop) = v_start then 0 else 1 end startcheck
    FROM (VALUES 
        ('a' , 2 , 3),
        ('a', 3, 5),
        ('a', 5, 7),
        ('a', 8, 10),
        ('a', 10, 12),
        ('a', 12, 14),
        ('b', 7, 8),
        ('b', 8, 10),
        ('b', 12, 15),
        ('c', 10, 11)
    ) AS T(tag, v_start, v_stop)
    ORDER BY tag, v_start, v_stop
)
,cnt as (
  select *
  , sum(startcheck) over (partition by tag order by v_start) grpn 
  from data)
select c1.tag, c1.v_start, c2.v_stop
from cnt c1 
inner join cnt c2 
  on c1.tag = c2.tag and c1.grpn = c2.grpn
  where c1.startcheck = 1 and c2.stopcheck = 1

This logic is all based on the assumption that, within a span, each row starts exactly where the previous row left off, that there is no overlap, etc.
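
If that assumption does not hold and spans for a tag can overlap or nest, the equality checks no longer find the span boundaries. One possible variant, sketched here and not part of the original answer, compares each v_start with the largest v_stop seen so far for the tag. The sample rows and the names marked, numbered and prev_max_stop are hypothetical.

```sql
WITH data(tag, v_start, v_stop) AS (
    -- Hypothetical rows with overlap: (2, 3) is nested inside (1, 10)
    VALUES ('a', 1, 10), ('a', 2, 3), ('a', 9, 12), ('a', 20, 25)
),
marked AS (
    -- Largest v_stop among all earlier rows of the same tag
    SELECT *,
           max(v_stop) OVER (PARTITION BY tag
                             ORDER BY v_start, v_stop
                             ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS prev_max_stop
    FROM data
),
numbered AS (
    -- A row starts a new span when nothing before it reaches its v_start
    -- (prev_max_stop is NULL on the first row, so the ELSE branch applies)
    SELECT *,
           sum(CASE WHEN prev_max_stop >= v_start THEN 0 ELSE 1 END)
               OVER (PARTITION BY tag ORDER BY v_start, v_stop) AS grpn
    FROM marked
)
SELECT tag, min(v_start) AS v_start, max(v_stop) AS v_stop
FROM numbered
GROUP BY tag, grpn
ORDER BY tag, v_start;
```

With these rows the first three collapse into one span (1 to 12) and the last row stays on its own (20 to 25).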

Create a startcheck and a stopcheck by comparing each row with the prior row and the next row, respectively. From there, use another window function, sum() over, to number the start records (so we don't match the start of the second batch to the stop of the first batch).

Join the table to itself, matching on tag and group, and filter to the start and stop records.

You can use the following query:

WITH data AS (
    SELECT *
    FROM (VALUES 
        ('a', 2, 3),
        ('a', 3, 5),
        ('a', 5, 7),
        ('a', 8, 10),
        ('a', 10, 12),
        ('a', 12, 14),
        ('b', 7, 8),
        ('b', 8, 10),
        ('b', 12, 15),
        ('c', 10, 11)
    ) AS T(tag, v_start, v_stop)
    ORDER BY tag, v_start, v_stop
),
cte1 as(
   select *,
      case
        when lag(v_stop) over (partition by tag order by v_start) = v_start
          then 0
          else 1
      end as grp
  from data
),
cte2 as(
  select *, 
         sum(grp) over (partition by tag  order by v_start) as rnk
  from cte1
)

select tag,min(v_start)v_start,max(v_stop)v_stop
from cte2
group by tag,rnk
order by tag

Demo in db<>fiddle
