
Using LEAD with condition(?) - snowflake

I have 10k+ IDs and, for each one of them, there are 10 zones; each zone can be affected in some way.

I want to count the time duration that each zone was affected for each ID, ordered by day (considering last week as a whole).

To know if/when a zone was affected, the column AFFECTED_ZONE will return a value from 1 to 10 (identifying which zone it was).

I know the zone was normalized once the next row within AFFECTED_ZONE is 0.

So, for example, it looks a little like this:

| DATE | ID | AFFECTED_ZONE |
| --- | --- | --- |
| 2022-12-21 15:00:00 | 1 | 1 |
| 2022-12-21 15:03:00 | 1 | 0 |
| 2022-12-21 15:15:00 | 1 | 3 |
| 2022-12-21 15:25:00 | 1 | 0 |
| 2022-12-21 16:00:00 | 1 | 0 |
| 2022-12-21 16:43:00 | 1 | 4 |
| 2022-12-21 17:00:00 | 1 | 0 |

In this case, zone 1 of ID 1 was affected at 15:00:00 and was normalized at 15:03:00, so the overall affected time should be 3 min; same thing with zone 4 in this example (affected at 16:43:00 and normalized at 17:00:00, so the overall affected time should be 17 min).

For zone 3, the affectation happened at 15:15:00 and was normalized at 15:25:00 (the first 0); there was another 0 at a later time that we do not consider. The overall affected time should be 10 min.

The problem is that, sometimes, it can look like this:

| DATE | ID | AFFECTED_ZONE |
| --- | --- | --- |
| 2022-12-21 15:00:00 | 1 | 1 |
| 2022-12-21 15:03:00 | 1 | 1 |
| 2022-12-21 15:15:00 | 1 | 0 |
| 2022-12-21 15:25:00 | 1 | 6 |
| 2022-12-21 16:00:00 | 1 | 4 |
| 2022-12-21 16:43:00 | 1 | 3 |
| 2022-12-21 17:00:00 | 1 | 0 |

In this case, zone 1 of ID 1 was affected at 15:00:00 and was normalized at 15:15:00; the 1 that shows up again at 15:03:00 should be disregarded, since the same zone had already been affected since 15:00:00. The overall affected time should be 15 min.

After this, zones 6, 4 and 3 were affected in a row, and normalization only came at 17:00:00; the overall affected times for each zone, respectively, should be 95 min, 60 min and 17 min.

I can't figure this second part out. At first, I separated the dates of each event (affectation and normalization) like this:

case when affected_zone <> 0 then date end as affected_at,
case when affected_zone = 0 then date end as normal_at

Then, I added a LEAD() function so that I could subtract the AFFECTED_AT date from the NORMAL_AT date and thus find the overall affected time, like this:

datediff(minutes, affected_at, lead(normal_at) over (partition by id order by date)) as lead
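Put together, the query I have so far looks roughly like this (a sketch; `your_table` stands in for the real table name):

```sql
-- Sketch of the current attempt (your_table is a placeholder).
select date,
       id,
       affected_zone,
       affected_at,
       normal_at,
       datediff(minutes, affected_at,
                lead(normal_at) over (partition by id order by date)) as lead_minutes
  from (select date,
               id,
               affected_zone,
               case when affected_zone <> 0 then date end as affected_at,
               case when affected_zone  = 0 then date end as normal_at
          from your_table) t
 order by id, date
```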

It works just fine for the first scenario:

| DATE | ID | AFFECTED_ZONE | AFFECTED_AT | NORMAL_AT | LEAD |
| --- | --- | --- | --- | --- | --- |
| 2022-12-21 15:00:00 | 1 | 1 | 2022-12-21 15:00:00 | null | 3 |
| 2022-12-21 15:03:00 | 1 | 0 | null | 2022-12-21 15:03:00 | null |
| 2022-12-21 15:15:00 | 1 | 3 | 2022-12-21 15:15:00 | null | 10 |
| 2022-12-21 15:25:00 | 1 | 0 | null | 2022-12-21 15:25:00 | null |
| 2022-12-21 16:00:00 | 1 | 0 | null | 2022-12-21 16:00:00 | null |
| 2022-12-21 16:43:00 | 1 | 4 | 2022-12-21 16:43:00 | null | 17 |
| 2022-12-21 17:00:00 | 1 | 0 | null | 2022-12-21 17:00:00 | null |

However, for the second one, LEAD() only considers the last row in which the AFFECTED_AT column is not null, disregarding the other ones, like this:

| DATE | ID | AFFECTED_ZONE | AFFECTED_AT | NORMAL_AT | LEAD |
| --- | --- | --- | --- | --- | --- |
| 2022-12-21 15:00:00 | 1 | 1 | 2022-12-21 15:00:00 | null | null |
| 2022-12-21 15:03:00 | 1 | 1 | 2022-12-21 15:03:00 | null | 12 |
| 2022-12-21 15:15:00 | 1 | 0 | null | 2022-12-21 15:15:00 | null |
| 2022-12-21 15:25:00 | 1 | 6 | 2022-12-21 15:25:00 | null | null |
| 2022-12-21 16:00:00 | 1 | 4 | 2022-12-21 16:00:00 | null | null |
| 2022-12-21 16:43:00 | 1 | 3 | 2022-12-21 16:43:00 | null | 17 |
| 2022-12-21 17:00:00 | 1 | 0 | null | 2022-12-21 17:00:00 | null |

I could ignore nulls with the LEAD() function, and it would work well for the cases in which different zones come one after the other, but it wouldn't work in cases in which the same zone repeats itself, as I would be adding unnecessary time, for example:

| DATE | ID | AFFECTED_ZONE | AFFECTED_AT | NORMAL_AT | LEAD |
| --- | --- | --- | --- | --- | --- |
| 2022-12-21 15:00:00 | 1 | 1 | 2022-12-21 15:00:00 | null | 15 |
| 2022-12-21 15:03:00 | 1 | 1 | 2022-12-21 15:03:00 | null | 12 |
| 2022-12-21 15:15:00 | 1 | 0 | null | 2022-12-21 15:15:00 | null |
| 2022-12-21 15:25:00 | 1 | 6 | 2022-12-21 15:25:00 | null | 95 |
| 2022-12-21 16:00:00 | 1 | 4 | 2022-12-21 16:00:00 | null | 60 |
| 2022-12-21 16:43:00 | 1 | 3 | 2022-12-21 16:43:00 | null | 17 |
| 2022-12-21 17:00:00 | 1 | 0 | null | 2022-12-21 17:00:00 | null |

The overall affected time for zone 1 should be 15 min, but if I add everything up it would be 23 min.

Any ideas on how to solve this? I'm no expert on Snowflake/SQL (quite the contrary), so I would much appreciate it!

I can think of two possible approaches; the second is probably the best, but I'll let you decide:

1 - Remove Extra Records

Assuming, based on your question, that an ID can only affect an AFFECTED_ZONE once (each occurrence possibly including multiple records), i.e.

| DATE | ID | AFFECTED_ZONE |
| --- | --- | --- |
| 2022-12-21 15:00:00 | 1 | 1 |
| 2022-12-21 15:03:00 | 1 | 1 |
| 2022-12-21 15:15:00 | 1 | 0 |
| 2022-12-21 15:25:00 | 1 | 6 |
| 2022-12-21 16:00:00 | 1 | 4 |
| 2022-12-21 16:43:00 | 1 | 3 |
| 2022-12-21 17:00:00 | 1 | 0 |

and not

| DATE | ID | AFFECTED_ZONE |
| --- | --- | --- |
| 2022-12-21 15:00:00 | 1 | 1 |
| 2022-12-21 15:03:00 | 1 | 1 |
| 2022-12-21 15:15:00 | 1 | 0 |
| 2022-12-21 15:25:00 | 1 | 1 |
| 2022-12-21 16:00:00 | 1 | 0 |
| 2022-12-21 16:43:00 | 1 | 3 |
| 2022-12-21 17:00:00 | 1 | 0 |

We could use a LAG function to find each record's previous AFFECTED_ZONE and remove rows with the same ID and AFFECTED_ZONE, while ignoring rows where AFFECTED_ZONE = 0. If you do have more than one occurrence of an ID / AFFECTED_ZONE pairing, this process would merge them together.

select foo.id,
       foo.date,
       foo.affected_zone
  from (select id,
               date,
               affected_zone,
               lag(affected_zone,1) over (partition by id
                                              order by date) prev_affected_zone
          from your_table) foo
 where ifnull(foo.affected_zone,-1) != ifnull(foo.prev_affected_zone,-1)
    or ifnull(foo.affected_zone,-1)  = 0

This approach will give you something like:

| DATE | ID | AFFECTED_ZONE |
| --- | --- | --- |
| 2022-12-21 15:00:00 | 1 | 1 |
| 2022-12-21 15:15:00 | 1 | 0 |
| 2022-12-21 15:25:00 | 1 | 6 |
| 2022-12-21 16:00:00 | 1 | 4 |
| 2022-12-21 16:43:00 | 1 | 3 |
| 2022-12-21 17:00:00 | 1 | 0 |

Allowing you to use your existing LEAD.
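For completeness, a sketch of how the de-duplicated rows could feed your existing logic; the LEAD here uses IGNORE NULLS, which is safe once consecutive duplicates of the same zone have been removed (`your_table` is a placeholder):

```sql
-- Sketch: de-duplicate repeated zones, then pair each affectation with
-- the next normalization. IGNORE NULLS is safe here because consecutive
-- duplicates of the same zone were removed in the first step.
with deduped as (
    select id, date, affected_zone
      from (select id,
                   date,
                   affected_zone,
                   lag(affected_zone,1) over (partition by id
                                                  order by date) prev_affected_zone
              from your_table) foo
     where ifnull(foo.affected_zone,-1) != ifnull(foo.prev_affected_zone,-1)
        or ifnull(foo.affected_zone,-1)  = 0
)
select *
  from (select id,
               affected_zone,
               date as affected_at,
               datediff(minutes, date,
                        lead(case when affected_zone = 0 then date end) ignore nulls
                            over (partition by id order by date)) as affected_minutes
          from deduped) t
 where affected_zone != 0
```

With the second scenario's data this would give 15, 95, 60 and 17 minutes for zones 1, 6, 4 and 3.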

2 - Use FIRST_VALUE instead of LEAD

Use your current process but replace LEAD with FIRST_VALUE.

FIRST_VALUE will select the first value in an ordered group of values, so we can ignore nulls and return the first normal_at value after our current row.


select date,
       id,
       affected_zone,
       affected_at,
       first_value(normal_at ignore nulls) over (partition by id
                                                     order by date 
                                                      rows between current row and unbounded following) normal_at
  from (select id,
               date,
               affected_zone,
               case when affected_zone != 0 then date end  affected_at,
               case when affected_zone  = 0 then date end  normal_at
          from your_table) foo

This should give you:

| DATE | ID | AFFECTED_ZONE | AFFECTED_AT | NORMAL_AT |
| --- | --- | --- | --- | --- |
| 2022-12-21 15:00:00 | 1 | 1 | 2022-12-21 15:00:00 | 2022-12-21 15:15:00 |
| 2022-12-21 15:03:00 | 1 | 1 | 2022-12-21 15:03:00 | 2022-12-21 15:15:00 |
| 2022-12-21 15:15:00 | 1 | 0 | null | 2022-12-21 15:15:00 |
| 2022-12-21 15:25:00 | 1 | 6 | 2022-12-21 15:25:00 | 2022-12-21 17:00:00 |
| 2022-12-21 16:00:00 | 1 | 4 | 2022-12-21 16:00:00 | 2022-12-21 17:00:00 |
| 2022-12-21 16:43:00 | 1 | 3 | 2022-12-21 16:43:00 | 2022-12-21 17:00:00 |
| 2022-12-21 17:00:00 | 1 | 0 | null | 2022-12-21 17:00:00 |

You can then do your duration calculation and select the first record for each ID / AFFECTED_ZONE pairing, probably with a ROW_NUMBER.
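That last step might look roughly like this (a sketch combining the pieces above; `your_table` is a placeholder, and it assumes each ID / AFFECTED_ZONE pairing occurs once, as in approach 1):

```sql
-- Sketch: duration per zone, keeping the first record of each
-- ID / AFFECTED_ZONE pairing via ROW_NUMBER (your_table is a placeholder).
select date, id, affected_zone, affected_minutes
from (
    select date, id, affected_zone,
           datediff(minutes, affected_at, normal_at) as affected_minutes,
           row_number() over (partition by id, affected_zone
                                  order by date) rn
    from (
        select date, id, affected_zone, affected_at,
               first_value(normal_at ignore nulls)
                   over (partition by id
                             order by date
                              rows between current row and unbounded following) normal_at
        from (select id,
                     date,
                     affected_zone,
                     case when affected_zone != 0 then date end affected_at,
                     case when affected_zone  = 0 then date end normal_at
              from your_table) foo
    ) f
    where affected_zone != 0
) r
where rn = 1
```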

I would approach this as a gaps-and-islands problem, which gives us a lot of flexibility to address the various use cases.

My pick would be to define groups of adjacent records that start with one or more affected zones and end with a normalization (affected_zone = 0), using window functions:

select t.*,
    sum(case when lag_affected_zone = 0 then 1 else 0 end) over(partition by id order by date) grp
from (
    select t.*,
        lag(affected_zone, 1, 0) over(partition by id order by date) lag_affected_zone
    from mytable t
) t

Starting with a mix of some of the data you provided, which I hope represents the different use cases, this returns:

| DATE | ID | AFFECTED_ZONE | lag_affected_zone | grp |
| --- | --- | --- | --- | --- |
| 2022-12-21 15:00:00.000 | 1 | 1 | 0 | 1 |
| 2022-12-21 15:03:00.000 | 1 | 1 | 1 | 1 |
| 2022-12-21 15:15:00.000 | 1 | 0 | 1 | 1 |
| 2022-12-21 15:17:00.000 | 1 | 0 | 0 | 2 |
| 2022-12-21 15:25:00.000 | 1 | 6 | 0 | 3 |
| 2022-12-21 16:00:00.000 | 1 | 4 | 6 | 3 |
| 2022-12-21 16:43:00.000 | 1 | 3 | 4 | 3 |
| 2022-12-21 16:50:00.000 | 1 | 1 | 3 | 3 |
| 2022-12-21 17:00:00.000 | 1 | 0 | 1 | 3 |

You can see how records are being grouped together to form consistent islands. Now we can work on each group: we want to bring in the earliest date of each affected zone in the group, and compare it to the latest date of the group (which corresponds to the normalization step); we can use aggregation:

select *
from (
    select id, affected_zone, min(date) affected_at, max(max(date)) over(partition by id, grp) normalized_at
    from (
        select t.*,
            sum(case when lag_affected_zone = 0 then 1 else 0 end) over(partition by id order by date) grp
        from (
            select t.*,
                lag(affected_zone, 1, 0) over(partition by id order by date) lag_affected_zone
            from mytable t
        ) t
    ) t
    group by id, affected_zone, grp
) t
where affected_zone != 0
order by id, affected_at

| id | affected_zone | affected_at | normalized_at |
| --- | --- | --- | --- |
| 1 | 1 | 2022-12-21 15:00:00.000 | 2022-12-21 15:15:00.000 |
| 1 | 6 | 2022-12-21 15:25:00.000 | 2022-12-21 17:00:00.000 |
| 1 | 4 | 2022-12-21 16:00:00.000 | 2022-12-21 17:00:00.000 |
| 1 | 3 | 2022-12-21 16:43:00.000 | 2022-12-21 17:00:00.000 |
| 1 | 1 | 2022-12-21 16:50:00.000 | 2022-12-21 17:00:00.000 |
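To finish, the durations the question asks for are one datediff away; a sketch of the full query (note the running max here is partitioned by id as well as grp, so it also behaves with more than one ID):

```sql
-- Sketch: per-zone affected duration from the gaps-and-islands grouping.
select id,
       affected_zone,
       affected_at,
       normalized_at,
       datediff(minutes, affected_at, normalized_at) as affected_minutes
from (
    select id, affected_zone, min(date) affected_at,
           max(max(date)) over(partition by id, grp) normalized_at
    from (
        select t.*,
            sum(case when lag_affected_zone = 0 then 1 else 0 end)
                over(partition by id order by date) grp
        from (
            select t.*,
                lag(affected_zone, 1, 0) over(partition by id order by date) lag_affected_zone
            from mytable t
        ) t
    ) t
    group by id, affected_zone, grp
) t
where affected_zone != 0
order by id, affected_at
```

With the sample rows above this yields 15, 95, 60, 17 and 10 minutes respectively.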

Here is a demo on DB Fiddle: this is SQL Server, but it uses standard SQL that Snowflake supports as well.
