
SQL: Pull rows based on sequence of values

I need to pull rows of data based on certain values occurring in a specific sequence.

Here's an example of the data:

Header     EventId   EventDate
67891882   382       2022-01-21 09:29:50.000
67891882   81        2022-01-21 09:03:23.000
67891882   273       2022-01-21 09:03:51.000
67891882   77        2022-01-21 09:05:58.000
67891882   2         2022-01-21 09:29:48.000

The result I need is the Header and the EventDate for EventId = 81. Further criteria include:

  • EventID 81 is the "start" and EventID 77 is the "end"
  • Any number of other events can exist between these two, with the exception of (60, 72, 73, 74, 75, 76, 83, 85, 86, 87, 103, 154, 166, 197, 199)

So in the example above, EventId 81 with EventDate 2022-01-21 09:03:23.000 would qualify as a row I want to pull, because 273 (the only event between the 81 and the 77) is not in the exception list.

ATTEMPT: I have tried the following query:

SELECT *
FROM #Table
WHERE EventDate BETWEEN (SELECT EventDate
                         FROM #Table
                         WHERE EventId = 81)
                    AND (SELECT eventdate
                         FROM #Table
                         WHERE EventId = 77)
    AND EventId NOT IN (60, 72, 73, 74, 75, 76, 83, 85, 86, 87, 103, 154, 166, 197, 199)
ORDER BY 3

But I was immediately confronted with the fact that my sub-queries return more than one row, so this won't work (I was using it to test a single Header # example, which worked fine). So now I'm not quite sure how to proceed. I'd hate to think that I'd be forced to use a CURSOR, mostly because my source data consists of 266 million rows.

I had also previously tried using the LAG() function to find my "starting point", but that possibility seemed to dissipate once the request became more and more complex (with the addition of the exclusion list, as well as the fact that there could be 1 or 40 rows in between the 81 and 77).


How should I proceed with this? Here's some example data to play with. The Header can be thought of as a parent key, associated with any number of EventIDs (each representing a specific action), and the EventDate records when each one occurred:

create table #data (header int, eventid int, eventdate datetime)

insert into #data 
values
('62252595',    '22',   '5/23/2021  12:34:02 PM'),
('62252595',    '81',   '5/23/2021  12:34:03 PM'),
('62252595',    '29',   '5/23/2021  12:34:12 PM'),
('62252595',    '40',   '5/23/2021  12:34:27 PM'),
('62252595',    '22',   '5/23/2021  12:35:02 PM'),
('62252595',    '22',   '5/23/2021  12:36:12 PM'),
('62252595',    '37',   '5/23/2021  12:36:36 PM'),
('62252595',    '77',   '5/23/2021  12:37:04 PM'),
('62252595',    '6',    '5/23/2021  12:37:52 PM'),
('63252595',    '39',   '5/23/2021  12:38:01 PM'),
('63252595',    '81',   '5/23/2021  12:38:04 PM'),
('63252595',    '37',   '5/23/2021  12:38:06 PM'),
('63252595',    '21',   '5/23/2021  12:38:09 PM'),
('63252595',    '75',   '5/23/2021  12:38:10 PM'),
('63252595',    '77',   '5/23/2021  12:38:12 PM'),
('64252595',    '29',   '5/23/2021  12:38:15 PM'),
('64252595',    '26',   '5/23/2021  12:38:18 PM'),
('64252595',    '81',   '5/23/2021  12:38:20 PM'),
('64252595',    '40',   '5/23/2021  12:38:21 PM'),
('64252595',    '81',   '5/23/2021  12:38:24 PM'),
('64252595',    '83',   '5/23/2021  12:39:06 PM'),
('64252595',    '77',   '5/23/2021  12:39:07 PM'),
('65252595',    '41',   '5/23/2021  12:39:12 PM'),
('65252595',    '81',   '5/23/2021  12:39:16 PM'),
('65252595',    '37',   '5/23/2021  12:39:20 PM'),
('65252595',    '18',   '5/23/2021  12:39:56 PM'),
('65252595',    '18',   '5/23/2021  12:40:03 PM'),
('65252595',    '77',   '5/23/2021  12:40:15 PM'),
('65252595',    '36',   '5/23/2021  12:40:46 PM'),
('65252595',    '77',   '5/23/2021  12:40:53 PM')

EXPECTED RESULTS: From this #Data table, the results I would expect to see would be:

Header     EventId   EventDate
62252595   81        5/23/2021 12:34:03 PM
65252595   81        5/23/2021 12:39:16 PM

Header #'s 63252595 and 64252595 would not qualify because between the first instance of 81 and the first instance of 77 (partition by Header, order by EventDate), there exists a 75 at 5/23/2021 12:38:10 PM and an 83 at 5/23/2021 12:39:06 PM respectively (both of which are in the exclusion list). I hope this clears up some confusion.


EDIT: After some thinking, I wonder if it would be possible to simplify this using a CASE expression. Using the example data from the #Data table above, I wrote this query:

select *
from (
    select * from (
        select *, id=case when EventId = 81 then 1 
                            when EventId = 77 then 2
                            when EventId in (60, 72, 73, 74, 75, 76, 83, 85, 86, 87, 103, 154, 166, 197, 199) then 5 else 0 end
        from #data) a
    where id <> 0)b
    order by 3

This filters out all of the 'allowable' events, leaving only the start (id = 1), end (id = 2) and excluded (id = 5) events. An unencumbered start is then a row with id = 1 that is immediately followed by a row with id = 2. What I'm not sure of yet is how to get it to show me only the id = 1 rows that are followed by a 2.
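One possible way to express "an id = 1 row immediately followed by an id = 2 row" is to compute LEAD(id) over the already filtered rows, per Header and in EventDate order. The following is only a sketch under stated assumptions: it runs in SQLite through Python's sqlite3 module (so it can be tried without a SQL Server instance), uses a hypothetical table named data in place of #data, keeps a handful of the sample rows, and stores the time of day as text. LEAD() also exists in SQL Server 2012 and later, so the SQL should carry over with little change, though that has not been verified here.

```python
import sqlite3

# Exclusion list copied from the question.
EXCLUDED = (60, 72, 73, 74, 75, 76, 83, 85, 86, 87, 103, 154, 166, 197, 199)

# A cut-down subset of the question's sample rows: (header, eventid, time).
# Assumption: all rows share one date, so only the time of day is stored.
ROWS = [
    (62252595, 81, "12:34:03"), (62252595, 29, "12:34:12"),
    (62252595, 77, "12:37:04"),
    (63252595, 81, "12:38:04"), (63252595, 75, "12:38:10"),
    (63252595, 77, "12:38:12"),
    (64252595, 81, "12:38:20"), (64252595, 81, "12:38:24"),
    (64252595, 83, "12:39:06"), (64252595, 77, "12:39:07"),
]

conn = sqlite3.connect(":memory:")
# "data" is a hypothetical stand-in for the #data temp table.
conn.execute("CREATE TABLE data (header INTEGER, eventid INTEGER, eventdate TEXT)")
conn.executemany("INSERT INTO data VALUES (?, ?, ?)", ROWS)

excluded_sql = ", ".join(str(e) for e in EXCLUDED)
query = f"""
WITH tagged AS (
    -- Same CASE expression as in the question's EDIT.
    SELECT header, eventid, eventdate,
           CASE WHEN eventid = 81 THEN 1
                WHEN eventid = 77 THEN 2
                WHEN eventid IN ({excluded_sql}) THEN 5
                ELSE 0 END AS id
    FROM tagged_source
),
ordered AS (
    -- WHERE runs before the window function, so LEAD only sees
    -- start, end and excluded rows.
    SELECT *, LEAD(id) OVER (PARTITION BY header ORDER BY eventdate) AS next_id
    FROM tagged
    WHERE id <> 0
)
SELECT header, eventid, eventdate
FROM ordered
WHERE id = 1 AND next_id = 2
ORDER BY header, eventdate
""".replace("tagged_source", "data")

result = conn.execute(query).fetchall()
print(result)
```

With two consecutive 81s and no excluded event before the 77, only the later 81 would be returned by this approach, which may or may not match the intended "first instance of 81" rule.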

I'm going to assume that a start event (81) always starts a new "frame" from that row onwards, and an end event (77) always starts a new frame from the following row onwards.

I'm also going to assume that you're only interested in frames where both a start and an end event are present, and where the frame contains no excepted events (in the example I'll just use 00 for any allowable event and 199 as the only excepted event).

For example...

[81,00,81,00,77,00,81,199,77]
=> frame 0 = [81,00]
=> frame 1 = [81,00,77]
=> frame 2 = [00]
=> frame 3 = [81,199,77]

In that example only the 2nd frame's start event would be returned (the other frames are missing a start and/or end event, or contain the excepted event).
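The framing rule can also be sketched procedurally. This is only an illustration of the rule described above, not part of the SQL solution; the function names split_frames and qualifies are made up for this example, and 0 stands in for the 00 placeholder.

```python
START, END = 81, 77
EXCLUDED = {199}  # stand-in for the full exclusion list, as in the example above


def split_frames(events):
    """Split an ordered list of event ids into frames.

    A start event opens a new frame at its own position; an end event
    closes the current frame, so the event after it opens a new one.
    """
    frames = []
    previous = None
    for event in events:
        if not frames or event == START or previous == END:
            frames.append([])
        frames[-1].append(event)
        previous = event
    return frames


def qualifies(frame):
    """A frame qualifies if it has a start, an end, and no excluded event."""
    return (START in frame and END in frame
            and not any(e in EXCLUDED for e in frame))


events = [81, 0, 81, 0, 77, 0, 81, 199, 77]
frames = split_frames(events)
matching = [f for f in frames if qualifies(f)]

print(frames)    # [[81, 0], [81, 0, 77], [0], [81, 199, 77]]
print(matching)  # [[81, 0, 77]]
```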

WITH
  frame_start AS
(
  SELECT
    *,
    CASE
      WHEN
        81 = eventid
      OR
        77 = LAG(eventid) OVER (PARTITION BY header ORDER BY eventdate)
      THEN
        1
      ELSE
        0
    END
      AS new_frame
  FROM
    #data
),
  framed AS
(
  SELECT
    *,
    SUM(new_frame) OVER (PARTITION BY header ORDER BY eventdate) AS frame_id
  FROM
    frame_start
)
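SELECT
  header, MIN(eventdate) AS start_eventdate  -- earliest row in the frame, i.e. the 81
FROM
  framed
GROUP BY
  header, frame_id
HAVING
  -- the frame holds both a start (81) and an end (77) event
  SUM(CASE WHEN eventid IN (81,77) THEN 1 ELSE 0 END) = 2
  AND
  -- the frame holds none of the excluded events (list taken from the question)
  MAX(CASE WHEN eventid IN (60, 72, 73, 74, 75, 76, 83, 85, 86, 87, 103, 154, 166, 197, 199) THEN 1 ELSE 0 END) = 0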

Demo: https://dbfiddle.uk/?rdbms=sqlserver_2019&fiddle=d54493c87629e3e59759ac9d119ec6ad


Explanation:

The first CTE adds a column called new_frame.

  • 1 = current row is 81, or previous row is 77
  • 0 = everything else

This marks the start of each new frame (as described at the top here).

The next CTE assigns an id to every row in each frame by cumulatively summing new_frame in datetime order. The id starts at 0, then is incremented on each row by that row's new_frame value (if new_frame = 0, keep the same id as the previous row; if new_frame = 1, increment the id by 1).

At this point the header's rows are broken down into frames (as described at the top here).
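To see the two CTEs at work on the small example sequence from the top, here is a sketch that runs the same LAG and running-SUM logic in SQLite via Python's sqlite3 module. The table name data, the single header value 1, and the seq column (the position in the list, used in place of eventdate) are all assumptions made for this illustration.

```python
import sqlite3

events = [81, 0, 81, 0, 77, 0, 81, 199, 77]

conn = sqlite3.connect(":memory:")
# Hypothetical table: one header, with seq standing in for eventdate.
conn.execute("CREATE TABLE data (header INTEGER, eventid INTEGER, seq INTEGER)")
conn.executemany("INSERT INTO data VALUES (1, ?, ?)",
                 [(e, i) for i, e in enumerate(events)])

query = """
WITH frame_start AS (
    SELECT *,
           CASE WHEN eventid = 81
                  OR LAG(eventid) OVER (PARTITION BY header ORDER BY seq) = 77
                THEN 1 ELSE 0 END AS new_frame
    FROM data
)
SELECT eventid, new_frame,
       SUM(new_frame) OVER (PARTITION BY header ORDER BY seq) AS frame_id
FROM frame_start
ORDER BY seq
"""
rows = conn.execute(query).fetchall()
for eventid, new_frame, frame_id in rows:
    print(eventid, new_frame, frame_id)
```

Because this sequence begins with a start event, the ids come out as 1 to 4 rather than 0 to 3; the grouping of rows into frames is the same as in the example at the top.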

The final query groups by the frame and then filters the results with a HAVING clause. The first check is that the number of rows in the frame with 81 or 77 must total 2. Because an 81 always opens a new frame and a 77 always closes one, a frame can hold at most one of each, so a total of 2 means exactly one start and one end. The second check is that no rows in the frame can have an excepted event. If all checks pass, return the minimum timestamp in the frame, which by definition comes from the first row in the frame (the 81).
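As a rough check against the question's sample data, the whole query can be run in SQLite through Python's sqlite3 module. This is a sketch under assumptions, not a SQL Server test: the table is a hypothetical data table rather than #data, the dates are rewritten as ISO text so they sort correctly, and start_eventdate is an alias chosen for this example.

```python
import sqlite3

EXCLUDED = (60, 72, 73, 74, 75, 76, 83, 85, 86, 87, 103, 154, 166, 197, 199)

# (header, eventid, time of day) from the question; every row is on 2021-05-23.
ROWS = [
    (62252595, 22, "12:34:02"), (62252595, 81, "12:34:03"), (62252595, 29, "12:34:12"),
    (62252595, 40, "12:34:27"), (62252595, 22, "12:35:02"), (62252595, 22, "12:36:12"),
    (62252595, 37, "12:36:36"), (62252595, 77, "12:37:04"), (62252595, 6, "12:37:52"),
    (63252595, 39, "12:38:01"), (63252595, 81, "12:38:04"), (63252595, 37, "12:38:06"),
    (63252595, 21, "12:38:09"), (63252595, 75, "12:38:10"), (63252595, 77, "12:38:12"),
    (64252595, 29, "12:38:15"), (64252595, 26, "12:38:18"), (64252595, 81, "12:38:20"),
    (64252595, 40, "12:38:21"), (64252595, 81, "12:38:24"), (64252595, 83, "12:39:06"),
    (64252595, 77, "12:39:07"),
    (65252595, 41, "12:39:12"), (65252595, 81, "12:39:16"), (65252595, 37, "12:39:20"),
    (65252595, 18, "12:39:56"), (65252595, 18, "12:40:03"), (65252595, 77, "12:40:15"),
    (65252595, 36, "12:40:46"), (65252595, 77, "12:40:53"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (header INTEGER, eventid INTEGER, eventdate TEXT)")
conn.executemany("INSERT INTO data VALUES (?, ?, ?)",
                 [(h, e, "2021-05-23 " + t) for h, e, t in ROWS])

excluded_sql = ", ".join(str(e) for e in EXCLUDED)
query = f"""
WITH frame_start AS (
    -- Flag rows that open a new frame: an 81, or the row after a 77.
    SELECT *,
           CASE WHEN eventid = 81
                  OR LAG(eventid) OVER (PARTITION BY header ORDER BY eventdate) = 77
                THEN 1 ELSE 0 END AS new_frame
    FROM data
),
framed AS (
    -- Running total of the flags gives each frame an id within its header.
    SELECT *,
           SUM(new_frame) OVER (PARTITION BY header ORDER BY eventdate) AS frame_id
    FROM frame_start
)
SELECT header, MIN(eventdate) AS start_eventdate
FROM framed
GROUP BY header, frame_id
HAVING SUM(CASE WHEN eventid IN (81, 77) THEN 1 ELSE 0 END) = 2
   AND MAX(CASE WHEN eventid IN ({excluded_sql}) THEN 1 ELSE 0 END) = 0
ORDER BY header
"""
result = conn.execute(query).fetchall()
print(result)
```

On this data the query returns the 81 rows for headers 62252595 and 65252595, which lines up with the expected results stated in the question.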

It would be handy to see the actual expected results for the sample data, so I don't know whether this is correct, but it looks like you just need to calculate a date range for each header:

with h as (
  select *, 
    Min(case when eventid=81 then eventdate end) over(partition by header) Sdate, 
    Max(case when eventid=77 then eventdate end) over(partition by header) Edate
  from #data
)
select header, eventId, EventDate
from h
where eventdate between SDate and EDate
and EventId not in(60, 72, 73, 74, 75, 76, 83, 85, 86, 87, 103, 154, 166, 197, 199)
order by eventdate;


 