SQL to find the first occurrence of sets of data in a table
Say I have a table:
CREATE TABLE T
(
TableDTM TIMESTAMP NOT NULL,
Code INT NOT NULL
);
And I insert some rows:
INSERT INTO T (TableDTM, Code) VALUES ('2011-01-13 10:00:00', 5);
INSERT INTO T (TableDTM, Code) VALUES ('2011-01-13 10:10:00', 5);
INSERT INTO T (TableDTM, Code) VALUES ('2011-01-13 10:20:00', 5);
INSERT INTO T (TableDTM, Code) VALUES ('2011-01-13 10:30:00', 5);
INSERT INTO T (TableDTM, Code) VALUES ('2011-01-13 10:40:00', 0);
INSERT INTO T (TableDTM, Code) VALUES ('2011-01-13 10:50:00', 1);
INSERT INTO T (TableDTM, Code) VALUES ('2011-01-13 11:00:00', 1);
INSERT INTO T (TableDTM, Code) VALUES ('2011-01-13 11:10:00', 1);
INSERT INTO T (TableDTM, Code) VALUES ('2011-01-13 11:20:00', 0);
INSERT INTO T (TableDTM, Code) VALUES ('2011-01-13 11:30:00', 5);
INSERT INTO T (TableDTM, Code) VALUES ('2011-01-13 11:40:00', 5);
INSERT INTO T (TableDTM, Code) VALUES ('2011-01-13 11:50:00', 3);
INSERT INTO T (TableDTM, Code) VALUES ('2011-01-13 12:00:00', 3);
INSERT INTO T (TableDTM, Code) VALUES ('2011-01-13 12:10:00', 3);
So I end up with a table similar to:
2011-01-13 10:00:00, 5
2011-01-13 10:10:00, 5
2011-01-13 10:20:00, 5
2011-01-13 10:30:00, 5
2011-01-13 10:40:00, 0
2011-01-13 10:50:00, 1
2011-01-13 11:00:00, 1
2011-01-13 11:10:00, 1
2011-01-13 11:20:00, 0
2011-01-13 11:30:00, 5
2011-01-13 11:40:00, 5
2011-01-13 11:50:00, 3
2011-01-13 12:00:00, 3
2011-01-13 12:10:00, 3
How can I select the first date of each set of identical numbers, so I end up with this:
2011-01-13 10:00:00, 5
2011-01-13 10:40:00, 0
2011-01-13 10:50:00, 1
2011-01-13 11:20:00, 0
2011-01-13 11:30:00, 5
2011-01-13 11:50:00, 3
I've been messing about with subqueries and the like for most of the day, and for some reason I can't seem to crack it. I'm sure there's a simple way somewhere!
I would probably want to exclude the 0s from the results, but that's not important for now.
"I'm sure there's a simple way somewhere!"
Yes, there is. But first, two Issues.
The table is not a Relational Database table. It does not have a unique key, which is demanded by the RM and Normalisation (specifically that each row must have a unique identifier; not necessarily a PK). Therefore SQL, a standard language for operating on Relational Database tables, cannot perform basic operations on it.
So the question really is: SQL to find the first occurrence of sets of data in a non-relational Heap.
Now if your question was: SQL to find the first occurrence of sets of data in a Relational table, implying of course some unique row identifier, that would be (a) easy in SQL, and (b) fast in any flavour of SQL ...
The question is very generic (no complaint). But many of these specific needs are usually applied within a larger context, and the context has requirements which are absent from the specification here. Generally the need is for a simple Subquery (but in Oracle use a Materialised View to avoid the subquery). And the subquery, too, depends on the outer context, the outer query. Therefore the answer to the small generic question will not contain the answer to the actual specific need.
Anyway, I do not wish to avoid the question. Why don't we use a real-world example, rather than a simple generic one, and find the first or last occurrence, or minimum or maximum value, of a set of data, within another set of data, in a Relational table?
Main Query
Let's use the ▶Data Model◀ from your previous question.
Report all Alerts since a certain date, with the peak Value for the duration, that are not Acknowledged.
Since you will be using exactly the same technique (with different table and column names) for all your temporal and History requirements, you need to fully understand the basic construct of a Subquery, and its different applications.
Note that you have not only a pure 5NF Database with Relational Identifiers (composite keys); you also have full Temporal capability throughout, and the temporal requirement is rendered without breaking 5NF (no Update Anomalies), which means the ValidToDateTime for periods and durations is derived, and not duplicated in data. The point is, that complicates things, hence this is not the best example for a tutorial on Subqueries.
First build the Outer query using minimum joins, etc, based on the structure of the result set that you need, and nothing more. It is very important that the structure of the outer query is resolved first; otherwise you will go back and forth trying to make the subquery fit the outer query, and vice versa.
Here, the outer set is Alerts after a certain date.
The ▶SQL code◀ required is on page 1 (sorry, the SO edit features are horrible, it destroys the formatting, and the code is already formatted).
Then build the Subquery to fill each cell.
Subquery (1) Derive Alert.Value
That is a simple derived datum: select the Value from the Reading that generated the Alert. The tables are related, the cardinality is 1::1, so it is a straight join on the PK.
The ▶SQL code◀ required is on page 2.
I have purposely given you a mix of joins in the Outer Query vs obtaining data via Subquery, so that you can learn (you could alternately obtain Alert.Value via a join, but that would be even more cumbersome).
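The linked page is not reproduced here, so the following is only a minimal sketch of what such a Correlated Scalar Subquery might look like. The key column SensorNo and the sample rows are assumptions made for illustration (the real Data Model has further key columns), and the sample data is supplied inline through CTEs, which should run as written on PostgreSQL and SQLite but would need adapting for other flavours.

```sql
-- Hypothetical, simplified stand-ins for the Reading and Alert tables.
WITH Reading (SensorNo, ReadingDtm, Value) AS (
    VALUES (1, '2011-01-13 10:00:00', 40),
           (1, '2011-01-13 10:10:00', 60),
           (1, '2011-01-13 10:20:00', 75)
),
Alert (SensorNo, ReadingDtm) AS (
    VALUES (1, '2011-01-13 10:10:00')
)
SELECT a.SensorNo,
       a.ReadingDtm,
       -- Correlated Scalar Subquery: exactly one Value per Alert row,
       -- matched on the full (assumed) key, hence the 1::1 cardinality.
       (SELECT r.Value
          FROM Reading r
         WHERE r.SensorNo   = a.SensorNo
           AND r.ReadingDtm = a.ReadingDtm) AS Value
  FROM Alert a
 WHERE a.ReadingDtm >= '2011-01-13 00:00:00';  -- "since a certain date"
```

With this sample data the single Alert row is returned with Value 60.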
The next Subquery we need derives Alert.PeakValue. For that we need to determine the Temporal Duration of the Alert. We have the beginning of the Alert Duration; we need to determine the end of the Duration, which is the next (temporally) Reading.Value that is within range. That requires a Subquery as well, which we had better handle first.
Subquery (2) Derive Alert.EndDtm
A slightly more complex Subquery to select the first Reading.ReadingDtm that is greater than or equal to the Alert.ReadingDtm, and that has a Reading.Value which is less than or equal to its Sensor.UpperLimit.
Handling 5NF Temporal Data
For handling temporal requirements in a 5NF Database (in which EndDateTime is not stored, as it is duplicate data), we work on a StartDateTime only, and the EndDateTime is derived: it is the next StartDateTime. This is the Temporal notion of Duration.
We can report EndDateTime as simply the Next.StartDateTime, and ignore the one millisecond issue.
The comparisons that bracket a Duration are >= This.StartDateTime and < Next.StartDateTime.
Keep those temporal comparisons distinct from value comparisons such as the one against Sensor.UpperLimit (i.e. watch for it, because both are often located in one WHERE clause, and it is easy to mix them up or get confused).
The ▶SQL code◀ required, along with test data used, is on page 3.
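As a rough illustration of Subquery (2), and not the code from the linked page, the sketch below derives EndDtm for each Alert. The SensorNo key, the UpperLimit of 50 and all sample rows are assumptions; the data is inlined through CTEs (PostgreSQL and SQLite syntax). Note how the temporal comparison and the limit comparison sit side by side in one WHERE clause.

```sql
-- Hypothetical, simplified stand-ins for Sensor, Reading and Alert.
WITH Sensor (SensorNo, UpperLimit) AS (
    VALUES (1, 50)
),
Reading (SensorNo, ReadingDtm, Value) AS (
    VALUES (1, '2011-01-13 10:00:00', 40),
           (1, '2011-01-13 10:10:00', 60),  -- out of range: raises the Alert
           (1, '2011-01-13 10:20:00', 75),
           (1, '2011-01-13 10:30:00', 45),  -- first Reading back within range
           (1, '2011-01-13 10:40:00', 42)
),
Alert (SensorNo, ReadingDtm) AS (
    VALUES (1, '2011-01-13 10:10:00')
)
SELECT a.SensorNo,
       a.ReadingDtm,
       (SELECT MIN(r.ReadingDtm)               -- "first" = earliest
          FROM Reading r
          JOIN Sensor s ON s.SensorNo = r.SensorNo
         WHERE r.SensorNo   =  a.SensorNo
           AND r.ReadingDtm >= a.ReadingDtm    -- temporal comparison
           AND r.Value      <= s.UpperLimit    -- business-limit comparison
       ) AS EndDtm
  FROM Alert a;
```

For the sample rows, EndDtm comes out as 2011-01-13 10:30:00. An Alert whose Readings never return within range would get a NULL EndDtm under this sketch.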
Subquery (3) Derive Alert.PeakValue
Now it is easy. Select the MAX(Value) from Readings between Alert.ReadingDtm and Alert.EndDtm, the duration of the Alert.
The ▶SQL code◀ required is on page 4.
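A hedged sketch of how the peak could be derived, again with an assumed SensorNo key, an assumed UpperLimit of 50 and invented sample rows inlined as CTEs (PostgreSQL and SQLite syntax). The end of the duration is computed by a nested Subquery, and the duration is bracketed with >= start and < end.

```sql
-- Hypothetical, simplified stand-ins for Sensor, Reading and Alert.
WITH Sensor (SensorNo, UpperLimit) AS (
    VALUES (1, 50)
),
Reading (SensorNo, ReadingDtm, Value) AS (
    VALUES (1, '2011-01-13 10:00:00', 40),
           (1, '2011-01-13 10:10:00', 60),  -- Alert starts here
           (1, '2011-01-13 10:20:00', 75),  -- peak during the Alert
           (1, '2011-01-13 10:30:00', 45),  -- Alert ends here
           (1, '2011-01-13 10:40:00', 42)
),
Alert (SensorNo, ReadingDtm) AS (
    VALUES (1, '2011-01-13 10:10:00')
)
SELECT a.SensorNo,
       a.ReadingDtm,
       (SELECT MAX(p.Value)
          FROM Reading p
         WHERE p.SensorNo   =  a.SensorNo
           AND p.ReadingDtm >= a.ReadingDtm
           AND p.ReadingDtm <  (                -- derived end of the duration
                   SELECT MIN(r.ReadingDtm)
                     FROM Reading r
                     JOIN Sensor s ON s.SensorNo = r.SensorNo
                    WHERE r.SensorNo   =  a.SensorNo
                      AND r.ReadingDtm >= a.ReadingDtm
                      AND r.Value      <= s.UpperLimit)
       ) AS PeakValue
  FROM Alert a;
```

With the sample rows the PeakValue is 75. If an Alert has no end yet, the inner Subquery yields NULL and so does PeakValue; whether that is acceptable depends on the surrounding requirement.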
Scalar Subquery
In addition to being Correlated Subqueries, the above are all Scalar Subqueries, as they return a single value; each cell in the grid can be filled with only one value. (Non-Scalar Subqueries, which return multiple values, are quite legal, but not for the above.)
Subquery (4) Acknowledged Alerts
Ok, now that you have a handle on the above Correlated Scalar Subqueries, those that fill cells in a set, a set that is defined by the Outer query, let's look at a Subquery that can be used to constrain the Outer query. We do not really want all Alerts (above), we want Un-Acknowledged Alerts: the Identifiers that exist in Alert, that do not exist in Acknowledgement. That is not filling cells, that is changing the content of the Outer set. Of course, that means changing the WHERE clause.
No change to the FROM and existing WHERE clauses. Simply add a WHERE condition to exclude the set of Acknowledged Alerts. 1::1 cardinality, straight Correlated join.
The ▶SQL code◀ required is on page 5.
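A minimal sketch of such a constraining Subquery, assuming that Acknowledgement carries the same key as Alert. The SensorNo column and the sample rows are invented for illustration, and the data is inlined as CTEs (PostgreSQL and SQLite syntax).

```sql
-- Hypothetical, simplified stand-ins for Alert and Acknowledgement.
WITH Alert (SensorNo, ReadingDtm) AS (
    VALUES (1, '2011-01-13 10:10:00'),
           (1, '2011-01-13 11:30:00'),
           (2, '2011-01-13 10:50:00')
),
Acknowledgement (SensorNo, ReadingDtm) AS (
    VALUES (1, '2011-01-13 10:10:00')
)
SELECT a.SensorNo,
       a.ReadingDtm
  FROM Alert a
 WHERE a.ReadingDtm >= '2011-01-13 00:00:00'   -- existing condition, unchanged
   AND NOT EXISTS (                            -- added condition
           SELECT 1                            -- existence check only
             FROM Acknowledgement k
            WHERE k.SensorNo   = a.SensorNo
              AND k.ReadingDtm = a.ReadingDtm)
 ORDER BY a.SensorNo, a.ReadingDtm;
```

The acknowledged Alert (sensor 1 at 10:10) is excluded, leaving the other two rows.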
The difference is, this is a non-Scalar Subquery, producing a set of rows (one column). We have an entire set of Alerts (the Outer set) matched against an entire set of Acknowledgements.
The Subquery selects the constant 1, because we are performing an existence check. Visualise it as a column added onto the Alert set defined by the Outer query.
Where WHERE NOT IN () is required instead, note that it constructs the defined column set, then compares the two sets. Much slower.
Subquery (5) Actioned Alerts
As an alternative constraint on the Outer query, for un-actioned Alerts, instead of (4), exclude the set of Actioned Alerts. Straight Correlated join.
The ▶SQL code◀ required is on page 5.
This code has been tested on Sybase ASE 15.0.3 using 1000 Alerts and 200 Acknowledgements, of different combinations; and the Readings and Alerts identified in the document. Zero milliseconds execution time (0.003 second resolution) for all executions.
If you need it, here is the ▶SQL Code in Text Format◀.
(6) ▶Register Alert from Reading◀
This code executes in a loop (provided), selecting new Readings which are out-of-range, and creating Alerts, except where applicable Alerts already exist.
(7) ▶Load Alert From Reading◀
Given that you have a full set of test data for Reading, this code uses a modified form of (6) to load the applicable Alerts.
It is "simple" when you know how. I repeat, writing SQL without the ability to write Subqueries is very limiting; it is essential for handling Relational Databases, which is what SQL was designed for.
I think you can figure out the remaining queries you have.
Note, this example also happens to demonstrate the power of using Relational Identifiers, in that several tables in-between the ones we want do not have to be joined (yes! the truth is Relational Identifiers mean fewer, not more, joins than Id keys). Simply follow the solid lines.
The keys here contain DateTime. Imagine trying to code the above with Id PKs: there would be two levels of processing, one for the joins (and there would be far more of them), and another for the data processing.
I try to stay away from colloquial labels ("nested", "inner", etc) because they are not specific, and stick to specific technical terms. For completeness and understanding:
A Subquery after the FROM clause is a Materialised View, a result set derived in one query and then fed into the FROM clause of another query, as a "table".
A Subquery in the WHERE clause is a Predicate Subquery, because it changes the content of the result set (that which it is predicated upon). It can return either a Scalar (one value) or non-Scalar (many values).
for Scalars, use WHERE column =, or any scalar operator
for non-Scalars, use WHERE [NOT] EXISTS, or WHERE column [NOT] IN
A Subquery in the WHERE clause does not need to be Correlated; the following works just fine. Identify all superfluous appendages:
SELECT [Never] = FirstName,
       [Acted] = LastName
    FROM User
    WHERE UserId NOT IN (
        SELECT DISTINCT UserId
            FROM Action
        )
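To make the Scalar versus non-Scalar distinction for Predicate Subqueries concrete, here is a small sketch that uses both forms in one WHERE clause, neither of them Correlated. The Sensor and Reading columns and all sample rows are assumptions for illustration, inlined as CTEs (PostgreSQL and SQLite syntax).

```sql
-- Hypothetical, simplified stand-ins for Sensor and Reading.
WITH Sensor (SensorNo, UpperLimit) AS (
    VALUES (1, 50),
           (2, 120)
),
Reading (SensorNo, ReadingDtm, Value) AS (
    VALUES (1, '2011-01-13 10:00:00', 40),
           (1, '2011-01-13 10:10:00', 75),
           (2, '2011-01-13 10:00:00', 75),
           (2, '2011-01-13 10:10:00', 30)
)
SELECT r.SensorNo,
       r.ReadingDtm,
       r.Value
  FROM Reading r
       -- Scalar Predicate Subquery: returns one value, so "=" applies.
 WHERE r.Value = (SELECT MAX(x.Value) FROM Reading x)
       -- Non-Scalar Predicate Subquery: returns many values, so IN applies.
   AND r.SensorNo IN (SELECT s.SensorNo
                        FROM Sensor s
                       WHERE s.UpperLimit < 100);
```

Two Readings carry the maximum Value of 75, but only sensor 1 has an UpperLimit below 100, so a single row is returned.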
Try this:
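-- For each row, XTableDTM is the time of the next row with a different Code.
-- Rows in the same run share that value (NULL for the final run),
-- so grouping on it isolates each run; MIN then picks the run's first row.
SELECT MIN(TableDTM) TableDTM, Code
FROM
(
    SELECT T1.TableDTM, T1.Code, MIN(T2.TableDTM) XTableDTM
    FROM T T1
    LEFT JOIN T T2
        ON T1.TableDTM <= T2.TableDTM
        AND T1.Code <> T2.Code
    GROUP BY T1.TableDTM, T1.Code
) X
GROUP BY XTableDTM, Code
ORDER BY 1;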
Could you try something like:
SELECT DISTINCT Code, (SELECT MIN(TableDTM) FROM T AS Q WHERE Q.Code = T.Code) AS TableDTM FROM T;
and if you need to exclude the 0, change it to:
SELECT DISTINCT Code, (SELECT MIN(TableDTM) FROM T AS Q WHERE Q.Code = T.Code) AS TableDTM FROM T WHERE Code <> 0;
Note that this returns the earliest TableDTM per Code across the whole table, so a Code that recurs in a later run (such as 0 or 5 in the sample data) appears only once.
Maybe I don't understand the question. But I don't see any mention of Common Table Expressions or Analytic Functions. These are my weapons of choice for most problems, and when they can't handle it I start resorting to temporary tables.
I recently solved a similar problem where I wanted to get the data of the first occurrence of an error when processing a daily interface file. Records on the interface that cause a problem are moved to a set of holding tables so the rest of the records can be processed.
-- EE with errors removed from most recent batch
with current_batch as (
select employee_number, PVL.ADDITIONAL_INFORMATION
from PERSONNEL_VALIDATION_LOG_X PVL
where PVL.PERSONNEL_BATCH_ID = EMPSRV.CURRENTPERSONNELBATCH(6,900)
)
, hist as (
select
row_number() over (
partition by X.EMPLOYEE_NUMBER, X.ADDITIONAL_INFORMATION
order by B.BATCH_STATUS_DATE
) as RN,
B.PERSONNEL_BATCH_ID BatchId,
B.SUBMITTAL_DATE,
X.EMPLOYEE_NUMBER EMPNUM,
MX.LAST_NAME,
MX.FIRST_NAME,
X.ADDITIONAL_INFORMATION
from PERSONNEL_VALIDATION_LOG_X X
join current_batch C on
X.Employee_number = C.EMPLOYEE_NUMBER
and X.additional_information = C.ADDITIONAL_INFORMATION
join empsrv.personnel_batch B
on B.PERSONNEL_BATCH_ID = X.PERSONNEL_BATCH_ID
join EMPSRV.PERSONNEL_MEMBER_DATA_X MX
on X.PERSONNEL_BATCH_ID = MX.PERSONNEL_BATCH_ID
and X.EMPLOYEE_NUMBER = MX.EMPLOYEE_NUMBER
)
select
batchId,
to_char(submittal_date, 'mm/dd/yyyy') First_Reported,
EmpNum,
Last_name,
first_name,
additional_information
from hist where rn = 1
order by submittal_date desc;
The first CTE just limits the population to current errors. The hist CTE goes through the logs and picks up the first occurrence of that error (i.e. same EE and message). This isn't perfect: if the error went away and came back, I would get the oldest occurrence and not the start of the most recent sequence. But this is good enough, and that case is not likely given the shape of the error message itself. The final query just picks off the top row of each group, which will be the first occurrence.
The query takes a few seconds to run, but my logs are not especially large, so performance is almost never an issue for me. I also don't pay much attention to the dates on the questions.
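For the table T from the original question, an analytic-function approach could look roughly like the sketch below. It assumes a SQL flavour with window functions (LAG is available in, for example, Oracle, PostgreSQL, SQL Server 2012+ and SQLite 3.25+) and that TableDTM values are unique, so that the ordering is deterministic.

```sql
-- Keep a row only when its Code differs from the Code of the previous row
-- (ordered by TableDTM); those rows are the starts of each run.
SELECT TableDTM, Code
  FROM (
        SELECT TableDTM,
               Code,
               LAG(Code) OVER (ORDER BY TableDTM) AS PrevCode
          FROM T
       ) X
 WHERE PrevCode IS NULL        -- very first row has no predecessor
    OR PrevCode <> Code        -- Code changed: a new run starts here
 ORDER BY TableDTM;
```

Against the rows inserted in the question, this returns the six rows listed as the desired result. To leave out the runs of 0, wrap the two conditions in parentheses and add AND Code <> 0 to the outer WHERE clause.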