Joining multiple tables with ValidFrom/ValidTo dates (SCD2)

Question: How do I JOIN multiple (3+) tables that all have SCD type 2 validFrom/validTo dates in them?

I have the following tables:

-- table 1
CREATE TABLE dbo.Clients (
    clientCode    varchar(10) NOT NULL,
    startDate     date NOT NULL,
    [name]        varchar(200) NOT NULL,
    CONSTRAINT PK_Clients PRIMARY KEY CLUSTERED (clientCode, startDate)
);

-- table 2
CREATE TABLE dbo.Projects (
    clientCode    varchar(10) NOT NULL,  --- Each project belongs to a client.
    projectCode   varchar(10) NOT NULL,
    startDate     date NOT NULL,
    [name]        varchar(200) NOT NULL,
    CONSTRAINT PK_Projects PRIMARY KEY CLUSTERED (projectCode, startDate)
);

.. with the following dummy data:

-- dummy data
INSERT INTO dbo.Clients (clientCode, startDate, [name])
VALUES ('A', {d '2010-01-01'}, 'Client A (first)'),
       ('A', {d '2011-04-01'}, 'Client A (second)'),
       ('A', {d '2011-09-01'}, 'Client A (third)'),
       ('A', {d '2012-02-01'}, 'Client A (fourth)'),
       ('A', {d '2014-01-01'}, 'Client A (fifth)'),
       ('B', {d '2010-01-01'}, 'Client B (first)'),
       ('B', {d '2011-02-01'}, 'Client B (second)'),
       ('B', {d '2011-08-01'}, 'Client B (third)'),
       ('B', {d '2011-12-01'}, 'Client B (fourth)'),
       ('B', {d '2012-11-01'}, 'Client B (fifth)');

-- dummy data
INSERT INTO dbo.Projects (clientCode, projectCode, startDate, [name])
VALUES ('A', '1', {d '2010-01-15'}, 'Project 1, first revision'),
       ('A', '1', {d '2012-04-22'}, 'Project 1, second revision'),
       ('A', '2', {d '2010-02-08'}, 'Project 2, first revision'),
       ('A', '2', {d '2010-09-12'}, 'Project 2, second revision'),
       ('A', '2', {d '2012-08-18'}, 'Project 2, third revision'),
       ('B', '3', {d '2011-04-01'}, 'Project 3, first revision'),
       ('B', '3', {d '2011-12-01'}, 'Project 3, second revision'),
       ('B', '3', {d '2014-02-28'}, 'Project 3, third revision');

Using these two tables, we generate startDate and endDate intervals:

--- Clients:
WITH c (clientCode, [name], startDate, endDate) AS (
    SELECT clientCode, [name], startDate,
           --- Find the next record's startDate, ordered by startDate.
           LEAD(startDate, 1, {d '2099-12-31'}) OVER (
               PARTITION BY clientCode
               ORDER BY startDate) AS endDate
    FROM dbo.Clients),

--- Projects:
     p (projectCode, clientCode, [name], startDate, endDate) AS (
    SELECT projectCode, clientCode, [name], startDate,
           --- Find the next record's startDate, order by startDate
           LEAD(startDate, 1, {d '2099-12-31'}) OVER (
               PARTITION BY projectCode
               ORDER BY startDate) AS endDate
    FROM dbo.Projects)

SELECT c.clientCode, c.[name] AS clientName,
       p.projectCode, p.[name] AS projectName,
       --- Start date is the last of (c.startDate, p.startDate)
       (CASE WHEN c.startDate<p.startDate THEN p.startDate ELSE c.startDate END) AS startDate,
       --- End date is the first of (c.endDate, p.endDate)
       (CASE WHEN c.endDate<p.endDate THEN c.endDate ELSE p.endDate END) AS endDate
FROM c
LEFT JOIN p ON
    c.clientCode=p.clientCode AND
    c.startDate<p.endDate AND
    c.endDate>p.startDate

-- If two new tables were introduced (t3 and t4), would the following JOINs work?
-- LEFT JOIN dbo.Table3 AS t3
-- ON p.clientCode = t3.clientCode AND
-- p.startDate<t3.endDate AND
-- p.endDate>t3.startDate
-- LEFT JOIN dbo.Table4 AS t4
-- ON t3.toolId = t4.toolId AND      --> toolId is a new key that I need for the join, since t4 does not have clientCode
-- t3.startDate<t4.endDate AND
-- t3.endDate>t4.startDate
ORDER BY c.clientCode, p.projectCode, 5;

My problem: at the bottom of the query above, I commented out the LEFT JOINs that I will have to add when more SCD2 tables are introduced. I am unsure whether these commented-out LEFT JOINs will work. Do you see any issues with them?

Adding more JOINs may also conflict with the CASE WHEN expressions used in the query above:
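One possible issue with the commented-out joins can be sketched with made-up intervals: t3 is tested for overlap against p only, so a t3 row can overlap the project version without overlapping the client version it ends up joined to. The table variables `@c`, `@p` and `@t3` and all of their dates are hypothetical, and the sketch assumes Table3 already exposes startDate/endDate columns.

```sql
-- Hypothetical half-open intervals [startDate, endDate); all values are made up.
DECLARE @c  TABLE (clientCode varchar(10), startDate date, endDate date);
DECLARE @p  TABLE (clientCode varchar(10), startDate date, endDate date);
DECLARE @t3 TABLE (clientCode varchar(10), startDate date, endDate date);

INSERT INTO @c  VALUES ('A', '2010-01-01', '2011-04-01');
INSERT INTO @p  VALUES ('A', '2010-01-15', '2012-04-22');
INSERT INTO @t3 VALUES ('A', '2010-06-01', '2010-12-01'),  -- overlaps both c and p
                       ('A', '2011-06-01', '2011-09-01');  -- overlaps p, but not c

-- Overlap tested against p only, as in the commented-out joins: 2 rows.
SELECT c.startDate AS cStart, p.startDate AS pStart, t3.startDate AS t3Start
FROM @c AS c
LEFT JOIN @p AS p ON
    c.clientCode = p.clientCode AND
    c.startDate < p.endDate AND c.endDate > p.startDate
LEFT JOIN @t3 AS t3 ON
    p.clientCode = t3.clientCode AND
    p.startDate < t3.endDate AND p.endDate > t3.startDate;

-- Overlap tested against every earlier table: only the first t3 row remains.
SELECT c.startDate AS cStart, p.startDate AS pStart, t3.startDate AS t3Start
FROM @c AS c
LEFT JOIN @p AS p ON
    c.clientCode = p.clientCode AND
    c.startDate < p.endDate AND c.endDate > p.startDate
LEFT JOIN @t3 AS t3 ON
    p.clientCode = t3.clientCode AND
    p.startDate < t3.endDate AND p.endDate > t3.startDate AND
    c.startDate < t3.endDate AND c.endDate > t3.startDate;
```

In the first query the second t3 row would produce an output interval whose start (2011-06-01) is later than its end (2011-04-01), which suggests each new join should be checked against the intersection of all earlier intervals and not only against its neighbour.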

       --- Start date is the last of (c.startDate, p.startDate)
       (CASE WHEN c.startDate<p.startDate THEN p.startDate ELSE c.startDate END) AS startDate,
       --- End date is the first of (c.endDate, p.endDate)
       (CASE WHEN c.endDate<p.endDate THEN c.endDate ELSE p.endDate END) AS endDate

These CASE WHEN expressions are used because I do not want two output intervals to cover the same date. The output interval is therefore defined by the larger of (a.startDate, b.startDate) and the smaller of (a.endDate, b.endDate).

I see an issue here, since these CASE WHEN expressions only evaluate the startDate and endDate intervals of 2 tables, not of 3, 4 or more tables.

How can this be solved?
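One way the same "latest start, earliest end" rule might be extended to any number of tables is to aggregate over a VALUES list in place of the pairwise CASE WHEN. This is a minimal sketch: the variables below stand in for the columns of one joined row (for example c.startDate, p.startDate, t3.startDate), and all dates are made up.

```sql
-- Hypothetical interval bounds for one joined row from three SCD2 sources.
DECLARE @cStart date = '2010-01-01', @cEnd date = '2011-04-01',
        @pStart date = '2010-01-15', @pEnd date = '2012-04-22',
        @tStart date = '2010-06-01', @tEnd date = '2010-12-01';

SELECT
    -- Latest start across all sources.
    (SELECT MAX(d) FROM (VALUES (@cStart), (@pStart), (@tStart)) AS v (d)) AS startDate,
    -- Earliest end across all sources.
    (SELECT MIN(d) FROM (VALUES (@cEnd), (@pEnd), (@tEnd)) AS v (d)) AS endDate;
-- startDate = 2010-06-01, endDate = 2010-12-01
```

Adding a fourth table then only means adding one more entry to each VALUES list. MAX and MIN skip NULLs, so rows left unmatched by a LEFT JOIN do not blank out the result. SQL Server 2022 and Azure SQL Database also provide GREATEST and LEAST functions for the same purpose.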

Would you be interested in using SQL Server's geometry data type to represent time periods? Here I applied it to your example:

WITH c (clientCode, [name], Perd) AS (
    SELECT clientCode, [name],
           Perd=geometry::STGeomFromText('LINESTRING (' + format(startdate,'yyyyMMdd')+' 0, '+
                      format(LEAD(startDate, 1, {d '2099-12-31'}) OVER (
                               PARTITION BY clientCode
                               ORDER BY startDate) , 'yyyyMMdd') +' 0)', 0)
    FROM #Clients),

--- Projects:
     p (projectCode, clientCode, [name], Perd) AS (
    SELECT projectCode, clientCode, [name], 
           Perd=geometry::STGeomFromText('LINESTRING (' + format(startdate,'yyyyMMdd')+' 0, '+
                       format(LEAD(startDate, 1, {d '2099-12-31'}) OVER (
                                PARTITION BY projectCode
                                ORDER BY startDate) , 'yyyyMMdd') +' 0)', 0)
    FROM #Projects)
SELECT c.clientCode, c.[name] AS clientName,
       p.projectCode, p.[name] AS projectName,
       startDate=try_cast(format(c.Perd.STIntersection(p.Perd).STEndPoint().STX ,'########') as date),
       endDate=try_cast(format(c.Perd.STIntersection(p.Perd).STStartPoint().STX, '########') as date)
FROM 
    c
    inner join
    p on
    c.clientCode=p.clientCode AND p.Perd.STIntersection(c.Perd).STLength()>0
order by 1,5

This can be easier to nest as a subquery and join to another temporal table.

I would imagine that this wouldn't be very fast with very large data sets, though.
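A rough sketch of what such nesting could look like, assuming a third temporal source `@t3` that is purely hypothetical. The periods are hard-coded LINESTRINGs with dates encoded as yyyymmdd on the X axis, as in the query above. The result dates are read with MIN/MAX over both end points so the sketch does not depend on the direction of the line returned by STIntersection.

```sql
-- Hypothetical periods; all names and dates are made up for illustration.
DECLARE @c  TABLE (clientCode varchar(10), Perd geometry);
DECLARE @p  TABLE (clientCode varchar(10), Perd geometry);
DECLARE @t3 TABLE (clientCode varchar(10), Perd geometry);

INSERT INTO @c  VALUES ('A', geometry::STGeomFromText('LINESTRING (20100101 0, 20110401 0)', 0));
INSERT INTO @p  VALUES ('A', geometry::STGeomFromText('LINESTRING (20100115 0, 20120422 0)', 0));
INSERT INTO @t3 VALUES ('A', geometry::STGeomFromText('LINESTRING (20100601 0, 20101201 0)', 0));

WITH cp AS (
    -- First intersection (client x project), nested as a subquery.
    SELECT c.clientCode, c.Perd.STIntersection(p.Perd) AS Perd
    FROM @c AS c
    INNER JOIN @p AS p ON
        p.clientCode = c.clientCode AND
        c.Perd.STIntersection(p.Perd).STLength() > 0)
SELECT cp.clientCode,
       -- Convert the yyyymmdd X coordinates back to dates (style 112 = yyyymmdd).
       CONVERT(date, CONVERT(char(8), CAST(ROUND(b.x1, 0) AS int)), 112) AS startDate,
       CONVERT(date, CONVERT(char(8), CAST(ROUND(b.x2, 0) AS int)), 112) AS endDate
FROM cp
INNER JOIN @t3 AS t3 ON
    t3.clientCode = cp.clientCode AND
    cp.Perd.STIntersection(t3.Perd).STLength() > 0
CROSS APPLY (
    -- Smaller and larger X of the second intersection, whatever its direction.
    SELECT MIN(x) AS x1, MAX(x) AS x2
    FROM (VALUES (cp.Perd.STIntersection(t3.Perd).STStartPoint().STX),
                 (cp.Perd.STIntersection(t3.Perd).STEndPoint().STX)) AS v (x)) AS b;
```

Each further temporal table would be joined the same way, against the intersection carried forward from the previous step.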

