Deduplication of slowly changing dimension data
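The code below tries to describe my problem. Basically, I want to go from slowly changing dimension data (the #Haves table in the reproduction script):
to deduplicated data (the #Wants table in the same script):
My real data is of course larger (more IDs and values). Ideally, I would also like to avoid the #Dates helper table if possible. My current attempt: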
SELECT
ID
, Value1
, Value2
, MIN(#Dates.TheDate) AS StartDate
, MAX(#Dates.TheDate) AS EndDate
FROM #Dates
INNER JOIN #Haves ON #Dates.TheDate BETWEEN #Haves.StartDate AND #Haves.EndDate
GROUP BY
ID
, Value1
, Value2
does not produce the desired result.
Code to reproduce:
IF OBJECT_ID(N'tempdb..#Dates') IS NOT NULL DROP TABLE #Dates;
IF OBJECT_ID(N'tempdb..#Haves') IS NOT NULL DROP TABLE #Haves;
IF OBJECT_ID(N'tempdb..#Wants') IS NOT NULL DROP TABLE #Wants;
DECLARE @FromDate DATETIME, @ToDate DATETIME;
SET @FromDate = '2020-01-01';
SET @ToDate = '2020-01-31';
-- all days in that period
SELECT TOP (DATEDIFF(DAY, @FromDate, @ToDate)+1)
TheDate = DATEADD(DAY, number, @FromDate)
INTO #Dates
FROM [master].dbo.spt_values
WHERE [type] = N'P' ORDER BY number;
SELECT * FROM #Dates
SELECT
*
INTO #Haves
FROM (SELECT 1 ID, '2020-01-01' AS StartDate, '2020-01-03' AS EndDate, 1 Value1, 1 Value2
UNION
SELECT 1 ID, '2020-01-03' AS StartDate, '2020-01-05' AS EndDate, 1 Value1, 1 Value2
UNION
SELECT 1 ID, '2020-01-05' AS StartDate, '2020-01-07' AS EndDate, 3 Value1, 1 Value2
UNION
SELECT 1 ID, '2020-01-07' AS StartDate, '2999-01-01' AS EndDate, 1 Value1, 1 Value2
) AS IQ1;
SELECT * from #Haves
SELECT
ID
, Value1
, Value2
, MIN(#Dates.TheDate) AS StartDate
, MAX(#Dates.TheDate) AS EndDate
FROM #Dates
INNER JOIN #Haves ON #Dates.TheDate BETWEEN #Haves.StartDate AND #Haves.EndDate
GROUP BY
ID
, Value1
, Value2
SELECT
*
INTO #Wants
FROM (SELECT 1 ID, '2020-01-01' AS StartDate, '2020-01-05' AS EndDate, 1 Value1, 1 Value2
UNION
SELECT 1 ID, '2020-01-05' AS StartDate, '2020-01-07' AS EndDate, 3 Value1, 1 Value2
UNION
SELECT 1 ID, '2020-01-07' AS StartDate, '2999-01-01' AS EndDate, 1 Value1, 1 Value2
) AS IQ1;
SELECT * FROM #Wants
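This is known as a gaps-and-islands problem. You want to detect groups of consecutive rows. Rows 1 and 2 of your sample data count as one group because the end date of row 1 equals the start date of row 2, and id, value1 and value2 are the same in both rows.
So let's first flag every row where a new group starts. Then count the flags up to each row (a running sum) to get a group number. Finally, aggregate per group to get each group's start and end dates.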
SELECT
id,
MIN(startdate) AS startdate,
MAX(enddate) AS enddate,
MIN(value1) AS value1,
MIN(value2) AS value2
FROM
(
SELECT
id, startdate, enddate, value1, value2,
SUM(chg) OVER (PARTITION BY id ORDER BY startdate) AS grp
FROM
(
SELECT
id, startdate, enddate, value1, value2,
CASE WHEN startdate = LAG(enddate) OVER (PARTITION BY id ORDER BY startdate)
AND value1 = LAG(value1) OVER (PARTITION BY id ORDER BY startdate)
AND value2 = LAG(value2) OVER (PARTITION BY id ORDER BY startdate)
THEN 0
ELSE 1
END AS chg
FROM #Haves
) with_change_flags
) with_groups
GROUP BY id, grp
ORDER BY id, grp;
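To check the logic without a database, the same three steps (change flag, running sum, aggregation) can be traced in plain Python. This is only a sketch: the function name `merge_islands` and the tuple layout are made up for illustration, the rows are copied from #Haves and #Wants, and the dates are kept as ISO strings so that ordinary string comparison orders them correctly.

```python
# Rows are (id, startdate, enddate, value1, value2), copied from #Haves.
haves = [
    (1, "2020-01-01", "2020-01-03", 1, 1),
    (1, "2020-01-03", "2020-01-05", 1, 1),
    (1, "2020-01-05", "2020-01-07", 3, 1),
    (1, "2020-01-07", "2999-01-01", 1, 1),
]

# Expected result, copied from #Wants.
wants = [
    (1, "2020-01-01", "2020-01-05", 1, 1),
    (1, "2020-01-05", "2020-01-07", 3, 1),
    (1, "2020-01-07", "2999-01-01", 1, 1),
]


def merge_islands(rows):
    """Hypothetical helper mirroring the query: flag changes, number groups, aggregate."""
    groups = {}  # (id, grp) -> rows belonging to that island
    prev, grp = None, 0
    # PARTITION BY id ORDER BY startdate
    for row in sorted(rows, key=lambda r: (r[0], r[1])):
        if prev is not None and prev[0] != row[0]:
            prev, grp = None, 0  # new partition: LAG yields NULL, running sum restarts
        # Same test as the CASE expression: chg is 0 only if the row continues the previous one
        continues = (
            prev is not None
            and row[1] == prev[2]  # startdate = LAG(enddate)
            and row[3] == prev[3]  # value1 = LAG(value1)
            and row[4] == prev[4]  # value2 = LAG(value2)
        )
        grp += 0 if continues else 1  # SUM(chg) OVER (...)
        groups.setdefault((row[0], grp), []).append(row)
        prev = row
    # GROUP BY id, grp with MIN/MAX aggregates, ORDER BY id, grp
    return [
        (key[0],
         min(r[1] for r in g), max(r[2] for r in g),
         min(r[3] for r in g), min(r[4] for r in g))
        for key, g in sorted(groups.items())
    ]


result = merge_islands(haves)
```

For the sample rows, `result` equals `wants`: the first two rows collapse into one interval, while the last row stays separate even though its values match the first group, because the row with Value1 = 3 sits between them.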