Deduplication of slowly changing dimension data

With the code below I try to describe my problem. Basically, I want to go from slowly changing dimension data:

(image: the input rows; see #Haves in the repro code below)

to some deduplicated data:

(image: the desired output; see #Wants in the repro code below)

Obviously my real data is larger (more IDs and values). Ideally I would also like to avoid the #Dates helper table if possible. My current attempt:

SELECT
    ID
    , Value1
    , Value2
    , MIN(#Dates.TheDate) AS StartDate
    , MAX(#Dates.TheDate) AS EndDate
FROM #Dates
INNER JOIN #Haves ON #Dates.TheDate BETWEEN #Haves.StartDate AND #Haves.EndDate
GROUP BY
    ID
    , Value1
    , Value2

This does not produce the desired result.

Repro code:

IF OBJECT_ID(N'tempdb..#Dates') IS NOT NULL DROP TABLE #Dates;
IF OBJECT_ID(N'tempdb..#Haves') IS NOT NULL DROP TABLE #Haves;
IF OBJECT_ID(N'tempdb..#Wants') IS NOT NULL DROP TABLE #Wants;

DECLARE @FromDate DATETIME, @ToDate DATETIME;
SET @FromDate = '2020-01-01';
SET @ToDate = '2020-01-31';

-- all days in that period
SELECT TOP (DATEDIFF(DAY, @FromDate, @ToDate)+1)
  TheDate = DATEADD(DAY, number, @FromDate)
INTO #Dates
FROM [master].dbo.spt_values
WHERE [type] = N'P' ORDER BY number;

SELECT * FROM #Dates

SELECT
    *
INTO #Haves
FROM (SELECT 1 ID, '2020-01-01' AS StartDate, '2020-01-03' AS EndDate, 1 Value1, 1 Value2
      UNION
      SELECT 1 ID, '2020-01-03' AS StartDate, '2020-01-05' AS EndDate, 1 Value1, 1 Value2
      UNION
      SELECT 1 ID, '2020-01-05' AS StartDate, '2020-01-07' AS EndDate, 3 Value1, 1 Value2
      UNION
      SELECT 1 ID, '2020-01-07' AS StartDate, '2999-01-01' AS EndDate, 1 Value1, 1 Value2
) AS IQ1;

SELECT * from #Haves

SELECT
    ID
    , Value1
    , Value2
    , MIN(#Dates.TheDate) AS StartDate
    , MAX(#Dates.TheDate) AS EndDate
FROM #Dates
INNER JOIN #Haves ON #Dates.TheDate BETWEEN #Haves.StartDate AND #Haves.EndDate
GROUP BY
    ID
    , Value1
    , Value2

SELECT
    *
INTO #Wants
FROM (SELECT 1 ID, '2020-01-01' AS StartDate, '2020-01-05' AS EndDate, 1 Value1, 1 Value2
      UNION
      SELECT 1 ID, '2020-01-05' AS StartDate, '2020-01-07' AS EndDate, 3 Value1, 1 Value2
      UNION
      SELECT 1 ID, '2020-01-07' AS StartDate, '2999-01-01' AS EndDate, 1 Value1, 1 Value2
) AS IQ1;

SELECT * FROM #Wants

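This is called a gaps-and-islands problem. You want to detect groups of consecutive rows. Rows 1 and 2 in your screenshot count as one group, because row 1's end date equals row 2's start date and id, value1 and value2 are all equal.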
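To see why a plain GROUP BY is not enough, here is a small sketch. It is not the asker's exact query: it drops the #Dates join and uses Python's sqlite3 module as a stand-in for SQL Server, purely so it can be run anywhere. The table name haves mirrors #Haves. Grouping on id, value1, value2 alone merges rows 1, 2 and 4, even though row 3 sits between them with a different value1.

```python
import sqlite3

# Assumption: SQLite stands in for SQL Server; "haves" mirrors #Haves from the question.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE haves (id INT, startdate TEXT, enddate TEXT, value1 INT, value2 INT)"
)
conn.executemany(
    "INSERT INTO haves VALUES (?, ?, ?, ?, ?)",
    [
        (1, "2020-01-01", "2020-01-03", 1, 1),
        (1, "2020-01-03", "2020-01-05", 1, 1),
        (1, "2020-01-05", "2020-01-07", 3, 1),
        (1, "2020-01-07", "2999-01-01", 1, 1),
    ],
)

# Grouping by the values only ignores whether the rows are adjacent in time.
naive_rows = conn.execute(
    """
    SELECT id, value1, value2, MIN(startdate) AS startdate, MAX(enddate) AS enddate
    FROM haves
    GROUP BY id, value1, value2
    ORDER BY value1
    """
).fetchall()

for row in naive_rows:
    print(row)
```

Only two rows come back instead of the three wanted, and the (1, 1, 1) row spans 2020-01-01 to 2999-01-01, swallowing the period in which value1 was 3.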

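So let's first detect every group change. A running count of the changes up to each row then gives that row's group number. Finally, aggregate to get the start and end date of each group.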

SELECT
  id,
  MIN(startdate) AS startdate,
  MAX(enddate) AS enddate,
  MIN(value1) AS value1,
  MIN(value2) AS value2
FROM
(
  SELECT
    id, startdate, enddate, value1, value2,
    SUM(chg) OVER (PARTITION BY id ORDER BY startdate) AS grp
  FROM
  (
    SELECT
      id, startdate, enddate, value1, value2,
      CASE WHEN startdate = LAG(enddate) OVER (PARTITION BY id ORDER BY startdate)
            AND value1    = LAG(value1)  OVER (PARTITION BY id ORDER BY startdate)
            AND value2    = LAG(value2)  OVER (PARTITION BY id ORDER BY startdate)
        THEN 0
        ELSE 1
      END AS chg
    FROM #Haves
  ) with_change_flags
) with_groups
GROUP BY id, grp
ORDER BY id, grp;
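To make the intermediate steps visible, the sketch below runs the two inner levels of the query on the sample rows and prints chg and grp for each row. It uses Python's sqlite3 module as a stand-in for SQL Server (an assumption made only so the sketch is runnable; it needs an SQLite build with window functions, 3.25 or later), and the table name haves mirrors #Haves.

```python
import sqlite3

# Assumption: SQLite stands in for SQL Server; "haves" mirrors #Haves from the question.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE haves (id INT, startdate TEXT, enddate TEXT, value1 INT, value2 INT)"
)
conn.executemany(
    "INSERT INTO haves VALUES (?, ?, ?, ?, ?)",
    [
        (1, "2020-01-01", "2020-01-03", 1, 1),
        (1, "2020-01-03", "2020-01-05", 1, 1),
        (1, "2020-01-05", "2020-01-07", 3, 1),
        (1, "2020-01-07", "2999-01-01", 1, 1),
    ],
)

# Inner level: chg = 0 when a row continues the previous one, 1 otherwise.
# Outer level: the running sum of chg numbers the islands.
trace = conn.execute(
    """
    SELECT startdate, enddate, chg,
           SUM(chg) OVER (PARTITION BY id ORDER BY startdate) AS grp
    FROM (
      SELECT id, startdate, enddate,
             CASE WHEN startdate = LAG(enddate) OVER (PARTITION BY id ORDER BY startdate)
                   AND value1    = LAG(value1)  OVER (PARTITION BY id ORDER BY startdate)
                   AND value2    = LAG(value2)  OVER (PARTITION BY id ORDER BY startdate)
                  THEN 0
                  ELSE 1
             END AS chg
      FROM haves
    ) AS with_change_flags
    ORDER BY startdate
    """
).fetchall()

for row in trace:
    print(row)
```

The first row has no predecessor, so LAG returns NULL, the comparison is not true, and chg falls through to 1. Row 2 continues row 1 (chg 0, grp 1), while rows 3 and 4 each start a new island (grp 2 and 3). Grouping by id and grp then yields the three rows in #Wants.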

