
Year Over Year (YOY) Distinct Count

EDITED:

I'm working in SQL Server 2005 and I'm trying to get a year over year (YOY) count of distinct users for the current fiscal year (say Jun 1-May 31) and the past 3 years. I can do what I need by running a select statement four times, but I can't find a better way at this point. I can get a distinct count for each year in one query, but I need it to be a cumulative distinct count. Below is a mockup of what I have so far:

SELECT [Year], COUNT(DISTINCT UserID)
FROM
(
  SELECT u.uID AS UserID, 
      CASE 
        WHEN dd.ddEnd BETWEEN @yearOneStart AND @yearOneEnd THEN 'Year1'
        WHEN dd.ddEnd BETWEEN @yearTwoStart AND @yearTwoEnd THEN 'Year2'
        WHEN dd.ddEnd BETWEEN @yearThreeStart AND @yearThreeEnd THEN 'Year3'
        WHEN dd.ddEnd BETWEEN @yearFourStart AND @yearFourEnd THEN 'Year4'
        ELSE 'Other'
      END AS [Year]
  FROM Users AS u   
  INNER JOIN UserDataIDMatch AS udim
    ON u.uID = udim.udim_FK_uID
  INNER JOIN DataDump AS dd
    ON udim.udimUserSystemID = dd.ddSystemID
) AS Data
WHERE LOWER([Year]) <> 'other'
GROUP BY
  [Year]

I get something like:

Year1    1
Year2    1
Year3    1
Year4    1

But I really need:

Year1    1
Year2    2
Year3    3
Year4    4

Below is a rough schema and set of values (updated for simplicity). I tried to create a SQL Fiddle, but I'm getting a disk space error when I attempt to build the schema.

CREATE TABLE Users
(
  uID int identity primary key, 
  uFirstName varchar(75), 
  uLastName varchar(75)
);

INSERT INTO Users (uFirstName, uLastName)
VALUES
('User1', 'User1'),
('User2', 'User2'),
('User3', 'User3'),
('User4', 'User4');

CREATE TABLE UserDataIDMatch
(
  udimID int identity primary key,
  udim_FK_uID int foreign key references Users(uID),
  udimUserSystemID varchar(75)
);

INSERT INTO UserDataIDMatch (udim_FK_uID, udimUserSystemID)
VALUES
(1, 'SystemID1'),
(2, 'SystemID2'),
(3, 'SystemID3'),
(4, 'SystemID4');

CREATE TABLE DataDump
(
  ddID int identity primary key,
  ddSystemID varchar(75),
  ddEnd datetime
);

INSERT INTO DataDump (ddSystemID, ddEnd)
VALUES
('SystemID1', '10-01-2013'),
('SystemID2', '10-01-2014'),
('SystemID3', '10-01-2015'),
('SystemID4', '10-01-2016');

Concept using an existing query

I have done something similar to find the number of distinct customers who bought something across a range of years, and I modified it to use your concept of a year. The variables you supply are the start day and start month of the year, plus the start year and end year.

Technically there is a way to avoid using a loop, but the loop is very clear, and since you can't go past the year 9999 it only ever runs a bounded number of times, so I don't think clever code to avoid it is worth it.

Tips for speeding up the query

Also, when matching dates, make sure you compare the date column directly against date values rather than against a function applied to the column. Wrapping the column in a function means the function runs for every row and makes any index on the date column useless (and there should be one). To build your target dates, apply DATEADD to zero (the 1900-01-01 base date), adding the year minus 1900, the month minus one, and the day minus one.
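As a small illustration of the date-building tip, here is a minimal sketch that needs no tables. The variable names (@year, @month, @day, @fiscalStart) are only for illustration, and it uses DECLARE followed by SET so that it should also run on SQL Server 2005, which does not allow initializing a variable in the DECLARE:

```sql
-- Parts of the target date (example values only).
DECLARE @year INT, @month INT, @day INT;
SET @year = 2013;
SET @month = 6;
SET @day = 1;

-- 0 converts to the datetime base date 1900-01-01, so add the offsets from it:
-- (year - 1900) years, (month - 1) months, (day - 1) days.
DECLARE @fiscalStart DATETIME;
SET @fiscalStart = DATEADD(day, @day - 1,
                     DATEADD(month, @month - 1,
                       DATEADD(year, @year - 1900, 0)));

-- The start of the following fiscal year is one more year on top of that.
SELECT @fiscalStart                    AS FiscalStart,
       DATEADD(year, 1, @fiscalStart)  AS NextFiscalStart;
```

Because the result is a real datetime value, it can be compared directly with a datetime column without any conversion on the column side.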

Then self join the table of per-year counts where the years form a valid range (i.e. the lower year is less than the upper year) and use a subquery to sum the counts over that range. Since you want a cumulative total from the first year to the last, restrict the lower bound to the first year.

The result will be missing the first year, because on its own it does not qualify as a range under that definition. To fix this, do a UNION ALL against the temp table of counts to add the missing year and the number of distinct values in it.

DECLARE @yearStartMonth INT = 6, @yearStartDay INT = 1
DECLARE @yearStart INT = 2008, @yearEnd INT  = 2012

DECLARE @firstYearStart DATE = 
        DATEADD(day,@yearStartDay-1, 
           DATEADD(month, @yearStartMonth-1,
             DATEADD(year, @yearStart- 1900,0)))

DECLARE @lastYearEnd DATE = 
        DATEADD(day, @yearStartDay-2,
            DATEADD(month, @yearStartMonth-1,
               DATEADD(year, @yearEnd -1900,0)))    

DECLARE @firstdayofcurrentyear DATE = @firstYearStart 
DECLARE @lastdayofcurrentyear DATE = DATEADD(day,-1,DATEADD(year,1,@firstdayofcurrentyear))
DECLARE @yearnumber INT = YEAR(@firstdayofcurrentyear)

DECLARE @tempTableYearBounds TABLE
(
    startDate DATE NOT NULL,
    endDate DATE NOT NULL,
    YearNumber INT NOT NULL
)

WHILE @firstdayofcurrentyear < @lastYearEnd
BEGIN
INSERT INTO @tempTableYearBounds 
VALUES(@firstdayofcurrentyear,@lastdayofcurrentyear,@yearNumber)
SET @firstdayofcurrentyear = DATEADD(year,1,@firstdayofcurrentyear)
SET @lastdayofcurrentyear = DATEADD(year,1,@lastdayofcurrentyear)
SET @yearNumber = @yearNumber + 1
END



DECLARE @tempTableCustomerCount TABLE
(
    [Year] INT NOT NULL,
    [CustomerCount] INT NOT NULL
)


INSERT INTO @tempTableCustomerCount
SELECT 
YearNumber as [Year],
COUNT(DISTINCT CustomerNumber) as CustomerCount
FROM Ticket
     JOIN @tempTableYearBounds ON
          TicketDate >= startDate AND TicketDate <=endDate
GROUP BY YearNumber


SELECT * FROM(
SELECT t2.Year as [Year], 
(SELECT 
    SUM(CustomerCount) 
    FROM @tempTableCustomerCount 
    WHERE Year>=t1.Year 
         AND Year <=t2.Year) AS CustomerCount
FROM @tempTableCustomerCount t1 JOIN @tempTableCustomerCount t2
    ON t1.Year < t2.Year
WHERE t1.Year = @yearStart
UNION
SELECT [Year], [CustomerCount] 
FROM @tempTableCustomerCount
WHERE [YEAR] = @yearStart
) tt
ORDER BY tt.Year

It isn't efficient, but the temp table you end up dealing with is so small that I don't think it really matters, and it is a lot more versatile than the method you are using.

Update: I updated the query to produce the result you wanted against my data set. I was mainly testing whether this was faster. It was, by 10 seconds (from 12 seconds to 2 seconds), but the data set I am dealing with is relatively small.

Using your data

I changed the tables you gave to temp tables so they didn't affect my environment, and I removed the foreign key because foreign keys are not supported on temp tables. The logic is the same as in the example above, just adapted to your data set.

DECLARE @startYear INT = 2013, @endYear INT = 2016
DECLARE @yearStartMonth INT = 10 , @yearStartDay INT = 1
DECLARE @startDate DATETIME = DATEADD(day,@yearStartDay-1,
                                 DATEADD(month, @yearStartMonth-1,
                                     DATEADD(year,@startYear-1900,0)))
DECLARE @endDate DATETIME = DATEADD(day,@yearStartDay-1,
                                DATEADD(month,@yearStartMonth-1,
                                    DATEADD(year,@endYear-1899,0)))

DECLARE @tempDateRangeTable TABLE
(
    [Year] INT NOT NULL,
    StartDate DATETIME NOT NULL, 
    EndDate DATETIME NOT NULL
)

DECLARE @currentDate DATETIME = @startDate
WHILE @currentDate < @endDate
BEGIN
DECLARE @nextDate DATETIME  = DATEADD(YEAR, 1, @currentDate)
INSERT INTO @tempDateRangeTable(Year,StartDate,EndDate)
VALUES(YEAR(@currentDate),@currentDate,@nextDate)
SET @currentDate = @nextDate
END

CREATE TABLE Users
(
  uID int identity primary key, 
  uFirstName varchar(75), 
  uLastName varchar(75)
);

INSERT INTO Users (uFirstName, uLastName)
VALUES
('User1', 'User1'),
('User2', 'User2'),
('User3', 'User3'),
('User4', 'User4');

CREATE TABLE UserDataIDMatch
(
  udimID int identity primary key,
  udim_FK_uID int foreign key references Users(uID),
  udimUserSystemID varchar(75)
);

INSERT INTO UserDataIDMatch (udim_FK_uID, udimUserSystemID)
VALUES
(1, 'SystemID1'),
(2, 'SystemID2'),
(3, 'SystemID3'),
(4, 'SystemID4');

CREATE TABLE DataDump
(
  ddID int identity primary key,
  ddSystemID varchar(75),
  ddEnd datetime
);   


INSERT INTO DataDump (ddSystemID, ddEnd)
VALUES
('SystemID1', '10-01-2013'),
('SystemID2', '10-01-2014'),
('SystemID3', '10-01-2015'),
('SystemID4', '10-01-2016');


DECLARE @tempIndividCount TABLE
(
    [Year] INT NOT NULL,
    UserCount INT NOT NULL
)


-- no longer need to filter out other because you are using an
--inclusion statement rather than an exclusion one, this will 
--also make your query faster (when using real tables not temp ones)
INSERT INTO @tempIndividCount(Year,UserCount)
SELECT tdr.Year, COUNT(DISTINCT UId) FROM
    Users u JOIN UserDataIDMatch um
       ON um.udim_FK_uID  = u.uID
    JOIN DataDump dd ON
       um.udimUserSystemID = dd.ddSystemID
    JOIN @tempDateRangeTable tdr ON
        dd.ddEnd >= tdr.StartDate AND dd.ddEnd < tdr.EndDate
GROUP BY tdr.Year

-- will show you your result
SELECT * FROM @tempIndividCount

--add any ranges that did not have an entry but were in your range
--can easily remove this by taking this part out.
INSERT INTO @tempIndividCount
SELECT t1.Year,0 FROM
    @tempDateRangeTable t1 LEFT OUTER JOIN @tempIndividCount t2
        ON  t1.Year = t2.Year
    WHERE t2.Year IS NULL


SELECT YearNumber,UserCount FROM (
SELECT 'Year'+CAST(((t2.Year-t1.Year)+1) AS VARCHAR(10)) [YearNumber] ,t2.Year,(
    SELECT SUM(UserCount) 
    FROM @tempIndividCount 
    WHERE Year >= t1.Year AND Year <=t2.Year 
    ) AS UserCount 
FROM @tempIndividCount t1 
    JOIN @tempIndividCount t2
        ON t1.Year < t2.Year
WHERE t1.Year = @startYear
UNION ALL 
--add the missing first year, union it to include the value
SELECT 'Year1',Year, UserCount FROM @tempIndividCount
WHERE Year = @startYear) tt
ORDER BY tt.Year

Benefits over using a WHEN CASE based approach

More Robust

You do not need to work out the start and end dates of each year explicitly; just as with a logical year, you only need to know the start and end date. You can easily change what you are looking for with some simple modifications (say you want all 2-year or 3-year ranges).

Will be faster if the database is indexed properly

Since you are comparing values of the same data type, the query can use the indexes that should exist on the date columns in the database.
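A minimal sketch of that difference, assuming the DataDump table from the question. The index name IX_DataDump_ddEnd is hypothetical and not part of the original schema, and whether the optimizer actually seeks on it depends on your data:

```sql
-- Hypothetical index on the date column (the name is an assumption).
CREATE INDEX IX_DataDump_ddEnd ON DataDump (ddEnd);

-- Function applied to the column: YEAR() has to be evaluated for each row,
-- so the index on ddEnd generally cannot be used for a seek.
SELECT COUNT(*) AS RowsIn2013
FROM DataDump
WHERE YEAR(ddEnd) = 2013;

-- Column compared directly against datetime boundaries: the same rows,
-- but expressed as a range that an index seek can satisfy.
DECLARE @from DATETIME, @to DATETIME;
SET @from = '20130101';
SET @to   = '20140101';

SELECT COUNT(*) AS RowsIn2013
FROM DataDump
WHERE ddEnd >= @from
  AND ddEnd <  @to;
```

The half-open range (>= start, < next start) also avoids missing rows with a time component on the last day, which BETWEEN with a date-only upper bound would drop.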

Cons

More Complicated

The query is a lot more complicated to follow. Even though it is more robust, there is a lot of extra logic in the actual query.

In some circumstances it will not give a good boost to execution time

If the data set is very small, or the number of dates being compared isn't significant, this may not save enough time to be worth it.

Unless I'm missing something, you just want to know how many records there are where the date is less than or equal to the current fiscal year.

DECLARE @YearOneStart DATETIME, @YearOneEnd DATETIME,
        @YearTwoStart DATETIME, @YearTwoEnd DATETIME,
        @YearThreeStart DATETIME, @YearThreeEnd DATETIME,
        @YearFourStart DATETIME, @YearFourEnd DATETIME

SELECT @YearOneStart = '06/01/2013', @YearOneEnd = '05/31/2014',
        @YearTwoStart = '06/01/2014', @YearTwoEnd = '05/31/2015',
        @YearThreeStart = '06/01/2015', @YearThreeEnd = '05/31/2016',
        @YearFourStart = '06/01/2016', @YearFourEnd = '05/31/2017'

;WITH cte AS 
(
      SELECT u.uID AS UserID, 
      CASE 
        WHEN dd.ddEnd BETWEEN @yearOneStart AND @yearOneEnd THEN 'Year1'
        WHEN dd.ddEnd BETWEEN @yearTwoStart AND @yearTwoEnd THEN 'Year2'
        WHEN dd.ddEnd BETWEEN @yearThreeStart AND @yearThreeEnd THEN 'Year3'
        WHEN dd.ddEnd BETWEEN @yearFourStart AND @yearFourEnd THEN 'Year4'
        ELSE 'Other'
      END AS [Year]
      FROM Users AS u   
      INNER JOIN UserDataIDMatch AS udim
        ON u.uID = udim.udim_FK_uID
      INNER JOIN DataDump AS dd
        ON udim.udimUserSystemID = dd.ddSystemID
)

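SELECT 
    DISTINCT [Year],
    -- count each user once, and keep rows labelled 'Other' out of the totals
    (SELECT COUNT(DISTINCT cteInner.UserID)
     FROM cte cteInner
     WHERE cteInner.[Year] <> 'Other'
       AND cteInner.[Year] <= cteMain.[Year]) AS UserCount
FROM cte cteMain
WHERE cteMain.[Year] <> 'Other'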

In SQL Server, once a WHEN inside a CASE matches, evaluation stops and the remaining WHEN clauses are not evaluated. Hence you can't accumulate that way.

If I understand you correctly, this would show your results.

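-- Example boundaries (assumed values; adjust to your fiscal calendar)
DECLARE @FiscalYearStart DATETIME, @FiscalYearEnd1 DATETIME,
        @FiscalYearEnd2 DATETIME, @FiscalYearEnd3 DATETIME

SELECT @FiscalYearStart = '06/01/2013', @FiscalYearEnd1 = '05/31/2014',
       @FiscalYearEnd2 = '05/31/2015', @FiscalYearEnd3 = '05/31/2016'

;WITH cte AS 
(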
    SELECT dd.ddEnd [dateEnd], u.uID AS UserID
    FROM Users AS u     
    INNER JOIN UserDataIDMatch AS udim
        ON u.uID = udim.udim_FK_uID
    INNER JOIN DataDump AS dd
        ON udim.udimUserSystemID = dd.ddSystemID
    WHERE ddEnd BETWEEN @FiscalYearStart AND @FiscalYearEnd3
) 
-- No GROUP BY is needed: an aggregate without GROUP BY returns a single row,
-- and SQL Server rejects grouping by a variable.
SELECT datepart(year, @FiscalYearStart) AS [Year], COUNT(DISTINCT UserID) AS CntUserID
FROM cte
WHERE dateEnd BETWEEN  @FiscalYearStart AND @FiscalYearEnd1
UNION 
SELECT datepart(year, @FiscalYearEnd1) AS [Year], COUNT(DISTINCT UserID) AS CntUserID
FROM cte
WHERE dateEnd BETWEEN  @FiscalYearStart AND @FiscalYearEnd2
UNION 
SELECT datepart(year, @FiscalYearEnd2) AS [Year], COUNT(DISTINCT UserID) AS CntUserID
FROM cte
WHERE dateEnd BETWEEN  @FiscalYearStart AND @FiscalYearEnd3
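Because a CASE puts each row under exactly one label, one way around it is to join each row to every fiscal year that ends after it, so a single row can count toward several years. The following is a sketch under assumptions rather than any of the posted answers' methods: it uses the Users, UserDataIDMatch and DataDump tables from the question, assumes fiscal years start on June 1, and the names YearLabel, EndExclusive and CumulativeUsers are made up for illustration. It avoids features newer than SQL Server 2005.

```sql
-- Start of the first fiscal year (example value, an assumption).
DECLARE @firstStart DATETIME;
SET @firstStart = '20130601';

SELECT yb.YearLabel,
       COUNT(DISTINCT u.uID) AS CumulativeUsers   -- each user counted once per label
FROM (
    -- One row per fiscal year, holding the (exclusive) end of that year.
    SELECT 'Year1' AS YearLabel, DATEADD(year, 1, @firstStart) AS EndExclusive
    UNION ALL SELECT 'Year2', DATEADD(year, 2, @firstStart)
    UNION ALL SELECT 'Year3', DATEADD(year, 3, @firstStart)
    UNION ALL SELECT 'Year4', DATEADD(year, 4, @firstStart)
) AS yb
INNER JOIN DataDump AS dd
    -- A row matches every year whose end lies after it, which is what
    -- makes the count cumulative.
    ON dd.ddEnd >= @firstStart
   AND dd.ddEnd <  yb.EndExclusive
INNER JOIN UserDataIDMatch AS udim
    ON udim.udimUserSystemID = dd.ddSystemID
INNER JOIN Users AS u
    ON u.uID = udim.udim_FK_uID
GROUP BY yb.YearLabel
ORDER BY yb.YearLabel;
```

With the four sample rows from the question, and assuming the date literals are read as month-day-year, this should give 1, 2, 3 and 4 for Year1 through Year4. Since the DISTINCT is applied over the whole cumulative range, a user who appears in several years is counted only once, which summing per-year distinct counts would not guarantee.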

