select count (distinct) returning incorrect count

Question

I'm trying to run a code that will summarise a list of clients based on age categories. I have the client ID and their ages, and am using a CASE WHEN to group the ages into buckets, then trying to count the client ids in each of those buckets.

For info - the Assisted Date field is just the date on which the client was served and this is just included in my testing to keep the results small - so I'm just focusing on Feb 2019 services that aren't Assessments; and the SIR ID is the location at which the service was provided.

select distinct 
    CASE
        when CD.Age between 0 and 5 then '0-5'
        when CD.Age between 6 and 11 then '6-11'
        when CD.Age between 12 and 14 then '12-14'
        when CD.Age between 15 and 17 then '15-17'
        when CD.Age between 18 and 24 then '18-24'
        when CD.Age between 25 and 54 then '25-54'
        when CD.Age between 55 and 64 then '55-64'
        when CD.Age > 65 then '65+'
        else 'Unknown'
    END AS Age_Group, 
count(distinct AP.Source_Individual_ID) as "Count"
from dm.Assistance_Provided AP, rpt.ClientsDemographics CD
        where CD.Source_Individual_ID = AP.Source_Individual_ID
        and AP.SIR_ID = '1909'
        and AP.Service_Provided <> 'Assessment'
        and year(AP.Assisted_Date) = '2019'
        and month(AP.Assisted_Date) = 2
group by CD.Age

If I run the code excluding the count(distinct) I end up with 17 rows of data. The 17 clients are all aged between 18-24 so the Age-Group for all clients come back as 18-24. So the Case When appears to be operating correctly.

However, if I run the code with the count(distinct), I end up with 3 rows of data. the Age Group 18-24 is listed 3 times and the Count as 2, 4, and 5 respectively for the 3 lines. I definitely have 17 different IDs in the list, but for some reason this is being reduced to a total of 11 when I add the count(distinct).

I've also tried removing the distinct from the select ie so

select 
    CASE
        when CD.Age between ....

If I do that I get 5 rows - still all with age group 18-24, but then with 2,4,5,4,2 respectively - so then back to my total of 17. But I'm not sure why I'm not just getting one line back : Age Group: 18-24; Count: 17

What am I doing wrong?

Answer 1

I suspect that you should actually be aggregating by Source_Individual , and then taking conditional counts everywhere:

SELECT
    CD.Source_Individual_ID,
    COUNT(CASE WHEN CD.Age BETWEEN 0 AND 5 THEN 1 END) AS [0-5],
    COUNT(CASE WHEN CD.Age BETWEEN 6 AND 11 THEN 1 END) AS [6-11],
    COUNT(CASE WHEN CD.Age BETWEEN 12 AND 14 THEN 1 END) AS [12-14],
    COUNT(CASE WHEN CD.Age BETWEEN 15 AND 17 THEN 1 END) AS [15-17],
    COUNT(CASE WHEN CD.Age BETWEEN 18 AND 24 THEN 1 END) AS [18-24],
    COUNT(CASE WHEN CD.Age BETWEEN 25 AND 54 THEN 1 END) AS [25-54],
    COUNT(CASE WHEN CD.Age BETWEEN 55 AND 64 THEN 1 END) AS [55-64]
    COUNT(CASE WHEN CD.Age > 65 THEN 1 END) AS [65+],
    COUNT(*) AS [Number of Clients Assisted]
FROM dm.Assistance_Provided AP 
INNER JOIN rpt.ClientsDemographics CD
    ON CD.Source_Individual_ID = AP.Source_Individual_ID
WHERE
    AP.SIR_ID = '1909' AND
    AP.Service_Provided <> 'Assessment' AND
    YEAR(AP.Assisted_Date) = 2019 AND
    MONTH(AP.Assisted_Date) = 2
GROUP BY
    CD.Source_Individual_ID;

Note that I have rewritten your query to use explicit, modern, inner joins, rather than the implicit joins you were using.

Answer 2

You need to put your case statement in your group by . As it is, since you group by CD.Age it will write a single row for every distinct age. In other words if you have two rows with Age = 12 and one row with Age = 13 you'll get two rows, both with 12-14 as their age group, but with a count of 2 for the first and 1 for the second. To make it more confusing, if you only had one row each, then because both rows are the same and you have the distinct clause it would only return 1 row with a count of 1.

If you put your case statement into the group by, then it will group by each distinct Age_Group instead. ie, it does the case transformation first, then groups by it.

So try:

select 
    CASE
        when CD.Age between 0 and 5 then '0-5'
        when CD.Age between 6 and 11 then '6-11'
        when CD.Age between 12 and 14 then '12-14'
        when CD.Age between 15 and 17 then '15-17'
        when CD.Age between 18 and 24 then '18-24'
        when CD.Age between 25 and 54 then '25-54'
        when CD.Age between 55 and 64 then '55-64'
        when CD.Age > 65 then '65+'
        else 'Unknown'
    END AS Age_Group, 
count(distinct AP.Source_Individual_ID) as "Number of Clients Assisted"
from dm.Assistance_Provided AP, rpt.ClientsDemographics CD
        where CD.Source_Individual_ID = AP.Source_Individual_ID
        and AP.SIR_ID = '1909'
        and AP.Service_Provided <> 'Assessment'
        and year(AP.Assisted_Date) = '2019'
        and month(AP.Assisted_Date) = 2
group by CASE
        when CD.Age between 0 and 5 then '0-5'
        when CD.Age between 6 and 11 then '6-11'
        when CD.Age between 12 and 14 then '12-14'
        when CD.Age between 15 and 17 then '15-17'
        when CD.Age between 18 and 24 then '18-24'
        when CD.Age between 25 and 54 then '25-54'
        when CD.Age between 55 and 64 then '55-64'
        when CD.Age > 65 then '65+'
        else 'Unknown'
    END

select count (distinct) returning incorrect count

Question

2 answers

solution1
1 2019-05-30 05:26:27

solution2
0 ACCPTED 2019-05-30 05:26:29

select count (distinct) returning incorrect count

Question

2 answers

solution1 1 2019-05-30 05:26:27

solution2 0 ACCPTED 2019-05-30 05:26:29

solution1
1 2019-05-30 05:26:27

solution2
0 ACCPTED 2019-05-30 05:26:29