Window functions to count distinct records

The query below is based on a complicated view, and the view works as I want it to (I'm not going to include the view because I don't think it will help with the question at hand). What I can't get right is the drugCountsInFamilies column. I need it to show me the number of distinct drugNames for each drug family. You can see from the first screencap that there are three different H3A rows. The drugCountsInFamilies for H3A should be 3 (there are three different H3A drugs).

[Image: first screencap]

The second screencap shows what is happening: the drugCountsInFamilies value in the first screencap is counting the number of rows on which each drug name appears.

[Image: second screencap]

Below is my query, with a comment on the part that is incorrect:

select distinct
     rx.patid
    ,d2.fillDate
    ,d2.scriptEndDate
    ,rx.drugName
    ,rx.drugClass
    --the line directly below is the one that I can't figure out why it's wrong
    ,COUNT(rx.drugClass) over(partition by rx.patid,rx.drugclass,rx.drugname) as drugCountsInFamilies
from 
(
select 
    ROW_NUMBER() over(partition by d.patid order by d.patid,d.uniquedrugsintimeframe desc) as rn
    ,d.patid
    ,d.fillDate
    ,d.scriptEndDate
    ,d.uniqueDrugsInTimeFrame
    from DrugsPerTimeFrame as d
)d2
inner join rx on rx.patid = d2.patid
inner join DrugTable as dt on dt.drugClass=rx.drugClass
where d2.rn=1 and rx.fillDate between d2.fillDate and d2.scriptEndDate
and dt.drugClass in ('h3a','h6h','h4b','h2f','h2s','j7c','h2e')
order by rx.patid

SSMS gets mad if I try to add a distinct to the count(rx.drugClass) clause. Can it be done using window functions?

I came across this question while searching for a solution to my own problem of counting distinct values. In searching for an answer I came across this post (see the last comment). I've tested it and used the SQL. It works really well for me, so I figured I would provide another solution here.

In summary, use DENSE_RANK() with PARTITION BY on the grouped columns and ORDER BY on the columns to count, once ASC and once DESC:

DENSE_RANK() OVER (PARTITION BY drugClass ORDER BY drugName ASC) +
DENSE_RANK() OVER (PARTITION BY drugClass ORDER BY drugName DESC) - 1 AS drugCountsInFamilies

I use this as a template for myself.

DENSE_RANK() OVER (PARTITION BY PartitionByFields ORDER BY OrderByFields ASC ) +
DENSE_RANK() OVER (PARTITION BY PartitionByFields ORDER BY OrderByFields DESC) - 1 AS DistinctCount
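
To see the template in action, here is a minimal, self-contained sketch. The sample rows are made up for illustration, and partitioning by both patid and drugClass is an assumption based on the question (the snippet above partitions by drugClass only). The ascending rank plus the descending rank of any row always equals the number of distinct values plus one, which is why subtracting 1 gives the distinct count.

```sql
-- Hypothetical sample data standing in for the rx table from the question.
WITH rx (patid, drugClass, drugName) AS (
    SELECT * FROM (VALUES
        (1, 'H3A', 'drugA'),
        (1, 'H3A', 'drugA'),   -- duplicate row for the same drug
        (1, 'H3A', 'drugB'),
        (1, 'H3A', 'drugC'),
        (1, 'H6H', 'drugD')
    ) AS v (patid, drugClass, drugName)
)
SELECT patid, drugClass, drugName,
       -- ascending rank + descending rank - 1 = number of distinct drugName values
       DENSE_RANK() OVER (PARTITION BY patid, drugClass ORDER BY drugName ASC)
     + DENSE_RANK() OVER (PARTITION BY patid, drugClass ORDER BY drugName DESC)
     - 1 AS drugCountsInFamilies
FROM rx
ORDER BY patid, drugClass, drugName;
-- Every H3A row gets 3; the H6H row gets 1.
```

One caveat: DENSE_RANK() ranks NULL like any other value, so a NULL drugName would be counted as one distinct value, whereas COUNT(DISTINCT ...) ignores NULLs.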

I hope this helps!

Doing a count(distinct) as a window function requires a trick. Several levels of tricks, actually.

Because your request as written is trivial (a distinct count of rx.drugClass is always 1 when rx.drugClass is in the partitioning clause), I will make an assumption. Let's say you want to count the number of unique drug classes per patid.

If so, do a row_number() partitioned by patid and drugClass. When this is 1 within a patid, a new drugClass is starting. Create a flag that is 1 in this case and 0 in all other cases.

Then, you can simply do a sum with a partitioning clause to get the number of distinct values.

The query (after formatting it so I can read it), looks like:

-- the outer query can only see the derived table t, so its columns are qualified with t
select t.patid, t.fillDate, t.scriptEndDate, t.drugName, t.drugClass,
       SUM(t.IsFirstRowInGroup) over (partition by t.patid) as NumDrugCount
from (select distinct rx.patid, d2.fillDate, d2.scriptEndDate, rx.drugName, rx.drugClass,
             (case when 1 = ROW_NUMBER() over (partition by rx.drugClass, rx.patid order by (select NULL))
                   then 1 else 0
              end) as IsFirstRowInGroup
      from (select ROW_NUMBER() over(partition by d.patid order by d.patid,d.uniquedrugsintimeframe desc) as rn, 
                   d.patid, d.fillDate, d.scriptEndDate, d.uniqueDrugsInTimeFrame
            from DrugsPerTimeFrame as d
           ) d2 inner join
           rx
           on rx.patid = d2.patid inner join
           DrugTable dt
           on dt.drugClass = rx.drugClass
      where d2.rn=1 and rx.fillDate between d2.fillDate and d2.scriptEndDate and
            dt.drugClass in ('h3a','h6h','h4b','h2f','h2s','j7c','h2e')
     ) t
order by patid
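
The same flag-and-sum idea can be adapted to what the question originally asked for, distinct drug names per patient and drug class. The following is only a sketch on made-up sample rows (the CTE names rx and flagged and the data are hypothetical), not the full query against the real view:

```sql
-- Hypothetical sample data standing in for the rx table.
WITH rx (patid, drugClass, drugName) AS (
    SELECT * FROM (VALUES
        (1, 'H3A', 'drugA'),
        (1, 'H3A', 'drugA'),
        (1, 'H3A', 'drugB'),
        (1, 'H3A', 'drugC'),
        (1, 'H6H', 'drugD')
    ) AS v (patid, drugClass, drugName)
),
flagged AS (
    SELECT patid, drugClass, drugName,
           -- 1 on exactly one row per (patid, drugClass, drugName), 0 on its duplicates
           CASE WHEN ROW_NUMBER() OVER (PARTITION BY patid, drugClass, drugName
                                        ORDER BY (SELECT NULL)) = 1
                THEN 1 ELSE 0
           END AS IsFirstRowInGroup
    FROM rx
)
SELECT patid, drugClass, drugName,
       -- summing the flags counts each distinct drugName once per drug class
       SUM(IsFirstRowInGroup) OVER (PARTITION BY patid, drugClass) AS drugCountsInFamilies
FROM flagged
ORDER BY patid, drugClass, drugName;
-- Every H3A row gets 3; the H6H row gets 1.
```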

I think what you were trying to do is this as a window function:

COUNT(DISTINCT rx.drugName) over(partition by rx.patid,rx.drugclass) as drugCountsInFamilies

which SQL Server rejects, because DISTINCT is not allowed in a windowed aggregate. But you can do this instead:

SELECT 
rx.patid
, rx.drugName
, rx.drugClass
, (SELECT COUNT(DISTINCT rx2.drugName) FROM rx rx2 WHERE rx2.drugClass = rx.DrugClass AND rx2.patid = rx.patid) As drugCountsInFamilies
FROM rx
...

If the table is large, it's best to put an index on one of the columns used by the nested query (e.g. patid) so that it doesn't consume a lot of resources.

select max(dense_rank() over (order by name desc partition by family)) over (partition by family) 

Could this work?
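
As written, the expression above is unlikely to parse: PARTITION BY has to come before ORDER BY inside OVER, and SQL Server does not allow one window function nested directly inside another. The underlying idea (the highest dense rank in a partition equals its distinct count) does hold, though. A sketch with the ranking moved into a derived table, using a made-up drugs table with the family and name columns from the snippet:

```sql
-- Hypothetical sample data; family and name mirror the columns in the snippet above.
WITH drugs (family, name) AS (
    SELECT * FROM (VALUES
        ('H3A', 'drugA'),
        ('H3A', 'drugA'),
        ('H3A', 'drugB'),
        ('H3A', 'drugC'),
        ('H6H', 'drugD')
    ) AS v (family, name)
)
SELECT family, name,
       -- the largest dense rank in a family is its number of distinct names
       MAX(rnk) OVER (PARTITION BY family) AS distinctNames
FROM (
    SELECT family, name,
           DENSE_RANK() OVER (PARTITION BY family ORDER BY name) AS rnk
    FROM drugs
) AS ranked
ORDER BY family, name;
-- Every H3A row gets 3; the H6H row gets 1.
```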

Why would something like this not work?

SELECT 
   IDCol_1
  ,IDCol_2
  ,Count(*) Over(Partition By IDCol_1, IDCol_2 order by IDCol_1) as numDistinct
FROM Table_1
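
A likely reason: COUNT(*) over that partition counts rows, not distinct values, so it reports how many times each (IDCol_1, IDCol_2) pair is repeated. The sketch below uses made-up rows in place of Table_1 to show this next to a DENSE_RANK-based distinct count of IDCol_2 within each IDCol_1:

```sql
-- Hypothetical sample data standing in for Table_1.
WITH Table_1 (IDCol_1, IDCol_2) AS (
    SELECT * FROM (VALUES
        (1, 'a'),
        (1, 'a'),
        (1, 'b')
    ) AS v (IDCol_1, IDCol_2)
)
SELECT IDCol_1, IDCol_2,
       -- number of rows sharing this exact (IDCol_1, IDCol_2) pair
       COUNT(*) OVER (PARTITION BY IDCol_1, IDCol_2) AS rowsInPair,
       -- number of distinct IDCol_2 values within IDCol_1
       DENSE_RANK() OVER (PARTITION BY IDCol_1 ORDER BY IDCol_2 ASC)
     + DENSE_RANK() OVER (PARTITION BY IDCol_1 ORDER BY IDCol_2 DESC)
     - 1 AS numDistinct
FROM Table_1
ORDER BY IDCol_1, IDCol_2;
-- rowsInPair is 2, 2, 1; numDistinct is 2 on every row.
```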
