Window functions to count distinct records

The query below is based on a complicated view, and the view works as I want it to (I'm not including the view because I don't think it helps with the question at hand). What I can't get right is the drugCountsInFamilies column. I need it to show the number of distinct drugName values for each drug family. You can see from the first screencap that there are three different H3A rows, so drugCountsInFamilies for H3A should be 3 (there are three different H3A drugs).

[first screenshot: image not available]

You can see from the second screen cap that the drugCountsInFamilies value in the first screen cap is actually the number of rows on which each drug name is listed.

[second screenshot: image not available]

Below is my query, with a comment on the part that is incorrect:

select distinct
     rx.patid
    ,d2.fillDate
    ,d2.scriptEndDate
    ,rx.drugName
    ,rx.drugClass
    --the line directly below is the one that I can't figure out why it's wrong
    ,COUNT(rx.drugClass) over(partition by rx.patid,rx.drugclass,rx.drugname) as drugCountsInFamilies
from 
(
select 
    ROW_NUMBER() over(partition by d.patid order by d.patid,d.uniquedrugsintimeframe desc) as rn
    ,d.patid
    ,d.fillDate
    ,d.scriptEndDate
    ,d.uniqueDrugsInTimeFrame
    from DrugsPerTimeFrame as d
)d2
inner join rx on rx.patid = d2.patid
inner join DrugTable as dt on dt.drugClass=rx.drugClass
where d2.rn=1 and rx.fillDate between d2.fillDate and d2.scriptEndDate
and dt.drugClass in ('h3a','h6h','h4b','h2f','h2s','j7c','h2e')
order by rx.patid

SSMS gets mad if I try to add a DISTINCT inside the count(rx.drugClass) expression. Can it be done using window functions?

I came across this question while searching for a way to count distinct values. In searching for an answer I also came across this post (see the last comment). I've tested the SQL from it and used it; it works really well for me, so I figured I would provide another solution here.

In summary, use DENSE_RANK() with PARTITION BY on the grouping columns and ORDER BY on the columns to count, once ASC and once DESC, then add the two ranks and subtract 1:

DENSE_RANK() OVER (PARTITION BY drugClass ORDER BY drugName ASC) +
DENSE_RANK() OVER (PARTITION BY drugClass ORDER BY drugName DESC) - 1 AS drugCountsInFamilies

I use this as a template for myself:

DENSE_RANK() OVER (PARTITION BY PartitionByFields ORDER BY OrderByFields ASC ) +
DENSE_RANK() OVER (PARTITION BY PartitionByFields ORDER BY OrderByFields DESC) - 1 AS DistinctCount
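
To see why the two ranks add up to the distinct count, here is a minimal sketch on made-up data. The rows in the VALUES list are hypothetical, and partitioning by patid as well as drugClass is an assumption based on the original question. Within a partition, a row's ascending rank plus its descending rank always equals the number of distinct values plus 1.

```sql
-- Hypothetical sample data: patient 1 has three distinct h3a drugs,
-- one of which (drugA) appears on two rows, plus one h6h drug.
WITH rx (patid, drugClass, drugName) AS (
    SELECT * FROM (VALUES
        (1, 'h3a', 'drugA'),
        (1, 'h3a', 'drugA'),
        (1, 'h3a', 'drugB'),
        (1, 'h3a', 'drugC'),
        (1, 'h6h', 'drugD')
    ) AS v (patid, drugClass, drugName)
)
SELECT patid, drugClass, drugName,
       -- ascending rank + descending rank - 1 = number of distinct drugName values
       DENSE_RANK() OVER (PARTITION BY patid, drugClass ORDER BY drugName ASC)
     + DENSE_RANK() OVER (PARTITION BY patid, drugClass ORDER BY drugName DESC)
     - 1 AS drugCountsInFamilies
FROM rx
ORDER BY patid, drugClass, drugName;
```

Every h3a row should come back with 3 and the h6h row with 1. One caveat: DENSE_RANK() ranks NULL like any other value, so if the counted column can be NULL the result is one higher than COUNT(DISTINCT ...) would give.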

I hope this helps!

Doing a count(distinct) as a window function requires a trick. Several levels of tricks, actually.

Because your request as written is actually trivial (a distinct count of rx.drugClass would always be 1, because rx.drugClass is in the partitioning clause), I will make an assumption. Let's say you want to count the number of unique drug classes per patid.

If so, do a row_number() partitioned by patid and drugClass. When this is 1 within a patid, a new drugClass is starting. Create a flag that is 1 in this case and 0 in all other cases.

Then you can simply do a sum with a partitioning clause to get the number of distinct values.

The query (after formatting it so I can read it) looks like:

select rx.patid, d2.fillDate, d2.scriptEndDate, rx.drugName, rx.drugClass,
       SUM(IsFirstRowInGroup) over (partition by rx.patid) as NumDrugCount
from (select distinct rx.patid, d2.fillDate, d2.scriptEndDate, rx.drugName, rx.drugClass,
             (case when 1 = ROW_NUMBER() over (partition by rx.drugClass, rx.patid order by (select NULL))
                   then 1 else 0
              end) as IsFirstRowInGroup
      from (select ROW_NUMBER() over(partition by d.patid order by d.patid,d.uniquedrugsintimeframe desc) as rn, 
                   d.patid, d.fillDate, d.scriptEndDate, d.uniqueDrugsInTimeFrame
            from DrugsPerTimeFrame as d
           ) d2 inner join
           rx
           on rx.patid = d2.patid inner join
           DrugTable dt
           on dt.drugClass = rx.drugClass
      where d2.rn=1 and rx.fillDate between d2.fillDate and d2.scriptEndDate and
            dt.drugClass in ('h3a','h6h','h4b','h2f','h2s','j7c','h2e')
     ) t
order by patid
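
The same flag-and-sum idea can be tried in isolation with a small sketch. The data below is hypothetical, and the sketch is adapted to what the question asked for (distinct drugName per patid and drugClass) rather than distinct drug classes per patid, so the partitioning columns differ from the query above.

```sql
-- Hypothetical sample data; drugA is listed twice for the same patient and class.
WITH rx (patid, drugClass, drugName) AS (
    SELECT * FROM (VALUES
        (1, 'h3a', 'drugA'),
        (1, 'h3a', 'drugA'),
        (1, 'h3a', 'drugB'),
        (1, 'h3a', 'drugC'),
        (1, 'h6h', 'drugD')
    ) AS v (patid, drugClass, drugName)
),
flagged AS (
    SELECT patid, drugClass, drugName,
           -- 1 on the first row of each (patid, drugClass, drugName) group, 0 on repeats
           CASE WHEN ROW_NUMBER() OVER (PARTITION BY patid, drugClass, drugName
                                        ORDER BY (SELECT NULL)) = 1
                THEN 1 ELSE 0
           END AS IsFirstRowInGroup
    FROM rx
)
SELECT patid, drugClass, drugName,
       -- summing the flags counts each distinct drugName exactly once
       SUM(IsFirstRowInGroup) OVER (PARTITION BY patid, drugClass) AS drugCountsInFamilies
FROM flagged
ORDER BY patid, drugClass, drugName;
```

The flag is computed in its own CTE, before any de-duplication, so that each distinct name contributes exactly one 1 to the sum.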

I think what you were trying to do is this, as a window function:

COUNT(DISTINCT rx.drugName) over(partition by rx.patid,rx.drugclass) as drugCountsInFamilies

which SQL Server complains about. But you can do this instead:

SELECT 
rx.patid
, rx.drugName
, rx.drugClass
, (SELECT COUNT(DISTINCT rx2.drugName) FROM rx rx2 WHERE rx2.drugClass = rx.DrugClass AND rx2.patid = rx.patid) As drugCountsInFamilies
FROM rx
...

If the table is large, it's best to put an index on one of the columns (e.g. patid) so that the nested query doesn't consume a lot of resources.
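
A rough sketch of the subquery approach together with an index, using a temp table as a stand-in for rx. The table contents and the index name are hypothetical, and the composite index on all three columns is an assumption that goes a bit further than the single-column index suggested above. Whether it actually helps depends on the real data and query plan.

```sql
-- Hypothetical temp table standing in for rx.
CREATE TABLE #rx (patid INT, drugClass VARCHAR(10), drugName VARCHAR(50));

INSERT INTO #rx (patid, drugClass, drugName) VALUES
    (1, 'h3a', 'drugA'),
    (1, 'h3a', 'drugA'),
    (1, 'h3a', 'drugB'),
    (1, 'h3a', 'drugC'),
    (1, 'h6h', 'drugD');

-- Index name is made up; the key columns match the correlated subquery's
-- WHERE clause and the column being counted.
CREATE INDEX IX_rx_patid_class_name ON #rx (patid, drugClass, drugName);

SELECT DISTINCT
       r.patid, r.drugClass, r.drugName,
       (SELECT COUNT(DISTINCT r2.drugName)
        FROM #rx AS r2
        WHERE r2.patid = r.patid
          AND r2.drugClass = r.drugClass) AS drugCountsInFamilies
FROM #rx AS r
ORDER BY r.patid, r.drugClass, r.drugName;

DROP TABLE #rx;
```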

select max(dense_rank() over (order by name desc partition by family)) over (partition by family) 

Could this work?
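
As written, that expression would not parse in SQL Server. Inside OVER, PARTITION BY has to come before ORDER BY, and a window function cannot be used as the argument of another window function. The idea itself can be made to work by computing the rank in an inner query first. A minimal sketch, with hypothetical table data and the column names family and name taken from the snippet above:

```sql
-- Hypothetical sample data.
WITH drugs (family, name) AS (
    SELECT * FROM (VALUES
        ('h3a', 'drugA'),
        ('h3a', 'drugA'),
        ('h3a', 'drugB'),
        ('h3a', 'drugC'),
        ('h6h', 'drugD')
    ) AS v (family, name)
),
ranked AS (
    SELECT family, name,
           -- step 1: rank the names within each family
           DENSE_RANK() OVER (PARTITION BY family ORDER BY name) AS rnk
    FROM drugs
)
SELECT family, name,
       -- step 2: the highest rank in a family is its number of distinct names
       MAX(rnk) OVER (PARTITION BY family) AS distinctNames
FROM ranked
ORDER BY family, name;
```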

Why would something like this not work?

SELECT 
   IDCol_1
  ,IDCol_2
  ,Count(*) Over(Partition By IDCol_1, IDCol_2 order by IDCol_1) as numDistinct
FROM Table_1
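
One way to see the difference is to run both expressions side by side on a few rows. The data is hypothetical, and reading the goal as "distinct IDCol_2 values per IDCol_1" is an assumption. COUNT(*) over a partition that already includes both columns reports how many times each (IDCol_1, IDCol_2) pair is repeated, which is the same symptom as in the question.

```sql
-- Hypothetical sample data: the pair (1, 'a') appears twice.
WITH Table_1 (IDCol_1, IDCol_2) AS (
    SELECT * FROM (VALUES
        (1, 'a'),
        (1, 'a'),
        (1, 'b'),
        (2, 'c')
    ) AS v (IDCol_1, IDCol_2)
)
SELECT IDCol_1, IDCol_2,
       -- counts rows sharing the same (IDCol_1, IDCol_2) pair: 2, 2, 1, 1
       COUNT(*) OVER (PARTITION BY IDCol_1, IDCol_2) AS rowsInPartition,
       -- counts distinct IDCol_2 values per IDCol_1: 2, 2, 2, 1
       DENSE_RANK() OVER (PARTITION BY IDCol_1 ORDER BY IDCol_2 ASC)
     + DENSE_RANK() OVER (PARTITION BY IDCol_1 ORDER BY IDCol_2 DESC)
     - 1 AS distinctIDCol_2
FROM Table_1
ORDER BY IDCol_1, IDCol_2;
```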
