t-sql GROUP BY with COUNT, and then include MAX from the COUNT

Suppose you had a table of "Cars" with hundreds of thousands of rows, and you wanted to do a GROUP BY:

SELECT   CarID
         , CarName
         , COUNT(*) AS Total
FROM     dbo.tbl_Cars
GROUP BY CarID
         , CarName

The grouping leaves you with a result akin to:

CarID       CarName    Total
1872        Olds       202,121   
547841      BMW        175,298
9877        Ford        10,241

All fine and well. My question, though, is what is the best way to get the Total and the MAX Total into one table, in terms of performance and clean coding, so you have a result like:

CarID       CarName    Total      Max Total
1872        Olds       202,121    202,121
547841      BMW        175,298    202,121
9877        Ford        10,241    202,121 

One approach would be to put the GROUP result into a temp table, and then get the MAX from the temp table into a local variable. But I'm wondering what the best way to do this would be.
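For reference, the temp-table approach described above might look roughly like the sketch below. The names #CarTotals and @MaxTotal are made up for illustration and are not part of the actual schema.

```sql
-- Sketch only: #CarTotals and @MaxTotal are illustrative names (assumptions).
IF OBJECT_ID('tempdb..#CarTotals') IS NOT NULL
    DROP TABLE #CarTotals;

-- Step 1: materialize the grouped counts once.
SELECT   CarID
         , CarName
         , COUNT(*) AS Total
INTO     #CarTotals
FROM     dbo.tbl_Cars
GROUP BY CarID
         , CarName;

-- Step 2: read the largest count into a local variable.
DECLARE @MaxTotal int;

SELECT @MaxTotal = MAX(Total)
FROM   #CarTotals;

-- Step 3: return every group alongside the overall maximum.
SELECT   CarID
         , CarName
         , Total
         , @MaxTotal AS [Max Total]
FROM     #CarTotals
ORDER BY Total DESC;
```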


UPDATE

The Common Table Expression seems the most elegant to write, yet similar to @EBarr, my limited testing indicates significantly slower performance. So I won't be going with the CTE.

As the link @EBarr has for the COMPUTE option indicates the feature is deprecated, that doesn't seem the best route, either.

The option of a local variable for the MAX value and the use of a temp table will likely be the route I go down, as I'm not aware of performance issues with it.

A bit more detail about my use case: it could probably end up being a series of other SO questions. But suffice it to say that I'm loading a large subset of data into a temp table (so a subset of tbl_Cars is going into #tbl_Cars, and even #tbl_Cars may be further filtered and have aggregations performed on it), because I have to perform multiple filtering and aggregation queries on it within a single stored proc that returns multiple result sets.


UPDATE 2

@EBarr's use of a windowed function is nice and short. Note to self: if using a RIGHT JOIN to an outer reference table, the COUNT() function should select a column from tbl_Cars, not '*'. COUNT(*) also counts the NULL-extended row that the outer join produces, so a machine with no matching cars would show 1 instead of 0.

SELECT       M.MachineID
             , M.MachineType
             , COUNT(C.CarID) AS Total
             , MAX(COUNT(C.CarID)) OVER() as MaxTotal
FROM         dbo.tbl_Cars C
RIGHT JOIN   dbo.tbl_Machines M
      ON     C.CarID = M.CarID
GROUP BY     M.MachineID
             , M.MachineType

In terms of speed, it seems fine, but at what point do you have to be worried about the number of reads?

Mechanically, there are a few ways to do this. You could use temp tables or a table variable. Another way is with nested queries and/or a CTE as @Aaron_Bertrand showed. A third way is to use WINDOWED FUNCTIONS such as...

SELECT    CarName,
          COUNT(*) as theCount,
          MAX(Count(*)) OVER(PARTITION BY 'foo') as MaxPerGroup
FROM      dbo.tbl_Cars
GROUP BY CarName
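The constant in PARTITION BY 'foo' only serves to put every row into a single partition. As a sketch of an equivalent form (assuming the same dbo.tbl_Cars table), an empty OVER() clause should give the same result, because the window then spans the whole grouped result set:

```sql
-- Same idea with an empty OVER(): the window is the entire grouped result.
SELECT    CarName,
          COUNT(*) AS theCount,
          MAX(COUNT(*)) OVER() AS MaxInAnyGroup
FROM      dbo.tbl_Cars
GROUP BY  CarName;
```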

A DISFAVORED (read: deprecated) fourth way is using the COMPUTE keyword as such...

SELECT   CarID, CarName, Count(*)
FROM     dbo.tbl_Cars
GROUP BY CarID, CarName 
COMPUTE MAX(Count(*))   

The COMPUTE keyword generates totals that appear as additional summary columns at the end of the result set (see this). In the query above you will actually see two record sets.

Fastest

Now, the next issue is what's the "best/fastest/easiest." I immediately think of an indexed view. As @Aaron gently reminded me, indexed views have all sorts of restrictions. The above strategy, however, allows you to create an indexed view on the SELECT...FROM...GROUP BY. Then, when selecting from the indexed view, apply the WINDOWED FUNCTION clause.
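A rough sketch of that strategy follows, under a few assumptions: the view and index names (dbo.vw_CarTotals, IX_vw_CarTotals) are invented for illustration, and the session uses the SET options that indexed views require (ANSI_NULLS, QUOTED_IDENTIFIER, and so on). An indexed view that contains GROUP BY has to include COUNT_BIG(*), which is why it replaces COUNT(*) here.

```sql
-- Hypothetical names: dbo.vw_CarTotals and IX_vw_CarTotals are illustrative only.
CREATE VIEW dbo.vw_CarTotals
WITH SCHEMABINDING          -- required before a view can be indexed
AS
SELECT   CarID
         , CarName
         , COUNT_BIG(*) AS Total   -- indexed views with GROUP BY need COUNT_BIG(*)
FROM     dbo.tbl_Cars
GROUP BY CarID
         , CarName;
GO

-- The unique clustered index is what materializes the view.
CREATE UNIQUE CLUSTERED INDEX IX_vw_CarTotals
    ON dbo.vw_CarTotals (CarID, CarName);
GO

-- Apply the windowed aggregate on top of the pre-aggregated rows.
SELECT   CarID
         , CarName
         , Total
         , MAX(Total) OVER() AS [Max Total]
FROM     dbo.vw_CarTotals WITH (NOEXPAND);
```

The NOEXPAND hint asks the optimizer to read the view's index directly; whether it is needed depends on the SQL Server edition, so treat it as optional.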

Without knowing more about your design, however, it is going to be difficult for anyone to tell you what's best. You will get lightning-fast queries from an indexed view. That performance comes at a price, though. The price is maintenance costs. If the underlying table is the target of a large amount of insert/update/delete operations, the maintenance of the indexed view will bog down performance in other areas.

If you share a bit more about your use case and data access patterns, people will be able to share more insight.


MICRO PERFORMANCE TEST

So I generated a little data script and looked at SQL Profiler numbers for the CTE performance vs windowed functions. This is a micro-test, so try some real numbers in your system under real load.

Data generation:

Create table Cars ( CarID int identity (1,1) primary key, 
                    CarName varchar(20), 
                    value int)
GO
insert into Cars (CarName, value)
values  ('Buick', 100),
        ('Ford', 10),
        ('Buick', 300),     
        ('Buick', 100),
        ('Pontiac', 300),       
        ('Bmw', 100),
        ('Mecedes', 300),       
        ('Chevy', 300),     
        ('Buick', 100),
        ('Ford', 200);
GO 1000

This script generates 10,000 rows. I then ran each of the four following queries multiple times:

--just group by
select  CarName,COUNT(*) countThis
FROM    Cars
GROUP BY CarName        

--group by with compute (BAD BAD DEVELOPER!)
select  CarName,COUNT(*) countThis
FROM    Cars
GROUP BY CarName        
COMPUTE  MAX(Count(*));

-- windowed aggregates...
SELECT  CarName,
        COUNT(*) as theCount,
        MAX(Count(*)) OVER(PARTITION BY 'foo') as MaxInAnyGroup
FROM Cars
GROUP BY CarName        

--CTE version
;WITH x AS (
  SELECT   CarName,
           COUNT(*) AS Total
  FROM     Cars
  GROUP BY CarName
)
SELECT x.CarName, x.Total, x2.[Max Total]
FROM x CROSS JOIN (
  SELECT [Max Total] = MAX(Total) FROM x
) AS x2;

After running the above queries, I created an indexed view on the "just group by" query above. Then I ran a query on the indexed view that performed a MAX(Count(*)) OVER(PARTITION BY 'foo').

AVERAGE RESULTS

Query                      CPU       Reads     Duration   
--------------------------------------------------------
Group By                   15        31        7 ms  
Group & Compute            15        31        7 ms
Windowed Functions         14        56        8 ms 
Common Table Exp.          16        62       15 ms
Windowed on Indexed View    0        24        0 ms

Obviously this is a micro-benchmark and only mildly instructive, so take it for what it's worth.
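If you want comparable numbers on your own system without firing up Profiler, one option is to wrap each query in SET STATISTICS IO and SET STATISTICS TIME, which print logical reads and CPU/elapsed time to the Messages tab. This is a sketch against the test Cars table above, not the exact method used to produce the table of results:

```sql
-- Report logical reads and CPU/elapsed time for the statements that follow.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

SELECT  CarName,
        COUNT(*) AS theCount,
        MAX(COUNT(*)) OVER() AS MaxInAnyGroup
FROM    Cars
GROUP BY CarName;

-- Switch the reporting back off once the measurement is done.
SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;
```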

Here's one way:

;WITH x AS
(
  SELECT   CarID
         , CarName
         , COUNT(*) AS Total
  FROM     dbo.tbl_Cars
  GROUP BY CarID, CarName
)
SELECT x.CarID, x.CarName, x.Total, x2.[Max Total]
FROM x CROSS JOIN
(
  SELECT [Max Total] = MAX(Total) FROM x
) AS x2;

On SQL Server 2008 R2 and later, you can use:

GROUP BY CarID, CarName WITH ROLLUP
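Spelled out as a full query, that might look like the sketch below (the IsRollupRow alias is made up for illustration). Note that ROLLUP appends subtotal and grand-total rows, whose Total is the sum of the counts rather than the largest single count, so it answers a slightly different question than the MAX column asked for.

```sql
-- Sketch: ROLLUP adds one subtotal row per CarID plus one grand-total row.
SELECT   CarID
         , CarName
         , COUNT(*) AS Total
         , GROUPING(CarName) AS IsRollupRow  -- 1 on subtotal and grand-total rows
FROM     dbo.tbl_Cars
GROUP BY CarID, CarName WITH ROLLUP;
```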
