
Group by every N records in T-SQL

I have some performance test results in a database, and what I want to do is group every 1,000 records (previously sorted in ascending order by date) and then aggregate the results with AVG.

I'm actually looking for a standard SQL solution, but any T-SQL-specific answers are also appreciated.

The query looks like this:

SELECT TestId, Throughput FROM dbo.Results ORDER BY id
WITH T AS (
  SELECT RANK() OVER (ORDER BY ID) Rank,
    P.Field1, P.Field2, P.Value1, ...
  FROM P
)
SELECT (Rank - 1) / 1000 GroupID, AVG(...)
FROM T
GROUP BY ((Rank - 1) / 1000)
;

Something like that should get you started. If you can provide your actual schema, I can update as appropriate.
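The integer-division bucketing above is easy to sanity-check outside the database. Below is a minimal Python sketch (made-up values, and a bucket size of 4 instead of 1,000) of the same `(Rank - 1) / N` grouping followed by an average per group:

```python
def group_avg(values, bucket_size):
    # Rows are assumed pre-sorted; enumerating from 1 mimics RANK() OVER (ORDER BY ID)
    buckets = {}
    for rank, value in enumerate(values, start=1):
        buckets.setdefault((rank - 1) // bucket_size, []).append(value)
    # AVG(...) per GroupID
    return {gid: sum(vals) / len(vals) for gid, vals in sorted(buckets.items())}

# 10 rows, buckets of 4 -> bucket sizes 4, 4, 2 (the last bucket is short)
print(group_avg([10, 20, 30, 40, 50, 60, 70, 80, 90, 100], 4))
# {0: 25.0, 1: 65.0, 2: 95.0}
```

Note how the last group holds only the leftover rows, which is exactly the caveat the NTILE() answer below addresses.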

The answer should go to Yuck; I only posted as an answer so I could include a code block. I ran a count test to see whether it was grouping by 1,000 or whether the first set was 999. It produced set sizes of 1,000. Great query, Yuck.

    WITH T AS (
      SELECT RANK() OVER (ORDER BY sID) Rank, sID
      FROM docSVsys
    )
    SELECT (Rank - 1) / 1000 GroupID, COUNT(sID)
    FROM T
    GROUP BY ((Rank - 1) / 1000)
    ORDER BY GroupID

I +1'd @Yuck, because I think that is a good answer. But it's worth mentioning NTILE().

The reason being: if you have 10,010 records (for example), then you'll have 11 groupings -- the first 10 with 1,000 records each, and the last with just 10.

If you're comparing averages between each group of 1,000, then you should either discard the last group, as it's not representative, or you could make all the groups the same size.

NTILE() would make all groups the same size; the only caveat is that you'd need to know how many groups you want.

So if your table had 25,250 records, you'd use NTILE(25), and your groupings would be approximately 1,000 in size -- actually 1,010. The benefit is that they'd all be the same size, which might make them more relevant to each other in terms of whatever comparison analysis you're doing.

You could get your group size simply by:

DECLARE @ntile int
SET  @ntile = (SELECT count(1) from myTable) / 1000

And then modifying @Yuck's approach with the NTILE() substitution:

;WITH myCTE AS (
  SELECT NTILE(@ntile) OVER (ORDER BY id) myGroup,
    col1, col2, ...
  FROM dbo.myTable
)
SELECT myGroup, col1, col2...
FROM myCTE
GROUP BY (myGroup), col1, col2...
;
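How NTILE() balances the groups can be sketched in Python. This is an illustration of the documented rule (the first `row_count % n` groups get one extra row), not SQL Server's actual implementation:

```python
def ntile_sizes(n, row_count):
    # NTILE(n) over row_count rows: each group gets row_count // n rows,
    # and the first (row_count % n) groups each get one extra row
    base, extra = divmod(row_count, n)
    return [base + 1] * extra + [base] * (n - extra)

# The 25,250-row example from above: NTILE(25) -> 25 groups of exactly 1,010
print(ntile_sizes(25, 25250))

# An uneven case: 10 rows into 3 groups -> sizes 4, 3, 3
print(ntile_sizes(3, 10))
# [4, 3, 3]
```

So unlike the fixed-size `(Rank - 1) / 1000` approach, no group is ever more than one row larger than another.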

You can also use ROW_NUMBER() instead of RANK(). No FLOOR() required.

declare @GroupSize int = 50

;with ct1 as (  select YourColumn, RowID = Row_Number() over(order by YourColumn)
                from YourTable
             )

select YourColumn, RowID, GroupID = (RowID-1)/@GroupSize + 1
from ct1

The answer above does not actually assign a unique group ID to each 1,000 records; adding Floor() is needed. The following will return all records from your table, with a unique GroupID for each 1,000 rows:

WITH T AS (
  SELECT RANK() OVER (ORDER BY your_field) Rank,
    your_field
  FROM your_table
  WHERE your_field = 'your_criteria'
)
SELECT Floor((Rank-1) / 1000) GroupID, your_field
FROM T

And for my needs, I wanted my GroupID to be a random set of characters, so I changed the Floor(...) GroupID to:

TO_HEX(SHA256(CONCAT(CAST(Floor((Rank-1) / 10) AS STRING),'seed1'))) GroupID

Without the seed value, you and I would get the exact same output, because we're just doing a SHA256 on the numbers 1, 2, 3, etc. Adding the seed makes the output unique, but still repeatable.

This is BigQuery syntax; T-SQL might be slightly different.
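A rough Python equivalent of that hashed GroupID, for intuition. Note this is a sketch: BigQuery's CAST of the FLOOR result to STRING may format the number differently than Python's `str()`, so the exact concatenated string is an assumption; the point is only that seeding the hash makes IDs opaque yet repeatable.

```python
import hashlib

def hashed_group_id(rank, bucket_size, seed):
    # Hash the bucket number together with a seed, in the spirit of
    # TO_HEX(SHA256(CONCAT(CAST(... AS STRING), 'seed1')))
    bucket = (rank - 1) // bucket_size
    return hashlib.sha256((str(bucket) + seed).encode()).hexdigest()

# Same bucket + same seed -> same ID; same bucket + different seed -> different ID
a = hashed_group_id(1, 10, "seed1")
b = hashed_group_id(5, 10, "seed1")   # rank 5 is still bucket 0
c = hashed_group_id(1, 10, "seed2")
print(a == b, a == c)
# True False
```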

Lastly, if you want to leave off the last chunk that is not a full 1,000 rows, you can find it by doing:

WITH T AS (
  SELECT RANK() OVER (ORDER BY your_field) Rank,
    your_field
  FROM your_table
  WHERE your_field = 'your_criteria'
)
SELECT Floor((Rank-1) / 1000) GroupID, your_field
, COUNT(*) OVER(PARTITION BY TO_HEX(SHA256(CONCAT(CAST(Floor((Rank-1) / 1000) AS STRING),'seed1')))) AS CountInGroup
FROM T
ORDER BY CountInGroup

I read more about NTILE after reading @user15481328's answer (resource: https://www.sqlservertutorial.net/sql-server-window-functions/sql-server-ntile-function/ ),

and this solution allowed me to find the max date within each of the 25 groups of my data set:

with cte as (
    select date,
           NTILE(25) OVER ( order by date ) bucket_num
    from mybigdataset
)
select max(date), bucket_num
from cte
group by bucket_num
order by bucket_num
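The max-per-bucket query can be mirrored in Python to see what it returns. This sketch (made-up dates, 3 buckets instead of 25) applies the NTILE sizing rule and then takes the max of each bucket:

```python
from datetime import date

def max_per_bucket(dates, n_buckets):
    # Sort, split into NTILE-style near-equal buckets, then take each bucket's max,
    # mirroring: GROUP BY bucket_num ... select max(date)
    dates = sorted(dates)
    base, extra = divmod(len(dates), n_buckets)
    out, i = [], 0
    for b in range(n_buckets):
        size = base + (1 if b < extra else 0)  # first `extra` buckets get one more row
        out.append(max(dates[i:i + size]))
        i += size
    return out

# 7 dates into 3 buckets (sizes 3, 2, 2): each bucket's max is its last date
ds = [date(2024, 1, d) for d in range(1, 8)]
print(max_per_bucket(ds, 3))
# [datetime.date(2024, 1, 3), datetime.date(2024, 1, 5), datetime.date(2024, 1, 7)]
```

Since the input is sorted by date, the max of each bucket is simply its boundary row, which is what makes this a handy way to find the cut points of the 25 groups.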
