
Group by every N records in T-SQL

I have some performance test results in the database, and what I want to do is group every 1000 records (previously sorted in ascending order by date) and then aggregate the results with AVG.

I'm actually looking for a standard SQL solution, however any T-SQL specific results are also appreciated.

The query looks like this:

SELECT TestId, Throughput FROM dbo.Results ORDER BY id
WITH T AS (
  SELECT RANK() OVER (ORDER BY ID) Rank,
    P.Field1, P.Field2, P.Value1, ...
  FROM P
)
SELECT (Rank - 1) / 1000 GroupID, AVG(...)
FROM T
GROUP BY ((Rank - 1) / 1000)
;

Something like that should get you started. If you can provide your actual schema I can update as appropriate.
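
Plugging in the schema from the question (assuming dbo.Results has the id, TestId, and Throughput columns shown above), the same idea would look roughly like this untested sketch:

WITH T AS (
  SELECT RANK() OVER (ORDER BY id) AS Rank,
         TestId, Throughput
  FROM dbo.Results
)
SELECT (Rank - 1) / 1000 AS GroupID,       -- integer division: ranks 1-1000 -> 0, 1001-2000 -> 1, ...
       AVG(Throughput) AS AvgThroughput    -- cast to float/decimal first if Throughput is an integer type
FROM T
GROUP BY (Rank - 1) / 1000;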

Credit goes to Yuck; I only post as an answer so I could include a code block. I ran a count test to see if it was grouping by 1000 (my first attempt gave a first set of 999), and this version produced set sizes of exactly 1,000. Great query, Yuck.

WITH T AS (
  SELECT RANK() OVER (ORDER BY sID) Rank, sID
  FROM docSVsys
)
SELECT (Rank - 1) / 1000 GroupID, COUNT(sID)
FROM T
GROUP BY ((Rank - 1) / 1000)
ORDER BY GroupID

I +1'd @Yuck, because I think that is a good answer. But it's worth mentioning NTILE().

Reason being, if you have 10,010 records (for example), then you'll have 11 groupings -- the first 10 with 1000 in them, and the last with just 10.

If you're comparing averages between each group of 1000, then you should either discard the last group as it's not a representative group, or...you could make all the groups the same size.

NTILE() would make all groups the same size; the only caveat is that you'd need to know how many groups you wanted.

So if your table had 25,250 records, you'd use NTILE(25), and your groupings would be approximately 1000 in size (actually 1010). The benefit is that they'd all be the same size, which might make them more relevant to each other in terms of whatever comparative analysis you're doing.

You could compute the number of groups (the NTILE() argument) simply with:

DECLARE @ntile int
SET @ntile = (SELECT COUNT(1) FROM myTable) / 1000  -- number of ~1000-row groups

(One caveat: for a table with fewer than 1,000 rows this yields 0, and NTILE() requires a positive integer.)

And then modifying @Yuck's approach with the NTILE() substitution:

;WITH myCTE AS (
  SELECT NTILE(@ntile) OVER (ORDER BY id) myGroup,
    col1, col2, ...
  FROM dbo.myTable
)
SELECT myGroup, col1, col2...
FROM myCTE
GROUP BY (myGroup), col1, col2...
;
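
For the question's table, a concrete version of this NTILE() variant might look like the sketch below (again assuming dbo.Results with id and Throughput; note that to aggregate per bucket you group on myGroup alone):

DECLARE @ntile int;
SET @ntile = (SELECT COUNT(1) FROM dbo.Results) / 1000;  -- number of ~1000-row buckets

WITH myCTE AS (
  SELECT NTILE(@ntile) OVER (ORDER BY id) AS myGroup,
         Throughput
  FROM dbo.Results
)
SELECT myGroup, AVG(Throughput) AS AvgThroughput
FROM myCTE
GROUP BY myGroup;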

You can also use ROW_NUMBER() instead of RANK(). No FLOOR() is required, since T-SQL integer division already truncates.

declare @groupsize int = 50

;with ct1 as (
    select YourColumn, RowID = Row_Number() over(order by YourColumn)
    from YourTable
)
select YourColumn, RowID, GroupID = (RowID - 1) / @groupsize + 1
from ct1
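
From there, getting the per-group averages the question asked for is just an outer GROUP BY on the computed GroupID; a sketch with the same placeholder names:

declare @groupsize int = 50

;with ct1 as (
    select YourColumn, RowID = Row_Number() over(order by YourColumn)
    from YourTable
)
select GroupID = (RowID - 1) / @groupsize + 1,
       avg(YourColumn) as AvgValue   -- assumes YourColumn is numeric
from ct1
group by (RowID - 1) / @groupsize + 1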

The answers above do not assign a unique group id to each 1000 records in BigQuery, where / performs floating-point division rather than T-SQL's truncating integer division; adding Floor() is needed there (this answer uses BigQuery syntax; see the note below). The following will return all records from your table, with a unique GroupID for each 1000 rows:

WITH T AS (
  SELECT RANK() OVER (ORDER BY your_field) Rank,
    your_field
  FROM your_table
  WHERE your_field = 'your_criteria'
)
SELECT Floor((Rank-1) / 1000) GroupID, your_field
FROM T

And for my needs, I wanted my GroupID to be a random-looking set of characters, so I changed the Floor(...) GroupID to:

TO_HEX(SHA256(CONCAT(CAST(Floor((Rank-1) / 10) AS STRING),'seed1'))) GroupID

Without the seed value, you and I would get exactly the same output, because we'd just be taking a SHA256 of the numbers 1, 2, 3, etc. Adding the seed makes the output unique to you, but still repeatable.

This is BigQuery syntax. T-SQL might be slightly different.
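
For what it's worth, a rough T-SQL analogue (assuming SQL Server 2012 or later, where HASHBYTES supports SHA2_256 and CONVERT style 2 renders binary as hex) would be something like:

CONVERT(varchar(64),
        HASHBYTES('SHA2_256',
                  CONCAT(CAST((Rank - 1) / 1000 AS varchar(20)), 'seed1')),
        2) AS GroupID   -- no FLOOR needed in T-SQL: integer division already truncates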

Lastly, if you want to leave off the last chunk that is not a full 1000, you can find it by doing:

WITH T AS (
  SELECT RANK() OVER (ORDER BY your_field) Rank,
    your_field
  FROM your_table
  WHERE your_field = 'your_criteria'
)
SELECT Floor((Rank-1) / 1000) GroupID, your_field
, COUNT(*) OVER(PARTITION BY TO_HEX(SHA256(CONCAT(CAST(Floor((Rank-1) / 1000) AS STRING),'seed1')))) AS CountInGroup
FROM T
ORDER BY CountInGroup
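
In T-SQL, the same trick works with a windowed COUNT, and you can filter on it to keep only the full buckets; a sketch against the question's table (column names assumed as before):

WITH T AS (
  SELECT RANK() OVER (ORDER BY id) AS Rank, Throughput
  FROM dbo.Results
), G AS (
  SELECT (Rank - 1) / 1000 AS GroupID,
         Throughput,
         COUNT(*) OVER (PARTITION BY (Rank - 1) / 1000) AS CountInGroup
  FROM T
)
SELECT GroupID, AVG(Throughput) AS AvgThroughput
FROM G
WHERE CountInGroup = 1000   -- drops the trailing partial group
GROUP BY GroupID;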

I read more about NTILE() after reading @user15481328's answer (resource: https://www.sqlservertutorial.net/sql-server-window-functions/sql-server-ntile-function/ ), and this solution allowed me to find the max date within each of the 25 groups of my data set:

with cte as (
    select date,
           NTILE(25) OVER (order by date) bucket_num
    from mybigdataset
)
select max(date), bucket_num
from cte
group by bucket_num
order by bucket_num
