
How to select only one full row per group in a "group by" query?

In SQL Server, I have a table where a column A stores some data. This data can contain duplicates (i.e. two or more rows will have the same value for the column A).

I can easily find the duplicates by doing:

select A, count(A) as CountDuplicates
from TableName
group by A having (count(A) > 1)

Now, I want to retrieve the values of other columns, let's say B and C. Of course, those B and C values can be different even for the rows sharing the same A value, but it doesn't matter to me. I just want any B value and any C value: the first, the last or a random one.

If I had a small table and one or two columns to retrieve, I would do something like:

select A, count(A) as CountDuplicates, (
    select top 1 child.B from TableName as child where child.A = base.A) as B
from TableName as base group by A having (count(A) > 1)

The problem is that I have many more rows to get, and the table is quite big, so having several child selects will have a high performance cost.

So, is there a less ugly pure SQL solution to do this?


Not sure if my question is clear enough, so here is an example based on the AdventureWorks database. Let's say I want to list available States, and for each State, get its code, a city (any city) and an address (any address). The easiest, and the most inefficient, way to do it would be:

var q = from c in data.StateProvinces select new { c.StateProvinceCode, c.Addresses.First().City, c.Addresses.First().AddressLine1 };

in LINQ-to-SQL, which will do two selects for each of the 181 States, so 363 selects. In my case, I am searching for a way to have a maximum of 182 selects.

The ROW_NUMBER function in a CTE is the way to do this. For example:

DECLARE @mytab TABLE (A INT, B INT, C INT)
INSERT INTO @mytab ( A, B, C ) VALUES (1, 1, 1)
INSERT INTO @mytab ( A, B, C ) VALUES (1, 1, 2)
INSERT INTO @mytab ( A, B, C ) VALUES (1, 2, 1)
INSERT INTO @mytab ( A, B, C ) VALUES (1, 3, 1)
INSERT INTO @mytab ( A, B, C ) VALUES (2, 2, 2)
INSERT INTO @mytab ( A, B, C ) VALUES (3, 3, 1)
INSERT INTO @mytab ( A, B, C ) VALUES (3, 3, 2)
INSERT INTO @mytab ( A, B, C ) VALUES (3, 3, 3)
;WITH numbered AS 
(
    SELECT *, rn=ROW_NUMBER() OVER (PARTITION BY A ORDER BY B, C)
        FROM @mytab AS m
)
SELECT *
    FROM numbered
    WHERE rn=1

As I mentioned in my comment to HLGEM and Philip Kelley, their simple use of an aggregate function does not necessarily return one "solid" record for each A group; instead, it may return column values from many separate rows, all stitched together as if they were a single record. For example, if this were a PERSON table, with the PersonID being the "A" column, and distinct contact records (say, Home and Work), you might wind up returning the person's home city, but their office ZIP code -- and that's clearly asking for trouble.
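To make the stitching concrete, here is a minimal sketch. The table variable @contacts, its columns (PersonID, ContactType, City, Zip) and the sample values are all hypothetical, made up purely for illustration. Person 1 has a Home row and a Work row, and MIN is applied to each column independently:

```sql
-- Hypothetical sample data: one person with two distinct contact records.
DECLARE @contacts TABLE (PersonID INT, ContactType VARCHAR(10), City VARCHAR(20), Zip VARCHAR(10));
INSERT INTO @contacts (PersonID, ContactType, City, Zip) VALUES (1, 'Home', 'Austin', '78701');
INSERT INTO @contacts (PersonID, ContactType, City, Zip) VALUES (1, 'Work', 'Boston', '02108');

-- Each MIN is evaluated on its own column, not on whole rows.
SELECT PersonID, MIN(City) AS City, MIN(Zip) AS Zip
    FROM @contacts
    GROUP BY PersonID;
-- City comes from the Home row ('Austin'), Zip from the Work row ('02108'):
-- a combination that exists in no single row of the table.
```

Whether that matters depends on whether the returned columns need to belong together; if they do, a per-row technique such as ROW_NUMBER is the safer choice.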

The use of ROW_NUMBER, in conjunction with a CTE here, is a little difficult to get used to at first because the syntax is awkward. But it's becoming a pretty common pattern, so it's good to get to know it.

In my sample I've defined a CTE that tacks an extra column rn (standing for "row number") onto the table; the numbering restarts for each value of the A column (that is what PARTITION BY A does). A SELECT on that result, filtering to only those rows having a row number of 1 (i.e., the first record found for that value of A, in B, C order), returns a "solid" record for each A group -- in my example above, you'd be certain to get either the Work or the Home address, but not elements of both mixed together.
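The original question also wanted the duplicate count and only the groups that actually have duplicates. One way this could be sketched, assuming the same shape of data as in the sample (the table variable name @dups and its rows are hypothetical), is to add a windowed COUNT(*) next to ROW_NUMBER in the CTE:

```sql
-- Hypothetical sample data: A = 1 and A = 3 are duplicated, A = 2 is not.
DECLARE @dups TABLE (A INT, B INT, C INT);
INSERT INTO @dups (A, B, C) VALUES (1, 1, 2);
INSERT INTO @dups (A, B, C) VALUES (1, 2, 1);
INSERT INTO @dups (A, B, C) VALUES (2, 2, 2);
INSERT INTO @dups (A, B, C) VALUES (3, 3, 1);
INSERT INTO @dups (A, B, C) VALUES (3, 3, 2);

;WITH numbered AS
(
    SELECT A, B, C,
           rn = ROW_NUMBER() OVER (PARTITION BY A ORDER BY B, C),  -- position within the A group
           CountDuplicates = COUNT(*) OVER (PARTITION BY A)        -- size of the A group
        FROM @dups
)
SELECT A, CountDuplicates, B, C
    FROM numbered
    WHERE rn = 1                 -- one whole row per group
      AND CountDuplicates > 1;   -- only groups with duplicates
```

Note that COUNT(*) counts rows, whereas the question's count(A) ignores rows where A is NULL; if A is nullable, the two can differ for the NULL group.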

It concerns me that you want any old value for fields b and c. If they are meaningless, why are you returning them?

If it truly doesn't matter (and I honestly can't imagine a case where I would ever want this, but it's what you said) and the values for b and c don't even have to be from the same record, group by with the use of min or max is the way to go. It's more complicated if you want the values for a particular record for all fields.

select A, count(A) as CountDuplicates, min(B) as B , min(C) as C
from TableName as base 
group by A 
having (count(A) > 1) 

You can do something like this if you have id as the primary key in your table:

-- pick one id per duplicated A value, then join back to fetch that whole row
select t.id, t.b, t.c
from TableName as t
inner join
(
    select min(id) as id, count(A) as CountDuplicates
    from TableName
    group by A having (count(A) > 1)
) d on t.id = d.id
