
SQL Query Still having duplicates after group by

SELECT *
FROM `eBayorders`
WHERE (`OrderIDAmazon` IS NULL
       OR `OrderIDAmazon` = "null")
  AND `Flag` = "True"
  AND `TYPE` = "GROUP"
  AND (`Carrier` IS NULL
       OR `Carrier` = "null")
  AND LEFT(`SKU`, 1) = "B"
  AND datediff(now(), `TIME`) < 4
  AND (`TrackingInfo` IS NULL
       OR `TrackingInfo` = "null")
  AND `STATUS` = "PROCESSING"
GROUP BY `Name`,
         `SKU`
ORDER BY `TIME` ASC LIMIT 7

I am trying to make sure that no Name and no SKU shows up more than once in the same result. I am grouping by name and then by SKU, but I ran into a problem where the result contained rows with the same name and different SKUs, which I don't want to happen. How can I fix this query to make sure the result set always has distinct names and SKUs?

For example, say I have these orders:

Name: Ben Z, SKU : B000334, oldest
Name: Ben Z, SKU : B000333, second oldest
Name: Will, SKU: B000334, third oldest
Name: John, SKU: B000036, fourth oldest

The query should return only:
Name: Ben Z, SKU : B000334, oldest
Name: John, SKU: B000036, fourth oldest

This is because each Name should have only one entry in the set, and the same goes for each SKU.

There are two problems here.

The first is that the ANSI standard says that if you have a GROUP BY clause, the only things you can put in the SELECT clause are items listed in GROUP BY or items that use an aggregate function (SUM, COUNT, MAX, etc). The query in your question selects all the columns in the table, even those not in the GROUP BY. If you have multiple records that match a group, the database doesn't know which record to use for those extra columns.

MySQL is dumb about this. A sane database server would throw an error and refuse to run that query. SQL Server, Oracle and PostgreSQL will all do that. MySQL will make a guess about which data you want, unless the ONLY_FULL_GROUP_BY SQL mode is enabled (it has been on by default since MySQL 5.7.5). It's not usually a good idea to let your DB server make guesses about data.

But that doesn't explain the duplicates... just why the bad query runs at all. The reason you have duplicates is that you group on both Name and SKU. So, for example, for Ben Z's record you want to see just the oldest SKU. But when you group on both Name and SKU, you get a separate group for { Ben Z, B000334 } and { Ben Z, B000333 } ... that's two rows for Ben Z, but it's what the query asked for, since SKU is also part of what determines a group.

If you only want to see one record per person, you need to group by just the person fields. This may mean building that part of the query first, to determine the base record set you need, and then JOINing it back to the original table as part of your full solution.
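To make the grouping behaviour concrete, here is a small sketch that uses Python's built-in sqlite3 module as a stand-in for MySQL (an assumption made only so the example runs anywhere). The table is cut down to three columns and TIME is a plain integer where 1 means oldest; the rows are the ones from the question.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE eBayorders (Name TEXT, SKU TEXT, TIME INTEGER)")
conn.executemany(
    "INSERT INTO eBayorders VALUES (?, ?, ?)",
    [
        ("Ben Z", "B000334", 1),  # oldest
        ("Ben Z", "B000333", 2),
        ("Will", "B000334", 3),
        ("John", "B000036", 4),
    ],
)

# Grouping on both columns: one group per distinct (Name, SKU) pair.
by_name_and_sku = conn.execute(
    "SELECT Name, SKU, COUNT(*) FROM eBayorders "
    "GROUP BY Name, SKU ORDER BY Name, SKU"
).fetchall()

# Grouping on Name alone: one group per person.
by_name = conn.execute(
    "SELECT Name, COUNT(*) FROM eBayorders GROUP BY Name ORDER BY Name"
).fetchall()

print(by_name_and_sku)  # Ben Z appears in two separate groups
print(by_name)          # Ben Z appears once, with a count of 2
```

Every (Name, SKU) pair is unique in this data, so grouping on both columns collapses nothing, while grouping on Name alone folds Ben Z's two orders into one group.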
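The "group by the person first, then join back" idea can be sketched as follows. This again uses Python's sqlite3 module in place of MySQL, a three-column version of the table, and integer TIME values (1 = oldest); all of these are simplifying assumptions, and the alias names `oldest` and `first_time` are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stand-in for MySQL (assumption)
conn.execute("CREATE TABLE eBayorders (Name TEXT, SKU TEXT, TIME INTEGER)")
conn.executemany(
    "INSERT INTO eBayorders VALUES (?, ?, ?)",
    [
        ("Ben Z", "B000334", 1),
        ("Ben Z", "B000333", 2),
        ("Will", "B000334", 3),
        ("John", "B000036", 4),
    ],
)

# Step 1 (inner query): one row per Name, carrying that person's oldest TIME.
# Step 2 (join): pull the full order row that matches that Name and TIME.
first_per_name = conn.execute(
    """
    SELECT o.Name, o.SKU, o.TIME
    FROM eBayorders AS o
    JOIN (SELECT Name, MIN(TIME) AS first_time
          FROM eBayorders
          GROUP BY Name) AS oldest
      ON o.Name = oldest.Name
     AND o.TIME = oldest.first_time
    ORDER BY o.TIME
    """
).fetchall()

print(first_per_name)
```

Note that this only guarantees one row per Name. In the sample data SKU B000334 still shows up for both Ben Z and Will, so removing repeated SKUs needs a further step.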

SELECT T1.*
FROM eBayorders T1
JOIN
  ( SELECT `Name`,
           `SKU`,
           max(`TIME`) AS MAX_TIME
   FROM eBayorders
   WHERE (`OrderIDAmazon` IS NULL OR `OrderIDAmazon` = "null") AND `Flag` = "True" AND `TYPE` = "GROUP" AND (`Carrier` IS NULL OR `Carrier` = "null") AND LEFT(`SKU`, 1) = "B" AND datediff(now(), `TIME`) < 4 AND (`TrackingInfo` IS NULL OR `TrackingInfo` = "null") AND `STATUS` = "PROCESSING"
   GROUP BY `Name`,
            `SKU`) AS dedupe ON T1.`Name` = dedupe.`Name`
AND T1.`SKU` = dedupe.`SKU`
AND T1.`Time` = dedupe.`MAX_TIME`
ORDER BY `TIME` ASC LIMIT 7

Your database platform should have complained because your original query had items in the select list which were not present in the group by (generally not allowed). The above should resolve it.

An even better option, if your database supports window functions, would be the following (MySQL has supported them only since version 8.0):

SELECT *
FROM
  ( SELECT *,
           row_number() over (partition BY `Name`, `SKU`
                              ORDER BY `TIME` ASC) AS dedupe_rank
   FROM eBayorders
   WHERE (`OrderIDAmazon` IS NULL OR `OrderIDAmazon` = "null") AND `Flag` = "True" AND `TYPE` = "GROUP" AND (`Carrier` IS NULL OR `Carrier` = "null") AND LEFT(`SKU`, 1) = "B" AND datediff(now(), `TIME`) < 4 AND (`TrackingInfo` IS NULL OR `TrackingInfo` = "null") AND `STATUS` = "PROCESSING" ) T
WHERE dedupe_rank = 1
ORDER BY T.`TIME` ASC LIMIT 7
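As a rough illustration of how the PARTITION BY columns decide what counts as a duplicate, the sketch below runs the same row_number() pattern with two different partitions. It uses Python's sqlite3 module instead of MySQL (an assumption; window functions need SQLite 3.25 or newer), a three-column table, and integer TIME values. The helper `rank_one_rows` is hypothetical and exists only for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stand-in for MySQL (assumption)
conn.execute("CREATE TABLE eBayorders (Name TEXT, SKU TEXT, TIME INTEGER)")
conn.executemany(
    "INSERT INTO eBayorders VALUES (?, ?, ?)",
    [
        ("Ben Z", "B000334", 1),
        ("Ben Z", "B000333", 2),
        ("Will", "B000334", 3),
        ("John", "B000036", 4),
    ],
)


def rank_one_rows(partition_cols):
    """Return the oldest row of each partition (hypothetical helper).

    partition_cols is a trusted literal such as "Name" or "Name, SKU".
    """
    sql = f"""
        SELECT Name, SKU, TIME
        FROM (SELECT *,
                     ROW_NUMBER() OVER (PARTITION BY {partition_cols}
                                        ORDER BY TIME ASC) AS dedupe_rank
              FROM eBayorders) AS t
        WHERE dedupe_rank = 1
        ORDER BY TIME ASC
    """
    return conn.execute(sql).fetchall()


per_pair = rank_one_rows("Name, SKU")  # every (Name, SKU) pair survives
per_name = rank_one_rows("Name")       # only the oldest row per person

print(per_pair)
print(per_name)
```

Partitioning by Name and SKU together keeps both of Ben Z's orders, because each pair is its own partition; partitioning by Name alone keeps one row per person.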

You are trying to obtain a result set which has no repeats in either the SKU or the Name column.

You might have to add a subquery to your query to accomplish that. The inner query would group by Name, and the outer query would group by SKU, so that you won't have repeats in either column.

Try this:

SELECT *
FROM
  (SELECT *
   FROM eBayorders
   WHERE (`OrderIDAmazon` IS NULL
          OR `OrderIDAmazon` = "null")
     AND `Flag` = "True"
     AND `TYPE` = "GROUP"
     AND (`Carrier` IS NULL
          OR `Carrier` = "null")
     AND LEFT(`SKU`, 1) = "B"
     AND datediff(now(), `TIME`) < 4
     AND (`TrackingInfo` IS NULL
          OR `TrackingInfo` = "null")
     AND `STATUS` = "PROCESSING"
   GROUP BY Name) AS t
GROUP BY `SKU`
ORDER BY `TIME` ASC LIMIT 7
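The query above leaves it to the server to decide which row represents each group. One possible way to make the same two-stage idea deterministic is sketched below: first keep the oldest row per Name, then keep the oldest row per SKU among those. It runs on Python's sqlite3 module rather than MySQL (an assumption), uses a three-column table with integer TIME values, and the CTE name `per_name` is made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stand-in for MySQL (assumption)
conn.execute("CREATE TABLE eBayorders (Name TEXT, SKU TEXT, TIME INTEGER)")
conn.executemany(
    "INSERT INTO eBayorders VALUES (?, ?, ?)",
    [
        ("Ben Z", "B000334", 1),  # oldest
        ("Ben Z", "B000333", 2),  # second oldest
        ("Will", "B000334", 3),   # third oldest
        ("John", "B000036", 4),   # fourth oldest
    ],
)

distinct_rows = conn.execute(
    """
    WITH per_name AS (
        -- Stage 1: keep a row only if the same Name has no older row.
        SELECT o.Name, o.SKU, o.TIME
        FROM eBayorders AS o
        WHERE NOT EXISTS (SELECT 1
                          FROM eBayorders AS e
                          WHERE e.Name = o.Name AND e.TIME < o.TIME)
    )
    -- Stage 2: among those rows, keep a row only if the same SKU
    -- has no older row.
    SELECT p.Name, p.SKU, p.TIME
    FROM per_name AS p
    WHERE NOT EXISTS (SELECT 1
                      FROM per_name AS q
                      WHERE q.SKU = p.SKU AND q.TIME < p.TIME)
    ORDER BY p.TIME
    """
).fetchall()

print(distinct_rows)
```

On the sample data from the question this returns Ben Z with B000334 and John with B000036. It is a greedy two-pass filter: a person whose oldest order collides with an earlier SKU is dropped entirely, even if they have other orders, so check that this matches the behaviour you want.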

With this approach you just filter out rows that do not contain the largest/latest value for TIME within each Name.

SELECT SKU, Name
FROM eBayOrders o
WHERE NOT EXISTS (SELECT 0 FROM eBayOrders WHERE Name = o.name and Time > o.Time)
GROUP BY SKU, Name

Note: If two records have exactly the same Name and Time values, you may still end up getting duplicates, because the logic you have specified does not provide any way to break a tie.
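One way to break such ties, assuming the table has some unique column, is to compare that column whenever the Time values are equal. The sketch below uses Python's sqlite3 module in place of MySQL and a hypothetical `id` column that is not part of the original table; both are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stand-in for MySQL (assumption)
# `id` is a hypothetical unique column used only as a tie-breaker.
conn.execute(
    "CREATE TABLE eBayorders (id INTEGER PRIMARY KEY, Name TEXT, SKU TEXT, TIME INTEGER)"
)
conn.executemany(
    "INSERT INTO eBayorders VALUES (?, ?, ?, ?)",
    [
        (1, "Ben Z", "B000334", 5),
        (2, "Ben Z", "B000333", 5),  # same Name and TIME as id 1: a tie
        (3, "John", "B000036", 4),
    ],
)

# Without a tie-breaker, both of Ben Z's rows survive the NOT EXISTS filter.
with_ties = conn.execute(
    """
    SELECT Name, SKU
    FROM eBayorders AS o
    WHERE NOT EXISTS (SELECT 0 FROM eBayorders AS e
                      WHERE e.Name = o.Name AND e.TIME > o.TIME)
    ORDER BY Name, SKU
    """
).fetchall()

# With a tie-breaker: on equal TIME, the row with the larger id wins.
tie_broken = conn.execute(
    """
    SELECT Name, SKU
    FROM eBayorders AS o
    WHERE NOT EXISTS (SELECT 0 FROM eBayorders AS e
                      WHERE e.Name = o.Name
                        AND (e.TIME > o.TIME
                             OR (e.TIME = o.TIME AND e.id > o.id)))
    ORDER BY Name
    """
).fetchall()

print(with_ties)
print(tie_broken)
```

Which row should win a tie is a business decision; preferring the larger `id` here is arbitrary.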
