Finding duplicates of the same column data

I've found a query that grabs all of the duplicates and groups them by the column name, but I need to display each record on its own row, grouped by the column name...

What I suspect is that multiple records with the same design column have been uploaded, and I need to be able to compare each row so I can determine which ones are active.

The following query seems like it should work, but it crashes MySQL each time I try to use it:

SELECT *
FROM 2009_product_catalog
WHERE sku IN (
    SELECT sku
    FROM 2009_product_catalog
    GROUP BY sku
    HAVING count(sku) > 1
    )
ORDER BY sku

I need all records to show, not just records that may be duplicates. The reason is that I need to compare the rest of the columns, so I know which duplicates need to go.

Your query is logically correct. However, MySQL has some problems optimizing IN with a subquery. Try this version:

SELECT pc.*
FROM 2009_product_catalog pc
JOIN (SELECT sku
      FROM 2009_product_catalog
      GROUP BY sku
      HAVING COUNT(sku) > 1
     ) pcsum
  ON pcsum.sku = pc.sku
ORDER BY pc.sku;
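
To sanity-check the join version on a few rows before running it against the real table, here is a minimal sketch using Python's built-in sqlite3 module. This is an assumption made for the sake of a runnable example: the question is about MySQL, SQLite needs the table name quoted because it starts with a digit, and the sample rows and the pcid and active columns are hypothetical.

```python
import sqlite3

# Hypothetical sample data; only the sku column comes from the question.
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE "2009_product_catalog" '
    "(pcid INTEGER PRIMARY KEY, sku TEXT, active INTEGER)"
)
conn.executemany(
    'INSERT INTO "2009_product_catalog" (pcid, sku, active) VALUES (?, ?, ?)',
    [
        (1, "A-100", 1),
        (2, "A-100", 0),
        (3, "B-200", 1),  # unique sku, should not be returned
        (4, "C-300", 1),
        (5, "C-300", 1),
        (6, "C-300", 0),
    ],
)

# Same shape as the join query above: the derived table lists the skus
# that occur more than once, and the join pulls back every full row.
DUPLICATES_SQL = """
SELECT pc.*
FROM "2009_product_catalog" pc
JOIN (SELECT sku
      FROM "2009_product_catalog"
      GROUP BY sku
      HAVING COUNT(sku) > 1) pcsum
  ON pcsum.sku = pc.sku
ORDER BY pc.sku, pc.pcid
"""
duplicate_rows = conn.execute(DUPLICATES_SQL).fetchall()
for row in duplicate_rows:
    print(row)
```

Each duplicated record comes back on its own row with all of its columns, so rows that share a sku can be compared side by side.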

If that still doesn't work, then be sure you have an index on 2009_product_catalog(sku, pcid) (where pcid is the unique id of each row in the table). Then try this:

select pc.*
FROM 2009_product_catalog pc
where exists (select 1
              from 2009_product_catalog pc2
              where pc2.sku = pc.sku and pc2.pcid <> pc.pcid
             )
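
A rough way to try the index plus EXISTS combination on sample data, again sketched with Python's sqlite3 module rather than MySQL (an assumption for runnability). The index name idx_sku_pcid and the sample rows are hypothetical; in MySQL the equivalent statement would be CREATE INDEX idx_sku_pcid ON 2009_product_catalog (sku, pcid).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE "2009_product_catalog" (pcid INTEGER PRIMARY KEY, sku TEXT)'
)
conn.executemany(
    'INSERT INTO "2009_product_catalog" (pcid, sku) VALUES (?, ?)',
    [(1, "A-100"), (2, "A-100"), (3, "B-200"),
     (4, "C-300"), (5, "C-300"), (6, "C-300")],
)

# Composite index so the correlated lookup on sku (and the pcid
# comparison) can be answered from the index. The name is hypothetical.
conn.execute(
    'CREATE INDEX idx_sku_pcid ON "2009_product_catalog" (sku, pcid)'
)

# A row is kept when another row with the same sku but a different
# pcid exists.
EXISTS_SQL = """
SELECT pc.*
FROM "2009_product_catalog" pc
WHERE EXISTS (SELECT 1
              FROM "2009_product_catalog" pc2
              WHERE pc2.sku = pc.sku AND pc2.pcid <> pc.pcid)
ORDER BY pc.sku, pc.pcid
"""
exists_rows = conn.execute(EXISTS_SQL).fetchall()
print(exists_rows)

# The plan text differs between engines and versions, so it is printed
# for inspection only.
for step in conn.execute("EXPLAIN QUERY PLAN " + EXISTS_SQL):
    print(step)
```

On MySQL, running EXPLAIN on the query is the corresponding way to confirm whether the index is actually used.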

I think the IN or EXISTS subquery is very heavy on performance.

Assume that your table has a field named id as its primary key. Remember to create an index on your sku field.


SELECT pc.*
FROM 
    2009_product_catalog pc
        INNER JOIN 2009_product_catalog pc2 ON pc.sku = pc2.sku AND pc.id != pc2.id

Edit


SELECT pc.*, pc2.id as `pc2_id`
FROM 
    2009_product_catalog pc
        LEFT OUTER JOIN 2009_product_catalog pc2 ON pc.sku = pc2.sku AND pc.id != pc2.id

This query gives you all records. Every duplicated record has a non-null pc2_id; if pc2_id is null, the record is not duplicated. However, if a sku occurs more than twice, each of its records will appear in the result more than once. Is that a problem?
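
The repeated rows can be seen on a small example, along with one possible way around them: join to a per-sku count instead of to the other rows. This is a sketch using Python's sqlite3 module (an assumption; the thread is about MySQL), and the sample data and the sku_count alias are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE "2009_product_catalog" (id INTEGER PRIMARY KEY, sku TEXT)'
)
conn.executemany(
    'INSERT INTO "2009_product_catalog" (id, sku) VALUES (?, ?)',
    [(1, "A-100"), (2, "A-100"), (3, "B-200"),
     (4, "C-300"), (5, "C-300"), (6, "C-300")],
)

# The self LEFT JOIN from the answer: a record repeats once per
# *other* record that shares its sku.
left_rows = conn.execute("""
    SELECT pc.*, pc2.id AS pc2_id
    FROM "2009_product_catalog" pc
    LEFT OUTER JOIN "2009_product_catalog" pc2
      ON pc.sku = pc2.sku AND pc.id != pc2.id
""").fetchall()
print(len(left_rows), "rows from the self join")

# Alternative sketch: exactly one row per record, with the number of
# records sharing its sku. sku_count > 1 marks a duplicate.
counted_rows = conn.execute("""
    SELECT pc.*, cnt.n AS sku_count
    FROM "2009_product_catalog" pc
    JOIN (SELECT sku, COUNT(*) AS n
          FROM "2009_product_catalog"
          GROUP BY sku) cnt
      ON cnt.sku = pc.sku
    ORDER BY pc.sku, pc.id
""").fetchall()
for row in counted_rows:
    print(row)
```

With three records for one sku, the self join returns each of them twice, while the counted version keeps every record on a single row and still shows which ones are duplicated.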

SELECT * FROM 2009_product_catalog t1 INNER JOIN
( SELECT sku FROM 2009_product_catalog GROUP BY sku HAVING COUNT(sku) > 1 ) t2
ON t1.sku = t2.sku

This is an alternative to the original query posted in your question. It joins against a derived table instead of using an IN subquery, which MySQL often executes faster.

t1 is the original table. t2 contains only the sku values that occur more than once. The result (inner join) will contain the records that have a duplicate sku.

Notice: Technical posts on this site are licensed under CC BY-SA 4.0. If you repost, please cite this site's URL or the original source.