Finding duplicates of the same column data

Question

I've found a query that grabs all of the duplicates and groups them by the column name, but I need to display each record on its own row, grouped by the column name...

I suspect that multiple records with the same design column have been uploaded, and I need to compare each row so I can determine which ones are active and which are not.

The following query seems like it should work, but it crashes MySQL each time I try to use it:

SELECT *
FROM 2009_product_catalog
WHERE sku IN (
    SELECT sku
    FROM 2009_product_catalog
    GROUP BY sku
    HAVING count(sku) > 1
    )
ORDER BY sku

I need all records to show, not just records that may be duplicates. The reason is, I need to be able to compare the rest of the columns, so I can know which duplicates need to go.

Answer 1

Your query is logically correct. However, MySQL has trouble optimizing IN with a subquery. Try this version:

SELECT pc.*
FROM 2009_product_catalog pc JOIN
     (SELECT sku
      FROM 2009_product_catalog
      GROUP BY sku
      HAVING COUNT(sku) > 1
     ) pcsum
     ON pcsum.sku = pc.sku
ORDER BY pc.sku;

If that still doesn't work, make sure you have an index on 2009_product_catalog(sku, pcid), where pcid is the unique id of each row in the table. Then try this:

SELECT pc.*
FROM 2009_product_catalog pc
WHERE EXISTS (SELECT 1
              FROM 2009_product_catalog pc2
              WHERE pc2.sku = pc.sku AND pc2.pcid <> pc.pcid
             );
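
To see what the EXISTS version returns, here is a minimal sketch that runs it against a tiny in-memory SQLite database from Python. The table name product_catalog, the active column, the index name idx_sku_pcid and the sample rows are all made up for illustration; the real table is 2009_product_catalog in MySQL, and its columns are not shown in the question. SQLite and MySQL plan queries differently, so this only demonstrates the result set, not the speed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical miniature of the catalog table (assumed columns).
conn.execute(
    "CREATE TABLE product_catalog (pcid INTEGER PRIMARY KEY, sku TEXT, active INTEGER)"
)
conn.executemany(
    "INSERT INTO product_catalog (pcid, sku, active) VALUES (?, ?, ?)",
    [
        (1, "A100", 1),
        (2, "A100", 0),
        (3, "B200", 1),  # the only sku that is not duplicated
        (4, "C300", 1),
        (5, "C300", 1),
        (6, "C300", 0),
    ],
)

# Composite index on (sku, pcid), as suggested above; the index name is arbitrary.
conn.execute("CREATE INDEX idx_sku_pcid ON product_catalog (sku, pcid)")

# Keep a row only if some other row (different pcid) has the same sku.
rows = conn.execute(
    """
    SELECT pc.*
    FROM product_catalog pc
    WHERE EXISTS (SELECT 1
                  FROM product_catalog pc2
                  WHERE pc2.sku = pc.sku AND pc2.pcid <> pc.pcid)
    ORDER BY pc.sku, pc.pcid
    """
).fetchall()

for row in rows:
    print(row)
# (1, 'A100', 1)
# (2, 'A100', 0)
# (4, 'C300', 1)
# (5, 'C300', 1)
# (6, 'C300', 0)
```

Each duplicated record comes back exactly once, with all of its columns, so the rows for one sku can be compared side by side.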

Answer 2

I think the IN and EXISTS subqueries can be very expensive here.

Assume that your table has a field named id as its primary key. Remember to create an index on your sku field.


SELECT pc.*
FROM 
    2009_product_catalog pc
        INNER JOIN 2009_product_catalog pc2 ON pc.sku = pc2.sku AND pc.id != pc2.id

Edit


SELECT pc.*, pc2.id as `pc2_id`
FROM 
    2009_product_catalog pc
        LEFT OUTER JOIN 2009_product_catalog pc2 ON pc.sku = pc2.sku AND pc.id != pc2.id

This query returns every record. For a duplicated record pc2_id is not null; if pc2_id is null, the record is not duplicated. However, if a sku occurs more than twice, each of its records will appear in the result more than once. Is that a problem?
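
To make the repeated rows concrete, here is a small sketch using Python's sqlite3 module with an in-memory database. The table name product_catalog and the sample rows are invented for illustration (the real table is 2009_product_catalog in MySQL). The second query, which collapses the repeats with COUNT and GROUP BY, is only one possible variant and is not part of the answer above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical sample data: A100 occurs twice, B200 once, C300 three times.
conn.execute("CREATE TABLE product_catalog (id INTEGER PRIMARY KEY, sku TEXT)")
conn.executemany(
    "INSERT INTO product_catalog (id, sku) VALUES (?, ?)",
    [(1, "A100"), (2, "A100"), (3, "B200"), (4, "C300"), (5, "C300"), (6, "C300")],
)

# The LEFT OUTER JOIN from the answer: one output row per (record, other copy) pair.
joined = conn.execute(
    """
    SELECT pc.id, pc.sku, pc2.id AS pc2_id
    FROM product_catalog pc
    LEFT OUTER JOIN product_catalog pc2
        ON pc.sku = pc2.sku AND pc.id != pc2.id
    ORDER BY pc.id, pc2.id
    """
).fetchall()
print(len(joined))  # 9 rows for 6 records: each C300 record appears twice

# Assumed variant: count the other copies so each record appears exactly once.
collapsed = conn.execute(
    """
    SELECT pc.id, pc.sku, COUNT(pc2.id) AS other_copies
    FROM product_catalog pc
    LEFT OUTER JOIN product_catalog pc2
        ON pc.sku = pc2.sku AND pc.id != pc2.id
    GROUP BY pc.id, pc.sku
    ORDER BY pc.id
    """
).fetchall()
for row in collapsed:
    print(row)
# (1, 'A100', 1)
# (2, 'A100', 1)
# (3, 'B200', 0)
# (4, 'C300', 2)
# (5, 'C300', 2)
# (6, 'C300', 2)
```

In the collapsed form, other_copies = 0 plays the role of pc2_id being null: the record is not duplicated.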

Answer 3

SELECT * FROM 2009_product_catalog t1 INNER JOIN
( SELECT sku FROM 2009_product_catalog GROUP BY sku HAVING COUNT(sku) > 1 ) t2
ON t1.sku = t2.sku

This is an alternative to the original query posted in your question. It uses a join against a derived table instead of an IN subquery, which MySQL often executes faster.

t1 is the original table. t2 contains only the sku values that occur more than once. The result of the inner join therefore contains only the records whose sku is duplicated.
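
As a rough illustration of what t2 and the join contain, the sketch below runs both pieces on an in-memory SQLite database from Python. The table name product_catalog and the rows are hypothetical stand-ins for 2009_product_catalog, and SQLite is used only because it needs no server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical sample data: A100 and C300 are duplicated, B200 is not.
conn.execute("CREATE TABLE product_catalog (id INTEGER PRIMARY KEY, sku TEXT)")
conn.executemany(
    "INSERT INTO product_catalog (id, sku) VALUES (?, ?)",
    [(1, "A100"), (2, "A100"), (3, "B200"), (4, "C300"), (5, "C300"), (6, "C300")],
)

# The derived table t2 on its own: one row per duplicated sku.
dup_skus = [
    r[0]
    for r in conn.execute(
        "SELECT sku FROM product_catalog GROUP BY sku HAVING COUNT(sku) > 1 ORDER BY sku"
    )
]
print(dup_skus)  # ['A100', 'C300']

# The full join; an ORDER BY is added here only to make the output stable.
cur = conn.execute(
    """
    SELECT *
    FROM product_catalog t1
    INNER JOIN (SELECT sku
                FROM product_catalog
                GROUP BY sku
                HAVING COUNT(sku) > 1) t2
        ON t1.sku = t2.sku
    ORDER BY t1.sku, t1.id
    """
)
rows = cur.fetchall()
column_count = len(cur.description)

print([r[0] for r in rows])  # [1, 2, 4, 5, 6]
print(column_count)          # 3: id and sku from t1, plus sku again from t2
```

Note that SELECT * also returns the sku column from t2, so sku shows up twice in each row; selecting t1.* instead avoids that.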

Finding duplicates of the same column data

Question

3 answers

solution1
0 2013-06-04 23:54:39

solution2
0 2013-06-05 03:33:50

solution3
0 2013-06-05 17:50:50
