Find duplicate records in MySQL
I want to pull out the duplicate records in a MySQL database. This can be done with:
SELECT address, count(id) as cnt FROM list
GROUP BY address HAVING cnt > 1
Which results in:
100 MAIN ST 2
I would like to pull it so that it shows each row that is a duplicate. Something like:
JIM JONES 100 MAIN ST
JOHN SMITH 100 MAIN ST
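Any thoughts on how this can be done? I'm trying to avoid running the first query and then looking up the duplicates with a second query in code.
The key is to rewrite this query so that it can be used as a subquery.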
SELECT firstname,
lastname,
list.address
FROM list
INNER JOIN (SELECT address
FROM list
GROUP BY address
HAVING COUNT(id) > 1) dup
ON list.address = dup.address;
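To try the subquery join without a MySQL server, here is a minimal sketch that runs the same SQL against an in-memory SQLite database through Python's sqlite3 module. SQLite is only a stand-in for MySQL here, and the table schema and the third sample row are assumptions made for illustration.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL `list` table (assumed schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE list (id INTEGER PRIMARY KEY, firstname TEXT, lastname TEXT, address TEXT)"
)
conn.executemany(
    "INSERT INTO list (firstname, lastname, address) VALUES (?, ?, ?)",
    [
        ("JIM", "JONES", "100 MAIN ST"),
        ("JOHN", "SMITH", "100 MAIN ST"),
        ("MARY", "BROWN", "7 OAK AVE"),  # hypothetical row with a unique address
    ],
)

# Join every row to the set of addresses that occur more than once.
duplicate_rows = conn.execute(
    """
    SELECT firstname, lastname, list.address
    FROM list
    INNER JOIN (SELECT address
                FROM list
                GROUP BY address
                HAVING COUNT(id) > 1) dup
            ON list.address = dup.address
    ORDER BY list.id
    """
).fetchall()

for row in duplicate_rows:
    print(*row)
```

Only the two rows sharing 100 MAIN ST come back; the row with the unique address is filtered out by the join.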
SELECT date FROM logs group by date having count(*) >= 2
Why not just INNER JOIN the table with itself?
SELECT a.firstname, a.lastname, a.address
FROM list a
INNER JOIN list b ON a.address = b.address
WHERE a.id <> b.id
A DISTINCT is needed if an address can occur more than twice.
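A small sketch of why DISTINCT matters for the self-join, again using SQLite through Python's sqlite3 module as a stand-in for MySQL. The schema and the three sample people are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE list (id INTEGER PRIMARY KEY, firstname TEXT, lastname TEXT, address TEXT)"
)
# Three hypothetical people sharing one address.
conn.executemany(
    "INSERT INTO list (firstname, lastname, address) VALUES (?, ?, ?)",
    [
        ("JIM", "JONES", "100 MAIN ST"),
        ("JOHN", "SMITH", "100 MAIN ST"),
        ("ANN", "LEE", "100 MAIN ST"),
    ],
)

# The self-join from the answer, with an optional DISTINCT keyword.
self_join = """
    SELECT {distinct} a.firstname, a.lastname, a.address
    FROM list a
    INNER JOIN list b ON a.address = b.address
    WHERE a.id <> b.id
"""
plain_rows = conn.execute(self_join.format(distinct="")).fetchall()
distinct_rows = conn.execute(self_join.format(distinct="DISTINCT")).fetchall()

print(len(plain_rows), len(distinct_rows))
```

Each of the three rows pairs with the two others, so the plain join returns six rows while the DISTINCT version returns three. DISTINCT compares only the selected columns, so adding a.id to the select list keeps two people with identical names apart.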
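I tried the answer chosen as best for this question, but it confused me somewhat. I actually only needed this on a single field of my table. The following example from this link worked very well for me: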
SELECT COUNT(*) c,title FROM `data` GROUP BY title HAVING c > 1;
Isn't this easier:
SELECT *
FROM tc_tariff_groups
GROUP BY group_id
HAVING COUNT(group_id) >1
?
select `cityname` from `codcities` group by `cityname` having count(*)>=2
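This is a query similar to the one you asked for, and it is 200% working and simple too. Enjoy!!!
Use this query to find duplicate users by email address...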
SELECT users.name, users.uid, users.mail, from_unixtime(created)
FROM users
INNER JOIN (
SELECT mail
FROM users
GROUP BY mail
HAVING count(mail) > 1
) dupes ON users.mail = dupes.mail
ORDER BY users.mail;
Duplicates can also depend on more than one field. For those cases you can use the format below.
SELECT COUNT(*), column1, column2
FROM tablename
GROUP BY column1, column2
HAVING COUNT(*)>1;
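Finding duplicate addresses is much more complex than it seems, especially if you require accuracy. A MySQL query is not enough in this case...
I work at SmartyStreets, where we deal with validation, de-duplication and related problems, and I've seen many different challenges with similar issues.
There are several third-party services that will flag duplicates in a list for you. Doing this solely with a MySQL subquery will not account for differences in address formats and standards. The USPS (for US addresses) has guidelines that set these standards, but only a handful of vendors are certified to perform such operations.
So my recommendation is to export the table to a CSV file, for instance, and submit it to a capable list processor. LiveAddress is one of them, and it will do this for you automatically in a few seconds to a few minutes. It flags duplicate rows with a new field called "Duplicate" containing the value Y.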
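To illustrate the kind of formatting mismatch meant here, this is a deliberately naive sketch in Python. It is not USPS standardization and not how any vendor works; the abbreviation table, function names and sample rows are all hypothetical. It only shows that rows which an exact GROUP BY would treat as different can collapse to the same address after light normalization.

```python
import re
from collections import defaultdict

# Hypothetical, tiny abbreviation table; real address standards cover far more cases.
ABBREVIATIONS = {"STREET": "ST", "AVENUE": "AVE", "ROAD": "RD"}


def normalize_address(address: str) -> str:
    """Upper-case, drop punctuation, collapse whitespace, abbreviate a few suffixes."""
    cleaned = re.sub(r"[^\w\s]", "", address.upper())
    words = [ABBREVIATIONS.get(word, word) for word in cleaned.split()]
    return " ".join(words)


def group_duplicates(rows):
    """Group (name, address) rows by normalized address; keep groups with 2+ rows."""
    groups = defaultdict(list)
    for name, address in rows:
        groups[normalize_address(address)].append(name)
    return {addr: names for addr, names in groups.items() if len(names) > 1}


rows = [
    ("JIM JONES", "100 Main St."),
    ("JOHN SMITH", "100  MAIN STREET"),
    ("MARY BROWN", "7 Oak Avenue"),
]
duplicates = group_duplicates(rows)
print(duplicates)
```

An exact string comparison sees three different addresses here, while the normalized keys put the first two rows together. Real data has many more variations (unit numbers, misspellings, directionals), which is the gap a dedicated service is meant to cover.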
Another solution would be to use table aliases, like so:
SELECT p1.id, p2.id, p1.address
FROM list AS p1, list AS p2
WHERE p1.address = p2.address
AND p1.id != p2.id
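All you are really doing in this case is taking the original list table, creating two pretend tables from it, p1 and p2, and then performing a join on the address column (line 3). Line 4 keeps a row from being matched with itself, so only pairs of different records with the same address are returned.
It's not going to be very efficient, but it should work: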
SELECT *
FROM list AS o
WHERE (SELECT COUNT(*)
FROM list AS i
WHERE i.address = o.address) > 1;
This will select the duplicates in one table pass, with no subqueries.
SELECT *
FROM (
SELECT ao.*, (@r := @r + 1) AS rn
FROM (
SELECT @_address := 'N'
) vars,
(
SELECT *
FROM
list a
ORDER BY
address, id
) ao
WHERE CASE WHEN @_address <> address THEN @r := 0 ELSE 0 END IS NOT NULL
AND (@_address := address ) IS NOT NULL
) aoo
WHERE rn > 1
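This query actually emulates ROW_NUMBER(), which is present in Oracle and SQL Server.
For details, see the article on my blog about emulating it in MySQL.
This will also show you how many duplicates there are, and will order the results without joins: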
SELECT `Language` , id, COUNT( id ) AS how_many
FROM `languages`
GROUP BY `Language`
HAVING how_many >=2
ORDER BY how_many DESC
select * from table_name t1 inner join (select distinct <attribute list> from table_name as temp)t2 where t1.attribute_name = t2.attribute_name
For your table it would be something like
select * from list l1 inner join (select distinct address from list as list2)l2 where l1.address=l2.address
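This query will give you all the distinct address entries in your list table... I am not sure how this will work if you have any primary key values for name, etc.
The fastest procedure for removing duplicates: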
/* create temp table with one primary column id */
INSERT INTO temp(id) SELECT MIN(id) FROM list GROUP BY (isbn) HAVING COUNT(*)>1;
DELETE FROM list WHERE id IN (SELECT id FROM temp);
DELETE FROM temp;
Personally, this query solved my problem:
SELECT `SUB_ID`, COUNT(SRV_KW_ID) as subscriptions FROM `SUB_SUBSCR` group by SUB_ID, SRV_KW_ID HAVING subscriptions > 1;
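What this script does is show every subscriber ID that exists more than once in the table, along with the number of duplicates found.
These are the table columns: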
| SUB_SUBSCR_ID | int(11) | NO | PRI | NULL | auto_increment |
| MSI_ALIAS | varchar(64) | YES | UNI | NULL | |
| SUB_ID | int(11) | NO | MUL | NULL | |
| SRV_KW_ID | int(11) | NO | MUL | NULL | |
Hope it helps you too!
SELECT firstname, lastname, address FROM list
WHERE
Address in
(SELECT address FROM list
GROUP BY address
HAVING count(*) > 1)
SELECT t.*,(select count(*) from city as tt where tt.name=t.name) as count FROM `city` as t where (select count(*) from city as tt where tt.name=t.name) > 1 order by count desc
Replace city with your table. Replace name with your field name.
I use the following:
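-- Match on the grouped columns: selecting id under GROUP BY yields at most one arbitrary id per group
SELECT * FROM mytable
WHERE (column1, column2, column3) IN (
SELECT column1, column2, column3 FROM mytable
GROUP BY column1, column2, column3
HAVING count(*) > 1
)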
SELECT *
FROM (SELECT address, COUNT(id) AS cnt
FROM list
GROUP BY address
HAVING ( COUNT(id) > 1 )) AS dup
Find duplicate Records:
Suppose we have table : Student
student_id int
student_name varchar
Records:
+------------+---------------------+
| student_id | student_name |
+------------+---------------------+
| 101 | usman |
| 101 | usman |
| 101 | usman |
| 102 | usmanyaqoob |
| 103 | muhammadusmanyaqoob |
| 103 | muhammadusmanyaqoob |
+------------+---------------------+
Now we want to see duplicate records
Use this query:
select student_name,student_id ,count(*) c from student group by student_id,student_name having c>1;
+---------------------+------------+---+
| student_name | student_id | c |
+---------------------+------------+---+
| usman | 101 | 3 |
| muhammadusmanyaqoob | 103 | 2 |
+---------------------+------------+---+
SELECT id, count(*) as c
FROM `list`
GROUP BY id HAVING c > 1
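This returns each id together with the number of times that id is repeated, or nothing, in which case you have no repeated ids.
Change the id in the GROUP BY (for example, to address) and it returns the number of times an address is repeated, identified by the first id found with that address.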
SELECT id, count(*) as c
FROM `list`
GROUP BY address HAVING c > 1
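I hope it helps. Enjoy ;)
To quickly see the duplicate rows you can run a single simple query.
Here I am querying the table and listing all duplicate rows that share the same user_id, market_place and sku: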
select user_id, market_place,sku, count(id)as totals from sku_analytics group by user_id, market_place,sku having count(id)>1;
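To delete the duplicate rows you have to decide which row to delete: for example, the one with the lower id (usually the older one) or one chosen by some other date information. In my case, I just want to delete the lower ids, since the newer id holds the latest information.
First double-check that the right records will be deleted. Here I am selecting the records among the duplicates that will be deleted (by unique id).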
select a.user_id, a.market_place,a.sku from sku_analytics a inner join sku_analytics b where a.id< b.id and a.user_id= b.user_id and a.market_place= b.market_place and a.sku = b.sku;
Then I run the delete query to remove the dupes:
delete a from sku_analytics a inner join sku_analytics b where a.id< b.id and a.user_id= b.user_id and a.market_place= b.market_place and a.sku = b.sku;
Back up, double-check, verify, verify the backup, then execute.
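The check-then-delete routine above can be rehearsed on throwaway data first. The sketch below uses SQLite through Python's sqlite3 module as a stand-in for MySQL, with an assumed schema and hypothetical rows. SQLite has no multi-table `DELETE a FROM ... JOIN` form, so the sketch deletes with `WHERE id IN (subquery)` instead; in MySQL that form can be rejected because the subquery reads the table being deleted from, so the join form from the answer is the one to use there.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sku_analytics ("
    "id INTEGER PRIMARY KEY, user_id INTEGER, market_place TEXT, sku TEXT)"
)
# Hypothetical data: ids 1-3 are duplicates of each other, id 4 is unique.
conn.executemany(
    "INSERT INTO sku_analytics (user_id, market_place, sku) VALUES (?, ?, ?)",
    [(1, "US", "A-1"), (1, "US", "A-1"), (1, "US", "A-1"), (2, "DE", "B-2")],
)

# Pair each row with every newer row that has the same user_id, market_place and sku.
pairing = """
    FROM sku_analytics a
    INNER JOIN sku_analytics b
            ON a.user_id = b.user_id
           AND a.market_place = b.market_place
           AND a.sku = b.sku
    WHERE a.id < b.id
"""

# Dry run: list the ids that would be deleted (every row that has a newer twin).
doomed_ids = [
    row[0] for row in conn.execute("SELECT DISTINCT a.id" + pairing + " ORDER BY a.id")
]

# Actual delete, written in the subquery form that SQLite accepts.
conn.execute("DELETE FROM sku_analytics WHERE id IN (SELECT a.id" + pairing + ")")
remaining_ids = [
    row[0] for row in conn.execute("SELECT id FROM sku_analytics ORDER BY id")
]

print(doomed_ids, remaining_ids)
```

The dry run lists ids 1 and 2, and after the delete only the newest duplicate (id 3) and the unique row (id 4) remain.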
SELECT * FROM bookings
WHERE DATE(created_at) = '2022-01-11'
AND code IN (SELECT code
FROM bookings
GROUP BY code
HAVING COUNT(code) > 1)
ORDER BY id DESC
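Most of the answers here do not cope with the case where there is more than one duplicate result and/or more than one column to check for duplication. In that case, you can use this query to get all the duplicate ids: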
SELECT address, email, COUNT(*) AS QUANTITY_DUPLICATES, GROUP_CONCAT(id) AS ID_DUPLICATES
FROM list
GROUP BY address, email
HAVING COUNT(*)>1;
If you want to list every result as a single row, you need a more complex query. This is the one I found to work:
CREATE TEMPORARY TABLE IF NOT EXISTS temptable AS (
SELECT GROUP_CONCAT(id) AS ID_DUPLICATES
FROM list
GROUP BY address, email
HAVING COUNT(*)>1
);
SELECT d.*
FROM list AS d, temptable AS t
WHERE FIND_IN_SET(d.id, t.ID_DUPLICATES)
ORDER BY d.id;
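As a possible alternative to the temporary table and FIND_IN_SET, the duplicated (address, email) combinations can be joined straight back to the table. This is a different technique from the one in the answer, sketched here with SQLite through Python's sqlite3 module as a stand-in for MySQL; the schema and sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE list (id INTEGER PRIMARY KEY, address TEXT, email TEXT)")
# Hypothetical rows: ids 1, 2 and 5 share both address and email.
conn.executemany(
    "INSERT INTO list (address, email) VALUES (?, ?)",
    [
        ("100 MAIN ST", "jim@example.com"),
        ("100 MAIN ST", "jim@example.com"),
        ("100 MAIN ST", "john@example.com"),
        ("7 OAK AVE", "mary@example.com"),
        ("100 MAIN ST", "jim@example.com"),
    ],
)

# Join each row to the (address, email) combinations that occur more than once.
rows = conn.execute(
    """
    SELECT d.id, d.address, d.email
    FROM list AS d
    INNER JOIN (SELECT address, email
                FROM list
                GROUP BY address, email
                HAVING COUNT(*) > 1) AS dup
            ON d.address = dup.address
           AND d.email = dup.email
    ORDER BY d.id
    """
).fetchall()

for row in rows:
    print(*row)
```

Row 3 shares the address but not the email, so it is left out. Note that rows with NULL in either column never match under `=`, so they would need separate handling.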
It would go something like this:
SELECT t1.firstname, t1.lastname, t1.address FROM list t1
INNER JOIN list t2
WHERE
t1.id < t2.id AND
t1.address = t2.address;
select address from list where address = any (select address from (select address, count(id) cnt from list group by address having cnt > 1 ) as t1) order by address
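The inner subquery returns the rows with duplicate addresses, then the outer subquery returns the address column for those duplicates. The outer subquery must return only one column because it is used as the operand of the '= any' operator.
Powerlord's answer is indeed the best, and I would recommend one more change: use LIMIT to make sure the db does not get overloaded: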
SELECT firstname, lastname, list.address FROM list
INNER JOIN (SELECT address FROM list
GROUP BY address HAVING count(id) > 1) dup ON list.address = dup.address
LIMIT 10
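It is a good habit to use LIMIT when there is no WHERE clause and when doing joins. Start with a small value, check how heavy the query is, and then increase the limit.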