
Find duplicates in a one to many relationship in SQL

Problem

I have 2 tables:

Table tTag
idTag int
otherColumns

And

Table tTagWord
idTagWord int
idTag int
idWord int
position int

For example:

[image: enter image description here]

So each idTag will have multiple idTagWord rows (an unknown number), and the position matters too. I am looking for the best-performing way to find the duplicates.

A duplicate means two different idTag values that have the same idWord values in the same order (position).
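To make the definition above concrete, here is a minimal Python sketch. The sample rows and the helper names `tag_signature` and `is_duplicate` are made up for illustration; they are not part of the original schema or question.

```python
# Hypothetical sample rows in the shape of tTagWord:
# (idTagWord, idTag, idWord, position)
SAMPLE_ROWS = [
    (1, 1, 10, 1), (2, 1, 20, 2), (3, 1, 30, 3),   # tag 1: 10 20 30
    (4, 2, 10, 1), (5, 2, 20, 2), (6, 2, 30, 3),   # tag 2: 10 20 30 (duplicate of tag 1)
    (7, 3, 30, 1), (8, 3, 20, 2), (9, 3, 10, 3),   # tag 3: same words, different order
    (10, 4, 10, 1), (11, 4, 20, 2),                # tag 4: only a prefix of tag 1
]


def tag_signature(rows, id_tag):
    """Return the idWord values of one tag as a tuple, ordered by position."""
    pairs = [(position, id_word)
             for _id_tag_word, tag, id_word, position in rows
             if tag == id_tag]
    return tuple(id_word for _position, id_word in sorted(pairs))


def is_duplicate(rows, tag_a, tag_b):
    """Two different tags are duplicates when their ordered word lists match."""
    return tag_a != tag_b and tag_signature(rows, tag_a) == tag_signature(rows, tag_b)
```

With these rows, only tags 1 and 2 count as duplicates: tag 3 has the same words in a different order, and tag 4 is missing a word.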

What I have tried

SELECT GROUP_CONCAT(DISTINCT tab.idTag SEPARATOR ',') INTO @idTagSet
FROM (  SELECT idTag,GROUP_CONCAT(idWord order by position ASC SEPARATOR ' ') AS Tag
        FROM tTagWord
        GROUP BY idTag) AS tab
INNER JOIN (SELECT idTag,GROUP_CONCAT(idWord order by position ASC SEPARATOR ' ') AS Tag
            FROM tTagWord
            GROUP BY idTag) AS tab2 ON tab.Tag = tab2.Tag
WHERE tab.idTag <> tab2.idTag;

The previous query returns the set of duplicate idTags, so it works, but the performance is terrible. With 150,000 idTag values it already takes several minutes, and the table will soon have millions of them.

I also tried something like this answer:

select idTag, GROUP_CONCAT(idWord order by position ASC SEPARATOR '-') AS idWordSet
from tTagWord
group by idTag
Having COUNT(idWordSet) > 1;

But I can't seem to find a way. Any ideas?

How about trying two GROUP BYs?

SELECT words, count(*), group_concat(idtag) as tags
FROM (SELECT idTag, GROUP_CONCAT(idWord order by position ASC SEPARATOR ' ') AS words
      FROM tTagWord
      GROUP BY idTag
     ) t
GROUP BY words
HAVING count(*) > 1;
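As an illustration only, the logic of the two nested GROUP BYs can be mirrored in plain Python. This is a sketch of the idea, not of how MySQL executes the query; the sample rows and the function name `find_duplicate_groups` are hypothetical.

```python
from collections import defaultdict

# Hypothetical sample rows: (idTagWord, idTag, idWord, position)
SAMPLE_ROWS = [
    (1, 1, 10, 1), (2, 1, 20, 2), (3, 1, 30, 3),
    (4, 2, 10, 1), (5, 2, 20, 2), (6, 2, 30, 3),
    (7, 3, 30, 1), (8, 3, 20, 2), (9, 3, 10, 3),
    (10, 4, 10, 1), (11, 4, 20, 2),
]


def find_duplicate_groups(rows):
    """Map each repeated word sequence to the sorted list of tags sharing it."""
    # Inner GROUP BY idTag: collect (position, idWord) pairs per tag.
    per_tag = defaultdict(list)
    for _id_tag_word, id_tag, id_word, position in rows:
        per_tag[id_tag].append((position, id_word))

    # Outer GROUP BY words: bucket tags by their position-ordered word sequence.
    by_words = defaultdict(list)
    for id_tag, pairs in per_tag.items():
        words = tuple(id_word for _position, id_word in sorted(pairs))
        by_words[words].append(id_tag)

    # HAVING count(*) > 1: keep only sequences shared by several tags.
    return {words: sorted(tags) for words, tags in by_words.items() if len(tags) > 1}
```

Each tag is visited once and compared through a single grouping key, instead of joining every concatenated string against every other one as in the self-join from the question.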

This type of query is sometimes known as relational division; there's a whole bunch of methods at https://www.simple-talk.com/sql/t-sql-programming/divided-we-stand-the-sql-of-relational-division/

One example is:

select
    t1.idTag as tag1,
    t2.IdTag as tag2
from
    tTagWord t1
        inner join
    tTagWord t2
        on t1.idWord = t2.idWord and
           t1.position = t2.position and
           t1.idTag < t2.idTag
group by
    t1.idTag,
    t2.idTag
having
    count(*) = (
        select
            count(*)
        from
            tTagWord t3
        where
            t3.idTag = t1.idTag
    ) and
    count(*) = (
        select
            count(*)
        from
            tTagWord t4
        where
            t4.idTag = t2.idTag
    );
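To try the join-based query on a small data set, the sketch below runs it against an in-memory SQLite database through Python's standard `sqlite3` module. SQLite stands in for MySQL here purely as an assumption for convenience; the sample rows and the helper name `duplicate_pairs` are hypothetical, and the result says nothing about performance on MySQL.

```python
import sqlite3

# Hypothetical sample rows: (idTagWord, idTag, idWord, position)
SAMPLE_ROWS = [
    (1, 1, 10, 1), (2, 1, 20, 2), (3, 1, 30, 3),
    (4, 2, 10, 1), (5, 2, 20, 2), (6, 2, 30, 3),
    (7, 3, 30, 1), (8, 3, 20, 2), (9, 3, 10, 3),
    (10, 4, 10, 1), (11, 4, 20, 2),
]

# Same logic as the query above: a pair of tags is a duplicate when the number
# of matching (idWord, position) rows equals the row count of each tag.
PAIR_QUERY = """
select t1.idTag as tag1, t2.idTag as tag2
from tTagWord t1
inner join tTagWord t2
    on t1.idWord = t2.idWord
   and t1.position = t2.position
   and t1.idTag < t2.idTag
group by t1.idTag, t2.idTag
having count(*) = (select count(*) from tTagWord t3 where t3.idTag = t1.idTag)
   and count(*) = (select count(*) from tTagWord t4 where t4.idTag = t2.idTag)
"""


def duplicate_pairs(rows):
    """Load rows into an in-memory tTagWord table and return duplicate tag pairs."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "create table tTagWord ("
            "idTagWord integer primary key, idTag integer, "
            "idWord integer, position integer)"
        )
        conn.executemany("insert into tTagWord values (?, ?, ?, ?)", rows)
        return sorted(conn.execute(PAIR_QUERY).fetchall())
    finally:
        conn.close()
```

Because of the `t1.idTag < t2.idTag` condition, each duplicate pair is reported once, with the smaller idTag first.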

Here's an example. I've put Gordon's query there too. They might have different performance characteristics.
