Find duplicates across multiple tables

I have a table (T1) and a table of attributes (T2). I'm looking for records that have the same attributes as the record with a given ID.

Here's an example. Given ID 1, I want to find ID 2 (ensuring that the attributes match as well).

T1
ID | A | B
----------
1  | k | l
2  | k | l


T2
IDFK | C | D
-------------    
1    | w | x
1    | y | z
2    | w | x
2    | y | z

Here's the SQL I have so far:

SELECT * FROM T1 
JOIN T1 AS T1COPY ON T1.A = T1COPY.A AND T1.B = T1COPY.B 
JOIN T2 ON T1.ID = T2.IDFK 
JOIN T2 AS T2COPY ON T1COPY.ID = T2COPY.IDFK 
   AND T2.C = T2COPY.C 
   AND T2.D = T2COPY.D
WHERE T1.ID = 1

But it's not working correctly: it matches 2 even when the attributes are different.

Here's the answer for MySQL: http://www.sqlfiddle.com/#!2/ec4fa/2

select h.* 
from 
(
    select x.*
    from t join t x using(a,b)
    where t.id = 1 and x.id <> 1  
) h
join 
(

    select coalesce(x.cpIdFk, x.uIdFk) as idFk  
    from
    (
      select cp.idFk as cpIdFk, u.idFk as uIdFk
      from 
      (
        select t.id as idFk, x.*
        from t cross join (select c, d from u where idFk = 1) as x
        where t.id <> 1      
      ) cp
      left join (select * from u where idFk <> 1) u using(idfk,c,d)

      union

      select cp.idFk,u.idFk
      from 
      (
        select t.id as idFk, x.*
        from t cross join (select c, d from u where idFk = 1) as x
        where t.id <> 1      
      ) cp
      right join (select * from u where idFk <> 1) u using(idfk,c,d)

    ) as x

    group by idFk
    having bit_and(cpidFk is not null and uIdFk is not null)

) d on d.idFk = h.id 
order by h.id;

Output for filter ID == 1:

| ID | A | B |
--------------
|  2 | k | l |
|  5 | k | l |

From these inputs:

CREATE TABLE t
    (ID int, A varchar(1), B varchar(1));

INSERT INTO t
    (ID, A, B)
VALUES
    (1, 'k', 'l'),
    (2, 'k', 'l'),
    (3, 'k', 'l'),
    (4, 'k', 'l'),
    (5, 'k', 'l'),
    (6, 'k', 'j');


CREATE TABLE u
    (IDFK int, C varchar(1), D varchar(1));

INSERT INTO u
    (IDFK, C, D)
VALUES
    (1, 'w', 'x'),
    (1, 'y', 'z'),

    (2, 'w', 'x'),
    (2, 'y', 'z'),

    (3, 'w', 'x'),
    (3, 'y', 'z'),
    (3, 'm', 'z'),

    (4, 'w', 'x'),

    (5, 'w', 'x'),
    (5, 'y', 'z'),

    (6, 'w', 'x'),
    (6, 'y', 'z');

Explanation here: Find duplicates across multiple tables

The MySQL query looks a little convoluted because MySQL doesn't support FULL JOIN and, in the version used here, doesn't have CTEs either (MySQL 8.0 later added them). We simulate FULL JOIN by taking the UNION of a LEFT JOIN and a RIGHT JOIN.
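As a rough illustration of the UNION trick on its own (this is not the answer's query), here is a minimal sketch in Python using the standard sqlite3 module. The tables lhs and rhs and their columns are invented for the example. The RIGHT JOIN is written as a LEFT JOIN with the tables swapped, which is equivalent and also runs on older SQLite builds that lack RIGHT JOIN.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE lhs (k INTEGER, v TEXT);  -- hypothetical tables for the demo
    CREATE TABLE rhs (k INTEGER, v TEXT);
    INSERT INTO lhs VALUES (1, 'a'), (2, 'b');
    INSERT INTO rhs VALUES (2, 'b'), (3, 'c');
""")

# First branch keeps unmatched lhs rows, second branch keeps unmatched rhs rows.
# UNION (not UNION ALL) collapses the matched rows that both branches return.
full_join_rows = conn.execute("""
    SELECT l.k AS lk, r.k AS rk
      FROM lhs AS l
      LEFT JOIN rhs AS r ON l.k = r.k AND l.v = r.v
    UNION
    SELECT l.k AS lk, r.k AS rk
      FROM rhs AS r
      LEFT JOIN lhs AS l ON l.k = r.k AND l.v = r.v
""").fetchall()

for row in full_join_rows:
    print(row)
```

A NULL (None) on either side marks a row that exists in only one table, which is exactly the "gap" the answer's query looks for. One caveat: UNION also collapses rows that were genuinely duplicated in the inputs.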

Second Revised Answer

Since the comments state that there can be duplicate rows in T2, a still more complex solution is needed. Here's a query that, I believe, generates the correct data.

-- Query 8B
SELECT x.id
  FROM (SELECT d2.id, d2.c, d2.d
          FROM (SELECT DISTINCT idfk AS id, c, d FROM t2 WHERE idfk != 1) AS d2
          JOIN (SELECT id
                  FROM (SELECT DISTINCT idfk AS id, c, d FROM t2 WHERE idfk != 1) AS x
                 GROUP BY id
                HAVING COUNT(*) = (SELECT COUNT(*)
                                     FROM (SELECT DISTINCT idfk AS id, c, d FROM t2 WHERE idfk  = 1) AS x
                                    GROUP BY id)
               ) AS j2
            ON j2.id = d2.id
       ) AS x
  JOIN (SELECT DISTINCT idfk AS id, c, d FROM t2 WHERE idfk  = 1) AS y
    ON x.c = y.c AND x.d = y.d
 GROUP BY x.id
HAVING COUNT(*) = (SELECT COUNT(*)
                     FROM (SELECT DISTINCT idfk AS id, c, d FROM t2 WHERE idfk  = 1) AS x
                    GROUP BY id);

I doubt it is the simplest possible, but it is a logical continuation of the previous revised answer.
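For readers who want to experiment, here is a condensed variant of the same counting idea, sketched in Python with the standard sqlite3 module. It is not Query 8B verbatim: the CTE name attrs, the constant REF_ID, and the use of a LEFT JOIN with two counts are my own assumptions. The T2 rows are the ones from the example run.

```python
import sqlite3

REF_ID = 1  # the ID whose attribute set we want to match (assumed name)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t2 (idfk INTEGER NOT NULL, c TEXT NOT NULL, d TEXT NOT NULL)")
conn.executemany("INSERT INTO t2 VALUES (?, ?, ?)", [
    (1, 'w', 'x'), (1, 'y', 'z'),
    (2, 'w', 'x'), (2, 'w', 'x'), (2, 'y', 'z'),
    (3, 'w', 'x'), (3, 'y', 'b'), (3, 'y', 'z'),
    (4, 'w', 'x'),
    (5, 'w', 'x'), (5, 'y', 'z'), (5, 'w', 'x'), (5, 'y', 'z'),
])

query = """
WITH attrs AS (SELECT DISTINCT idfk AS id, c, d FROM t2)  -- remove duplicate rows first
SELECT x.id
  FROM attrs AS x
  LEFT JOIN attrs AS y
    ON y.id = :ref AND y.c = x.c AND y.d = x.d
 WHERE x.id <> :ref
 GROUP BY x.id
-- COUNT(*): distinct attributes of the candidate
-- COUNT(y.id): how many of them also belong to the reference ID
HAVING COUNT(*)    = (SELECT COUNT(*) FROM attrs WHERE id = :ref)
   AND COUNT(y.id) = (SELECT COUNT(*) FROM attrs WHERE id = :ref)
 ORDER BY x.id
"""
matching_ids = [row[0] for row in conn.execute(query, {"ref": REF_ID})]
print(matching_ids)
```

On this data the sketch selects IDs 2 and 5, the same IDs the trace shows for Query 8B.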

Example run

Here's the trace output of the query showing the steps as it was developed. The DBMS is IBM Informix Dynamic Server 11.70.FC2 running on Mac OS X 10.7.4, using SQLCMD v88.00 as the SQL command interpreter (no, not Microsoft's johnny-come-lately; the one I first wrote over twenty years ago).

+ BEGIN;
+ CREATE TABLE T1
(ID INTEGER NOT NULL PRIMARY KEY, a CHAR(1) NOT NULL, b CHAR(1) NOT  NULL);
+ INSERT INTO T1 VALUES(1, 'k', 'l');
+ INSERT INTO T1 VALUES(2, 'k', 'l');
+ INSERT INTO T1 VALUES(3, 'a', 'b');
+ INSERT INTO T1 VALUES(4, 'p', 'q');
+ INSERT INTO T1 VALUES(5, 't', 'v');
+ CREATE TABLE T2
(IDFK INTEGER NOT NULL REFERENCES T1, c CHAR(1) NOT NULL, d CHAR(1) NOT  NULL);
+ INSERT INTO T2 VALUES(1, 'w', 'x');
+ INSERT INTO T2 VALUES(1, 'y', 'z');
+ INSERT INTO T2 VALUES(2, 'w', 'x');
+ INSERT INTO T2 VALUES(2, 'w', 'x');
+ INSERT INTO T2 VALUES(2, 'y', 'z');
+ INSERT INTO T2 VALUES(3, 'w', 'x');
+ INSERT INTO T2 VALUES(3, 'y', 'b');
+ INSERT INTO T2 VALUES(3, 'y', 'z');
+ INSERT INTO T2 VALUES(4, 'w', 'x');
+ INSERT INTO T2 VALUES(5, 'w', 'x');
+ INSERT INTO T2 VALUES(5, 'y', 'z');
+ INSERT INTO T2 VALUES(5, 'w', 'x');
+ INSERT INTO T2 VALUES(5, 'y', 'z');
+ SELECT DISTINCT idfk AS id, c, d FROM t2 WHERE idfk != 1;
2|w|x
2|y|z
3|w|x
3|y|b
3|y|z
4|w|x
5|w|x
5|y|z
+ SELECT DISTINCT idfk AS id, c, d FROM t2 WHERE idfk  = 1;
1|w|x
1|y|z
+ SELECT id, COUNT(*) FROM (SELECT DISTINCT idfk AS id, c, d FROM t2 WHERE idfk != 1) AS x GROUP BY id;
2|2
5|2
3|3
4|1
+ SELECT id, COUNT(*) FROM (SELECT DISTINCT idfk AS id, c, d FROM t2 WHERE idfk  = 1) AS x GROUP BY id;
1|2
+ -- Query 5B - IDs having same count of distinct rows as ID = 1
SELECT id
  FROM (SELECT DISTINCT idfk AS id, c, d FROM t2 WHERE idfk != 1) AS x
 GROUP BY id
HAVING COUNT(*) = (SELECT COUNT(*)
                     FROM (SELECT DISTINCT idfk AS id, c, d FROM t2 WHERE idfk  = 1) AS x
                    GROUP BY id);
2
5
+ -- Query 6B
SELECT d2.id, d2.c, d2.d
  FROM (SELECT DISTINCT idfk AS id, c, d FROM t2 WHERE idfk != 1) AS d2
  JOIN (SELECT id
          FROM (SELECT DISTINCT idfk AS id, c, d FROM t2 WHERE idfk != 1) AS x
         GROUP BY id
        HAVING COUNT(*) = (SELECT COUNT(*)
                             FROM (SELECT DISTINCT idfk AS id, c, d FROM t2 WHERE idfk  = 1) AS x
                            GROUP BY id)
       ) AS j2
    ON j2.id = d2.id
 ORDER BY id;
2|w|x
2|y|z
5|w|x
5|y|z
+ -- Query 7B
SELECT x.id, y.id, x.c, y.c, x.d, y.d
  FROM (SELECT d2.id, d2.c, d2.d
          FROM (SELECT DISTINCT idfk AS id, c, d FROM t2 WHERE idfk != 1) AS d2
          JOIN (SELECT id
                  FROM (SELECT DISTINCT idfk AS id, c, d FROM t2 WHERE idfk != 1) AS x
                 GROUP BY id
                HAVING COUNT(*) = (SELECT COUNT(*)
                                     FROM (SELECT DISTINCT idfk AS id, c, d FROM t2 WHERE idfk  = 1) AS x
                                    GROUP BY id)
               ) AS j2
            ON j2.id = d2.id
       ) AS x
  JOIN (SELECT DISTINCT idfk AS id, c, d FROM t2 WHERE idfk  = 1) AS y
    ON x.c = y.c AND x.d = y.d
 ORDER BY x.id, y.id, x.c, x.d;
2|1|w|w|x|x
2|1|y|y|z|z
5|1|w|w|x|x
5|1|y|y|z|z
+ -- Query 8B
SELECT x.id
  FROM (SELECT d2.id, d2.c, d2.d
          FROM (SELECT DISTINCT idfk AS id, c, d FROM t2 WHERE idfk != 1) AS d2
          JOIN (SELECT id
                  FROM (SELECT DISTINCT idfk AS id, c, d FROM t2 WHERE idfk != 1) AS x
                 GROUP BY id
                HAVING COUNT(*) = (SELECT COUNT(*)
                                     FROM (SELECT DISTINCT idfk AS id, c, d FROM t2 WHERE idfk  = 1) AS x
                                    GROUP BY id)
               ) AS j2
            ON j2.id = d2.id
       ) AS x
  JOIN (SELECT DISTINCT idfk AS id, c, d FROM t2 WHERE idfk  = 1) AS y
    ON x.c = y.c AND x.d = y.d
 GROUP BY x.id
HAVING COUNT(*) = (SELECT COUNT(*)
                     FROM (SELECT DISTINCT idfk AS id, c, d FROM t2 WHERE idfk  = 1) AS x
                    GROUP BY id);
2
5
+ ROLLBACK;

First Revised Answer

Step 1: IDs having the same count of rows as ID = 1

SELECT idfk AS id -- Query 5
  FROM t2
 WHERE idfk != 1
 GROUP BY idfk
HAVING COUNT(*) = (SELECT COUNT(*) FROM t2 WHERE t2.idfk = 1);

Step 2: Data corresponding to Query 5

SELECT idfk AS id, c, d -- Query 6
  FROM t2
  JOIN (SELECT idfk AS id
          FROM t2
         WHERE idfk != 1
         GROUP BY idfk
        HAVING COUNT(*) = (SELECT COUNT(*) FROM t2 WHERE t2.idfk = 1)
       ) AS j2
    ON j2.id = t2.idfk
 ORDER BY id;

Step 3: Join rows from Query 6 with rows for ID = 1

SELECT x.id, y.id, x.c, y.c, x.d, y.d -- Query 7
  FROM (SELECT idfk AS id, c, d
          FROM t2
          JOIN (SELECT idfk AS id
                  FROM t2
                 WHERE idfk != 1
                 GROUP BY idfk
                HAVING COUNT(*) = (SELECT COUNT(*) FROM t2 WHERE t2.idfk = 1)
               ) AS j2
            ON j2.id = t2.idfk
       ) AS x
  JOIN (SELECT idfk AS id, c, d
          FROM t2 WHERE idfk = 1
       ) AS y
    ON x.c = y.c AND x.d = y.d
 ORDER BY x.id, y.id, x.c, x.d;

Step 4: IDs from Query 7 where the count is the same as the count for ID = 1

SELECT x.id
  FROM (SELECT idfk AS id, c, d
          FROM t2
          JOIN (SELECT idfk AS id
                  FROM t2
                 WHERE idfk != 1
                 GROUP BY idfk
                HAVING COUNT(*) = (SELECT COUNT(*) FROM t2 WHERE t2.idfk = 1)
               ) AS j2
            ON j2.id = t2.idfk
       ) AS x
  JOIN (SELECT idfk AS id, c, d
          FROM t2 WHERE idfk = 1
       ) AS y
    ON x.c = y.c AND x.d = y.d
 GROUP BY x.id
HAVING COUNT(*) = (SELECT COUNT(*) FROM t2 WHERE t2.idfk = 1);
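The Step 4 query can be tried outside Informix too. The sketch below runs it under SQLite through Python's standard sqlite3 module, with the literal 1 replaced by a named parameter and an ORDER BY added; the helper name ids_matching and the two sample row lists are assumptions made for the illustration. The second data set adds one duplicate row for ID 2 to show how the raw row counts are thrown off by duplicates.

```python
import sqlite3

STEP4_QUERY = """
SELECT x.id
  FROM (SELECT idfk AS id, c, d
          FROM t2
          JOIN (SELECT idfk AS id
                  FROM t2
                 WHERE idfk != :ref
                 GROUP BY idfk
                HAVING COUNT(*) = (SELECT COUNT(*) FROM t2 WHERE t2.idfk = :ref)
               ) AS j2
            ON j2.id = t2.idfk
       ) AS x
  JOIN (SELECT idfk AS id, c, d
          FROM t2 WHERE idfk = :ref
       ) AS y
    ON x.c = y.c AND x.d = y.d
 GROUP BY x.id
HAVING COUNT(*) = (SELECT COUNT(*) FROM t2 WHERE t2.idfk = :ref)
 ORDER BY x.id
"""

def ids_matching(rows, ref_id=1):
    """Load rows into a fresh in-memory T2 and run the Step 4 query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t2 (idfk INTEGER NOT NULL, c TEXT NOT NULL, d TEXT NOT NULL)")
    conn.executemany("INSERT INTO t2 VALUES (?, ?, ?)", rows)
    result = [row[0] for row in conn.execute(STEP4_QUERY, {"ref": ref_id})]
    conn.close()
    return result

clean_rows = [(1, 'w', 'x'), (1, 'y', 'z'), (2, 'w', 'x'), (2, 'y', 'z'),
              (3, 'w', 'x'), (3, 'y', 'b'), (3, 'y', 'z'), (4, 'w', 'x')]
rows_with_duplicate = clean_rows + [(2, 'w', 'x')]  # ID 2 now has three rows

print(ids_matching(clean_rows))           # ID 2 matches
print(ids_matching(rows_with_duplicate))  # ID 2 is filtered out by the row count
```

With the duplicate present, ID 2 has three rows against two for ID 1, so the count comparison rejects it even though its set of distinct attributes is unchanged.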

Example run

The DBMS is IBM Informix Dynamic Server 11.70.FC2 running on Mac OS X 10.7.4, using SQLCMD v88.00 as the SQL command interpreter (no, not Microsoft's johnny-come-lately; the one I first wrote over twenty years ago).

+ BEGIN;
+ CREATE TABLE T1
(ID INTEGER NOT NULL PRIMARY KEY, a CHAR(1) NOT NULL, b CHAR(1) NOT  NULL);
+ INSERT INTO T1 VALUES(1, 'k', 'l');
+ INSERT INTO T1 VALUES(2, 'k', 'l');
+ INSERT INTO T1 VALUES(3, 'a', 'b');
+ INSERT INTO T1 VALUES(4, 'p', 'q');
+ CREATE TABLE T2
(IDFK INTEGER NOT NULL REFERENCES T1, c CHAR(1) NOT NULL, d CHAR(1) NOT  NULL);
+ INSERT INTO T2 VALUES(1, 'w', 'x');
+ INSERT INTO T2 VALUES(1, 'y', 'z');
+ INSERT INTO T2 VALUES(2, 'w', 'x');
+ INSERT INTO T2 VALUES(2, 'y', 'z');
+ INSERT INTO T2 VALUES(3, 'w', 'x');
+ INSERT INTO T2 VALUES(3, 'y', 'b');
+ INSERT INTO T2 VALUES(3, 'y', 'z');
+ INSERT INTO T2 VALUES(4, 'w', 'x');
+ SELECT t1.id AS id, t2.c, t2.d -- Query 1
  FROM t1
  JOIN t2 ON t1.id = t2.idfk;
1|w|x
1|y|z
2|w|x
2|y|z
3|w|x
3|y|b
3|y|z
4|w|x
+ -- Query 5 - IDs having same count of rows as ID = 1

SELECT idfk AS id
  FROM t2
 WHERE idfk != 1
 GROUP BY idfk
HAVING COUNT(*) = (SELECT COUNT(*) FROM t2 WHERE t2.idfk = 1);
2
+ SELECT idfk AS id, c, d
  FROM t2
  JOIN (SELECT idfk AS id
          FROM t2
         WHERE idfk != 1
         GROUP BY idfk
        HAVING COUNT(*) = (SELECT COUNT(*) FROM t2 WHERE t2.idfk = 1)
       ) AS j2
    ON j2.id = t2.idfk
 ORDER BY id;
2|w|x
2|y|z
+ SELECT x.id, y.id, x.c, y.c, x.d, y.d
  FROM (SELECT idfk AS id, c, d
          FROM t2
          JOIN (SELECT idfk AS id
                  FROM t2
                 WHERE idfk != 1
                 GROUP BY idfk
                HAVING COUNT(*) = (SELECT COUNT(*) FROM t2 WHERE t2.idfk = 1)
               ) AS j2
            ON j2.id = t2.idfk
       ) AS x
  JOIN (SELECT idfk AS id, c, d
          FROM t2 WHERE idfk = 1
       ) AS y
    ON x.c = y.c AND x.d = y.d
 ORDER BY x.id, y.id, x.c, x.d;
2|1|w|w|x|x
2|1|y|y|z|z
+ SELECT x.id
  FROM (SELECT idfk AS id, c, d
          FROM t2
          JOIN (SELECT idfk AS id
                  FROM t2
                 WHERE idfk != 1
                 GROUP BY idfk
                HAVING COUNT(*) = (SELECT COUNT(*) FROM t2 WHERE t2.idfk = 1)
               ) AS j2
            ON j2.id = t2.idfk
       ) AS x
  JOIN (SELECT idfk AS id, c, d
          FROM t2 WHERE idfk = 1
       ) AS y
    ON x.c = y.c AND x.d = y.d
 GROUP BY x.id
HAVING COUNT(*) = (SELECT COUNT(*) FROM t2 WHERE t2.idfk = 1);
2
+ ROLLBACK;

Original answer

This at least elicited sufficient clarification of the question.

As far as I can tell, if you have a sub-query like:

SELECT t1.id AS id, t2.c, t2.d  -- Query 1
  FROM t1
  JOIN t2 ON t1.id = t2.idfk

then you are looking for pairs of rows in the result set where the values in c and d are the same but the id values are different. So, we write the main query based on that:

SELECT j1.id, j2.id  -- Query 2
  FROM (SELECT t1.id AS id, t2.c, t2.d
          FROM t1
          JOIN t2 ON t1.id = t2.idfk
       ) AS j1
  JOIN (SELECT t1.id AS id, t2.c, t2.d
          FROM t1
          JOIN t2 ON t1.id = t2.idfk
       ) AS j2
    ON j1.c = j2.c AND j1.d = j2.d AND j1.id != j2.id

You can ensure you don't get both '1, 2' and '2, 1' by changing the != condition into either < or >.
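To make the effect of that change concrete, here is a small sketch using Python's standard sqlite3 module. It joins T2 to itself directly instead of going through the T1 sub-queries, which is a simplification on my part, and the helper name matching_pairs is hypothetical. The T2 rows are the ones used in the test run further down.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t2 (idfk INTEGER NOT NULL, c TEXT NOT NULL, d TEXT NOT NULL)")
conn.executemany("INSERT INTO t2 VALUES (?, ?, ?)", [
    (1, 'w', 'x'), (1, 'y', 'z'),
    (2, 'w', 'x'), (2, 'y', 'z'),
    (3, 'a', 'z'), (3, 'y', 'b'),
])

def matching_pairs(comparison):
    """Return (id, id) pairs sharing a (c, d) row, using the given id comparison."""
    # comparison comes only from the fixed strings passed below, never from user input
    sql = f"""
        SELECT j1.idfk, j2.idfk
          FROM t2 AS j1
          JOIN t2 AS j2
            ON j1.c = j2.c AND j1.d = j2.d AND j1.idfk {comparison} j2.idfk
         ORDER BY j1.idfk, j2.idfk
    """
    return conn.execute(sql).fetchall()

both_orders = matching_pairs("!=")  # each pair appears as (1, 2) and as (2, 1)
one_order = matching_pairs("<")     # only the (1, 2) orientation survives

print(both_orders)
print(one_order)
```

Each pair still appears once per shared attribute row, so the < comparison removes the mirrored pairs but not the repeats.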

If you want the rows that match a specific ID value in T1, then you can specify it in a WHERE clause:

SELECT j2.id  -- Query 3
  FROM (SELECT t1.id AS id, t2.c, t2.d
          FROM t1
          JOIN t2 ON t1.id = t2.idfk
       ) AS j1
  JOIN (SELECT t1.id AS id, t2.c, t2.d
          FROM t1
          JOIN t2 ON t1.id = t2.idfk
       ) AS j2
    ON j1.c = j2.c AND j1.d = j2.d AND j1.id != j2.id
 WHERE j1.id = 1;  -- 1 is the ID for which matches are sought

You can add conditions into the sub-queries if you wish (though a good optimizer might manage to do that for you):

SELECT j2.id  -- Query 4
  FROM (SELECT t1.id AS id, t2.c, t2.d
          FROM t1
          JOIN t2 ON t1.id = t2.idfk AND t1.id = 1
       ) AS j1
  JOIN (SELECT t1.id AS id, t2.c, t2.d
          FROM t1
          JOIN t2 ON t1.id = t2.idfk AND t1.id != 1
       ) AS j2
    ON j1.c = j2.c AND j1.d = j2.d
 WHERE j1.id = 1;  -- 1 is the ID for which matches are sought

The third condition in the main ON clause was redundant since, by construction, the ID values in the j1 sub-query are all 1 and the ID values in the j2 sub-query are all 'not 1'.


I fixed the issue with t2.id vs t2.idfk in the SQL, and I've run the 4 queries above. Each produces the answer I'd expect. There are two rows in the result set for, say, Query 4 because there are two pairs of values (a, b) such that both rows { 1, a, b } and { 2, a, b } exist in T2. If you only want the 2 to appear once, despite there being many matching rows, then you'll need to apply a DISTINCT to the SELECT.

In a comment, you say:

Unfortunately it will still return results even if one of the attributes does not match. How to match every single attribute in T2?

That requires an extended data set to demonstrate. When I added:

INSERT INTO T1 VALUES(3, 'a', 'b');
INSERT INTO T2 VALUES(3, 'a', 'z');
INSERT INTO T2 VALUES(3, 'y', 'b');

the value 3 appeared only in the results of Query 1, which is the only place it should appear.

Please illustrate what you are seeing as the incorrect behaviour, showing the sample data. I tested the queries above with the following SQL; the query results are interleaved. The DBMS is IBM Informix Dynamic Server 11.70.FC2 running on Mac OS X 10.7.4, using SQLCMD v88.00 as the SQL command interpreter.

+ BEGIN;
+ CREATE TEMP TABLE T1
(ID INTEGER NOT NULL PRIMARY KEY, A CHAR(1) NOT NULL, B CHAR(1) NOT  NULL);
+ INSERT INTO T1 VALUES(1, 'k', 'l');
+ INSERT INTO T1 VALUES(2, 'k', 'l');
+ INSERT INTO T1 VALUES(3, 'a', 'b');
+ CREATE TEMP TABLE T2
(IDFK INTEGER NOT NULL, C CHAR(1) NOT NULL, D CHAR(1) NOT  NULL);
+ INSERT INTO T2 VALUES(1, 'w', 'x');
+ INSERT INTO T2 VALUES(1, 'y', 'z');
+ INSERT INTO T2 VALUES(2, 'w', 'x');
+ INSERT INTO T2 VALUES(2, 'y', 'z');
+ INSERT INTO T2 VALUES(3, 'a', 'z');
+ INSERT INTO T2 VALUES(3, 'y', 'b');
+ SELECT t1.id AS id, t2.c, t2.d -- Query 1
  FROM t1
  JOIN t2 ON t1.id = t2.idfk;
1|w|x
1|y|z
2|w|x
2|y|z
3|a|z
3|y|b
+ SELECT j1.id, j2.id -- Query 2
  FROM (SELECT t1.id AS id, t2.c, t2.d
          FROM t1
          JOIN t2 ON t1.id = t2.idfk
       ) AS j1
  JOIN (SELECT t1.id AS id, t2.c, t2.d
          FROM t1
          JOIN t2 ON t1.id = t2.idfk
       ) AS j2
    ON j1.c = j2.c AND j1.d = j2.d AND j1.id != j2.id;
1|2
1|2
2|1
2|1
+ SELECT j2.id -- Query 3
  FROM (SELECT t1.id AS id, t2.c, t2.d
          FROM t1
          JOIN t2 ON t1.id = t2.idfk
       ) AS j1
  JOIN (SELECT t1.id AS id, t2.c, t2.d
          FROM t1
          JOIN t2 ON t1.id = t2.idfk
       ) AS j2
    ON j1.c = j2.c AND j1.d = j2.d AND j1.id != j2.id
 WHERE j1.id = 1;
2
2
+ SELECT j2.id  -- Query 4
  FROM (SELECT t1.id AS id, t2.c, t2.d
          FROM t1
          JOIN t2 ON t1.id = t2.idfk AND t1.id = 1
       ) AS j1
  JOIN (SELECT t1.id AS id, t2.c, t2.d
          FROM t1
          JOIN t2 ON t1.id = t2.idfk AND t1.id != 1
       ) AS j2
    ON j1.c = j2.c AND j1.d = j2.d
 WHERE j1.id = 1;
2
2
+ ROLLBACK;

My approach to this is to combine the two columns of attributes into single values using group_concat. I can then easily find all ids that have the same combined values, and return those ids.

select allts.id
from (-- combined attribute lists for the reference id
      select group_concat(c order by c separator ';') as allcs,
             group_concat(d order by d separator ';') as allds
      from t2
      where t2.idfk = 1
     ) t2_1 join
     (-- combined attribute lists for every id
      select t2.idfk as id,
             group_concat(c order by c separator ';') as allcs,
             group_concat(d order by d separator ';') as allds
      from t2
      group by t2.idfk
     ) allts
     on t2_1.allcs = allts.allcs and t2_1.allds = allts.allds;

This version does not take into account any information in t1. Your question only mentioned the attributes in t2.
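The same "one combined value per id" idea can be sketched outside MySQL. The version below uses Python's standard sqlite3 module and builds the combined value in Python rather than with group_concat; the function names are hypothetical. One deliberate difference from the query above: it keeps each (c, d) pair together instead of concatenating the c values and the d values separately, so ID 3 in the sample data (attributes w/z and y/x) is not mistaken for a match of ID 1 (attributes w/x and y/z).

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t2 (idfk INTEGER, c TEXT, d TEXT)")
conn.executemany("INSERT INTO t2 VALUES (?, ?, ?)", [
    (1, 'w', 'x'), (1, 'y', 'z'),
    (2, 'y', 'z'), (2, 'w', 'x'),
    (3, 'w', 'z'), (3, 'y', 'x'),  # same c values and same d values as ID 1, paired differently
    (4, 'w', 'x'),
])

def attribute_fingerprints(connection):
    """Map each idfk to a sorted tuple of its distinct (c, d) pairs."""
    pairs_by_id = defaultdict(set)
    for idfk, c, d in connection.execute("SELECT idfk, c, d FROM t2"):
        pairs_by_id[idfk].add((c, d))
    return {idfk: tuple(sorted(pairs)) for idfk, pairs in pairs_by_id.items()}

def ids_matching(connection, ref_id):
    """Return the other ids whose fingerprint equals that of ref_id."""
    fingerprints = attribute_fingerprints(connection)
    target = fingerprints.get(ref_id)
    return sorted(i for i, fp in fingerprints.items() if i != ref_id and fp == target)

matches = ids_matching(conn, 1)
print(matches)
```

Because the pairs are collected into a set, duplicate rows in t2 do not affect the comparison.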

Let's start by defining the attribute data that represents the input ID:

select idfk, c, d
from t2
where idfk = @ID

Now we can use this information to select potential matches in T2, where IDFK isn't @ID:

select x.idfk, x.c, x.d, y.idfk as id2 from (
    select idfk, c, d
    from t2
    where idfk = @ID
) x left join t2 y on x.c = y.c and x.d = y.d
where y.idfk <> @ID 
  and y.idfk is not null

Data in the second query is a suitable match for the data in the first query if the count of rows for each value of id2 is the same as the count of rows from the first query.

Hence:

select id2 from ( 
    select id2, count(*) as rowcount from (
        <second query>
    ) z
    group by id2  -- one count per candidate id
) rowsByID
where rowcount = (select count(*) from (<first query>) IDattributes)

I'm uncertain whether you intend that returned rows must match on A & B as well, or just on the data in T2, but if I assume they must match on A & B, then:

select ID from t1 
    join (<third query>) m on t1.id = m.id2
    join (select a, b from t1 where id = @id) prime_row on t1.a = prime_row.a and t1.b = prime_row.b

If you don't need A & B to match, drop the second join.

How's this?

This is probably going to be incredibly slow on a large table. (Edit: I now know that full joins are not available in MySQL, but the first query is still valid for other systems and potentially a little easier to understand. Skip to the second one if you don't care about it.)

I used a question mark as the parameter marker. Every marker should receive the same value: the id of the record whose duplicates are being matched. Add the condition "and T.id <> ?" to exclude the matching row from the result set. (I had thought the OP wanted both rows 1 and 2.) tX represents the search space, so the row could also be excluded there and eliminated earlier in the process.

select *
from T1 as T
where T.id in (
    select coalesce(attrR.idfk, tX.id)
    from
        T1 as tX
        cross join
        (select * from T2 where T2.idfk = ?) as attrL
        full outer join T2 as attrR
            on      attrR.idfk = tX.id
                and attrR.c = attrL.c
                and attrR.d = attrL.d
    group by coalesce(attrR.idfk, tX.id)
    having count(*) =
        sum(case
                when attrR.c = attrL.c and attrR.d = attrL.d
                then 1 else 0
            end
        )
);

This gets around the lack of full outer join:

select *
from T1 as T
where T.id in (
    select attrR.idfk
    from
        T1 as tX
        cross join
        (select * from T2 where idfk = ?) as attrL
        right outer join
        T2 as attrR
            on      attrR.idfk = tX.id
                and attrR.c = attrL.c
                and attrR.d = attrL.d
        cross join
        (select count(*) as cnt from T2 where idfk = ?) as tC
    group by attrR.idfk
    having
        sum(case
                when attrR.c = attrL.c and attrR.d = attrL.d
                then 1 else 1000000
            end
        ) = min(tC.cnt)
);

This compound check is equivalent to the sum(case...) expression. You may find one more readable than the other.

    having
            count(attrL.idfk) = min(tC.cnt)
        and count(*) = min(tC.cnt)

The first and second queries I provided do work, but only if each T1 row has at least one attribute in T2. Here's a version that compensates by adding a dummy attribute, which prevents empty sets in the intermediate results from messing things up. It's uglier, so don't use it unless you need to handle that case. (The full join version would need similar adjustments.)

select *
from T1 as T
where T.id in (
    select attrR.idfk
    from
        T1 as tX
        cross join
        (
            select c, d from T2 where idfk = ?
            union all
            select '!@#$%', '' -- add a dummy attribute
        ) as attrL
        right outer join
        (
            select idfk, c, d from T2
            union all
            select id, '!@#$%', '' from T1
        ) as attrR
            on      attrR.idfk = tX.id
                and attrR.c = attrL.c
                and attrR.d = attrL.d
        cross join
        (select count(*)+1 as cnt from T2 where idfk = ?) as tC -- note the +1
    group by attrR.idfk
    having
            count(tX.id) = min(tC.cnt)
        and count(*) = min(tC.cnt)
);

If by chance you port your system to PostgreSQL, you can use FULL JOIN: http://www.sqlfiddle.com/#!1/1f0ef/1

with headers_matches as
(
    select x.*
    from t join t x using(a,b)
    where t.id = 1 and x.id <> 1
)
,cp as
(
    select t.id as idFk, x.*
    from t cross join (select c, d from u where idFk = 1) as x
    where t.id <> 1
)
,details_matches as
(
    select coalesce(cp.idFk,u.idFk) as idFk
    from cp
    full join (select * from u where idFk <> 1) u using(idfk,c,d)
    group by idFk
    having every(cp.idFk is not null and u.idFk is not null)
)
select h.* 
from headers_matches h
join details_matches d on d.idFk = h.id 
order by h.id;

Output for filter ID == 1:

| ID | A | B |
--------------
|  2 | k | l |
|  5 | k | l |

From these inputs:

CREATE TABLE t
    (ID int, A varchar(1), B varchar(1));

INSERT INTO t
    (ID, A, B)
VALUES
    (1, 'k', 'l'),
    (2, 'k', 'l'),
    (3, 'k', 'l'),
    (4, 'k', 'l'),
    (5, 'k', 'l'),
    (6, 'k', 'j');



CREATE TABLE u
    (IDFK int, C varchar(1), D varchar(1));

INSERT INTO u
    (IDFK, C, D)
VALUES
    (1, 'w', 'x'),
    (1, 'y', 'z'),

    (2, 'w', 'x'),
    (2, 'y', 'z'),

    (3, 'w', 'x'),
    (3, 'y', 'z'),
    (3, 'm', 'z'),

    (4, 'w', 'x'),

    (5, 'w', 'x'),
    (5, 'y', 'z'),

    (6, 'w', 'x'),
    (6, 'y', 'z');

How it works

We do the hardest part first, which is the details. We'll do the header in the latter part of this answer.

First we need to cross-populate the details so that we can do a proper full join on them and detect the gaps later:

with cp as -- cross populate
(
    select t.id as idFk, x.*
    from t cross join (select c, d from u where idFk = 1) as x
    where t.id <> 1
)
select *
from cp;

Output:

| IDFK | C | D |
----------------
|    2 | w | x |
|    2 | y | z |
|    3 | w | x |
|    3 | y | z |
|    4 | w | x |
|    4 | y | z |
|    5 | w | x |
|    5 | y | z |
|    6 | w | x |
|    6 | y | z |

Then, from that cross-populated detail, we can do the proper FULL JOIN:

with cp as 
(
    select t.id as idFk, x.*
    from t cross join (select c, d from u where idFk = 1) as x
    where t.id <> 1
)
select 
    cp.idFk as cpIdFk, cp.c as cpC, cp.d as cpD,
    u.idFk as uFk, u.c as uC, u.d as Ud
from cp
full join (select * from u where idFk <> 1) u using(idfk,c,d);

Output:

| CPIDFK |    CPC |    CPD |    UFK |     UC |     UD |
-------------------------------------------------------
|      2 |      w |      x |      2 |      w |      x |
|      2 |      y |      z |      2 |      y |      z |
| (null) | (null) | (null) |      3 |      m |      z |
|      3 |      w |      x |      3 |      w |      x |
|      3 |      y |      z |      3 |      y |      z |
|      4 |      w |      x |      4 |      w |      x |
|      4 |      y |      z | (null) | (null) | (null) |
|      5 |      w |      x |      5 |      w |      x |
|      5 |      y |      z |      5 |      y |      z |
|      6 |      w |      x |      6 |      w |      x |
|      6 |      y |      z |      6 |      y |      z |

With that information at hand, we can now apply the logic for detecting whether there's a gap between the two sets. From the set above, we can see that the IDs with no gaps are #2, #5 and #6. For that we run this query:

with cp as
(
    select t.id as idFk, x.*
    from t cross join (select c, d from u where idFk = 1) as x
    where t.id <> 1
)
,details_matches as
(
    select coalesce(cp.idFk,u.idFk) as idFk
    from cp
    full join (select * from u where idFk <> 1) u using(idfk,c,d)
    group by idFk
    having every(cp.idFk is not null and u.idFk is not null)
)
select * from details_matches
order by idFk;

Output:

| IDFK |
--------
|    2 |
|    5 |
|    6 |

Now we do the header comparison, which is easier:

with headers_matches as
(
    select x.*
    from t join t x using(a,b)
    where t.id = 1 and x.id <> 1
)
select * from headers_matches;

That should return headers #2, #3, #4 and #5, as they are identical to #1's header values:

Output:

| ID | A | B |
--------------
|  2 | k | l |
|  3 | k | l |
|  4 | k | l |
|  5 | k | l |

Finally, we combine the two queries:

with headers_matches as
(
    select x.*
    from t join t x using(a,b)
    where t.id = 1 and x.id <> 1
)
,cp as
(
    select t.id as idFk, x.*
    from t cross join (select c, d from u where idFk = 1) as x
    where t.id <> 1
)
,details_matches as
(
    select coalesce(cp.idFk,u.idFk) as idFk
    from cp
    full join (select * from u where idFk <> 1) u using(idfk,c,d)
    group by idFk
    having every(cp.idFk is not null and u.idFk is not null)
)
select h.* 
from headers_matches h
join details_matches d on d.idFk = h.id 
order by h.id;

Output:

| ID | A | B |
--------------
|  2 | k | l |
|  5 | k | l |
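If FULL JOIN and every() are not available, the same "no gaps in either direction" test can be phrased with nested NOT EXISTS. This is a different technique from the CTE query above, sketched here with Python's standard sqlite3 module against the same t and u data; the aliases cand, r and x and the constant names are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (id INTEGER, a TEXT, b TEXT);
    INSERT INTO t VALUES (1,'k','l'), (2,'k','l'), (3,'k','l'),
                         (4,'k','l'), (5,'k','l'), (6,'k','j');
    CREATE TABLE u (idfk INTEGER, c TEXT, d TEXT);
    INSERT INTO u VALUES (1,'w','x'), (1,'y','z'), (2,'w','x'), (2,'y','z'),
                         (3,'w','x'), (3,'y','z'), (3,'m','z'), (4,'w','x'),
                         (5,'w','x'), (5,'y','z'), (6,'w','x'), (6,'y','z');
""")

DETAILS_MATCH = """
SELECT cand.id
  FROM t AS cand
 WHERE cand.id <> :ref
   -- no attribute of the reference is missing from the candidate
   AND NOT EXISTS (SELECT 1 FROM u AS r
                    WHERE r.idfk = :ref
                      AND NOT EXISTS (SELECT 1 FROM u AS x
                                       WHERE x.idfk = cand.id AND x.c = r.c AND x.d = r.d))
   -- no attribute of the candidate is missing from the reference
   AND NOT EXISTS (SELECT 1 FROM u AS x
                    WHERE x.idfk = cand.id
                      AND NOT EXISTS (SELECT 1 FROM u AS r
                                       WHERE r.idfk = :ref AND r.c = x.c AND r.d = x.d))
"""
# Optional extra filter: the header columns a and b must equal the reference row's
HEADER_MATCH = """
   AND EXISTS (SELECT 1 FROM t AS h
                WHERE h.id = :ref AND h.a = cand.a AND h.b = cand.b)
"""
ORDER = " ORDER BY cand.id"

details_ids = [row[0] for row in conn.execute(DETAILS_MATCH + ORDER, {"ref": 1})]
full_ids = [row[0] for row in conn.execute(DETAILS_MATCH + HEADER_MATCH + ORDER, {"ref": 1})]
print(details_ids)
print(full_ids)
```

On the sample data the detail-only check selects 2, 5 and 6, and adding the header filter drops 6, which agrees with the outputs shown above.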

See the query progression here: http://www.sqlfiddle.com/#!1/1f0ef/1

I'll convert the PostgreSQL query to MySQL later.

UPDATE

Here's the MySQL version: Find duplicates across multiple tables

I've thought through your comments on my previous answer, and would propose a different approach.

select idfk, c, d from t2 where idfk = @ID 

This query identifies all the attribute sets for @ID. Suppose we put this into a temporary table; then, for EACH row in this table, identify all IDFKs in T2, where IDFK <> @ID, that match the source row on all attribute values, and put all these rows into a new table.

My SQL to do this would be (you may need to adapt this for MySQL):

-- Column types are assumed to be char(1), as in the sample data; @ID is the id being matched.
declare @c char(1), @d char(1);

-- An identity column gives every attribute row its own row#,
-- so the loop below handles exactly one row per pass.
create table #attribs (row# int identity(1,1), c char(1), d char(1));

insert #attribs (c, d)
select c, d
from T2 where idfk = @ID;

create table #matchedattrib (idfk int);

while (select count(*) from #attribs) > 0 begin
    -- take the next attribute row and remove it from the work table
    select @c = c, @d = d from #attribs where row# = (select min(row#) from #attribs);
    delete #attribs where row# = (select min(row#) from #attribs);

    -- record every other IDFK that has this attribute row
    insert #matchedattrib (idfk)
    select idfk from T2 where idfk <> @ID and T2.c = @c and T2.d = @d;
end

Having done this, any IDFK in this newer table with the same number of rows as there are attribute sets for @ID (first query) has all the attributes of @ID.

select idfk, count(*) as tot_attribs
into #counts
from #matchedattrib
group by idfk
having count(*) = (select count(*) from (select idfk from T2 where idfk = @ID) x);

However, as you pointed out regarding my previous answer, these IDFKs could have other attributes as well. So, for the IDFKs with the correct number of rows in the second table, you need to check that the number of rows existing for them in T2 is this same number. That verifies that the matching attributes are in fact all the attributes for that IDFK, meaning a total match on attributes.

select idfk from #counts
where tot_attribs = (select count(*) from T2 where idfk = #counts.idfk)
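The loop and the two count checks can also be collapsed into one set-based statement. The sketch below is my own condensation, not the procedure above: it uses Python's standard sqlite3 module, the aliases src, cand and m are invented, and like the temp-table version it assumes T2 holds no duplicate rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t2 (idfk INTEGER NOT NULL, c TEXT NOT NULL, d TEXT NOT NULL)")
conn.executemany("INSERT INTO t2 VALUES (?, ?, ?)", [
    (1, 'w', 'x'), (1, 'y', 'z'),
    (2, 'w', 'x'), (2, 'y', 'z'),
    (3, 'w', 'x'), (3, 'y', 'z'), (3, 'm', 'z'),  # has an extra attribute
    (4, 'w', 'x'),                                # is missing an attribute
])

query = """
SELECT m.idfk
  FROM (-- plays the role of #matchedattrib grouped into #counts
        SELECT cand.idfk AS idfk, COUNT(*) AS tot_attribs
          FROM t2 AS src
          JOIN t2 AS cand ON cand.c = src.c AND cand.d = src.d
         WHERE src.idfk = :ref AND cand.idfk <> :ref
         GROUP BY cand.idfk) AS m
       -- the candidate has every attribute of the reference id ...
 WHERE m.tot_attribs = (SELECT COUNT(*) FROM t2 WHERE t2.idfk = :ref)
       -- ... and has no attributes beyond those
   AND m.tot_attribs = (SELECT COUNT(*) FROM t2 WHERE t2.idfk = m.idfk)
 ORDER BY m.idfk
"""
total_matches = [row[0] for row in conn.execute(query, {"ref": 1})]
print(total_matches)
```

ID 3 passes the first count check but fails the second because of its extra attribute, and ID 4 fails the first.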

If you also need to match on A + B, you'll have to fill that in yourself!
