
Get ID pairs between 2 tables with matching child records

I have 2 tables with the same structure:

FIELD 1      INT
FIELD 2      VARCHAR(32)   -- is a MD5 Hash

The query has to return pairs of FIELD 1 values, one from TABLE 1 and one from TABLE 2, whose records have exactly the same combination of FIELD 2 values in both tables.

These tables are pretty large (1 million records between the two) but have been reduced down to an ID and a hash.

Example data:

TABLE 1

1     A
1     B
2     A
2     D
2     E
3     G
3     H
4     E
4     D
4     C
5     E
5     D

TABLE 2

8     A
8     B
9     E
9     D
9     C
10    F
11    G
11    H
12    B
12    D
13    A
13    B
14    E
14    A

The results of the query should be:

8     1
9     4
11    3
13    1

I have tried creating a concatenated string of FIELD 2 using a correlated sub-query and the FOR XML PATH string trick I read about on here, but that is very slow.

You can also try the following query:

SELECT t_2.Field_1, t_1.Field_1                          --1
  FROM table_1 t_1, table_2 t_2                          --2
 WHERE t_1.Field_2 = t_2.Field_2                         --3
 GROUP BY t_1.Field_1, t_2.Field_1                       --4
HAVING COUNT(*) = (SELECT COUNT(*)                       --5
                     FROM Table_1 t_1_1                  --6
                    WHERE t_1_1.Field_1 = t_1.Field_1)   --7
   AND COUNT(*) = (SELECT COUNT(*)                       --8
                     FROM Table_2 t_2_1                  --9
                    WHERE t_2_1.Field_1 =t_2.Field_1)    --10
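
To check the query against the example data from the question, here is a minimal sketch using Python's built-in `sqlite3` module. SQLite is an assumption made only so the example is self-contained (the question appears to target SQL Server), the column types in the `CREATE TABLE` statements are guesses, the comma join is written as an explicit `JOIN`, and an `ORDER BY` is added to make the output order predictable.

```python
import sqlite3

# Example data from the question: (Field_1, Field_2) rows.
TABLE_1 = [(1, "A"), (1, "B"), (2, "A"), (2, "D"), (2, "E"), (3, "G"), (3, "H"),
           (4, "E"), (4, "D"), (4, "C"), (5, "E"), (5, "D")]
TABLE_2 = [(8, "A"), (8, "B"), (9, "E"), (9, "D"), (9, "C"), (10, "F"),
           (11, "G"), (11, "H"), (12, "B"), (12, "D"), (13, "A"), (13, "B"),
           (14, "E"), (14, "A")]

# Same logic as the query above; ORDER BY is an addition for stable output.
QUERY = """
SELECT t_2.Field_1, t_1.Field_1
  FROM table_1 t_1
  JOIN table_2 t_2 ON t_1.Field_2 = t_2.Field_2
 GROUP BY t_1.Field_1, t_2.Field_1
HAVING COUNT(*) = (SELECT COUNT(*) FROM table_1 a WHERE a.Field_1 = t_1.Field_1)
   AND COUNT(*) = (SELECT COUNT(*) FROM table_2 b WHERE b.Field_1 = t_2.Field_1)
 ORDER BY t_2.Field_1, t_1.Field_1
"""


def matching_pairs(rows_1, rows_2):
    """Load both row lists into an in-memory database and run QUERY."""
    conn = sqlite3.connect(":memory:")
    try:
        # Column types are assumptions; the question only gives INT / VARCHAR(32).
        conn.execute("CREATE TABLE table_1 (Field_1 INTEGER, Field_2 TEXT)")
        conn.execute("CREATE TABLE table_2 (Field_1 INTEGER, Field_2 TEXT)")
        conn.executemany("INSERT INTO table_1 VALUES (?, ?)", rows_1)
        conn.executemany("INSERT INTO table_2 VALUES (?, ?)", rows_2)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for t2_id, t1_id in matching_pairs(TABLE_1, TABLE_2):
        print(t2_id, t1_id)
```

On the example data this yields the four pairs listed in the question: 8 1, 9 4, 11 3 and 13 1.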

Edit

First, the requested result set is the combination of Field1 values from both tables for which the respective Field2 values are exactly the same.

For that, you can use the method I posted above.

The query first joins the two tables on their field2 values (lines 1 to 3), then groups the data by field1 from table1 and field1 from table2 (line 4).

Up to this step, you get every pair of field1 from table1 and field1 from table2 that has at least one matching field2 value.

After this, you just need to filter the result down to the correct pairs (exactly the same field2 values for the respective field1 values). You can do that with a condition on the row count.

Here my assumption is that you don't have multiple rows for the same field1 and field2 combination in either table,

meaning that rows like the following will not be present

1 b
1 b

in either of the tables.

If so, the number of rows joined between table1 and table2 on the same field2 values should match the number of rows present in table1 for that field1 value, and likewise the number of rows present in table2 for its field1 value.

For this, the query has conditions on count(*) in the HAVING clause (lines 5 to 10).
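
The no-duplicates assumption matters for the count comparison. The sketch below illustrates this under the same caveats as any small test: it uses Python's `sqlite3` module (an assumption, not the database from the question), hypothetical two-row tables, and the same query logic written with an explicit `JOIN`. A repeated `1 B` row in table 1 inflates the joined row count, so the pair is no longer reported.

```python
import sqlite3

QUERY = """
SELECT t_2.Field_1, t_1.Field_1
  FROM table_1 t_1
  JOIN table_2 t_2 ON t_1.Field_2 = t_2.Field_2
 GROUP BY t_1.Field_1, t_2.Field_1
HAVING COUNT(*) = (SELECT COUNT(*) FROM table_1 a WHERE a.Field_1 = t_1.Field_1)
   AND COUNT(*) = (SELECT COUNT(*) FROM table_2 b WHERE b.Field_1 = t_2.Field_1)
"""

# Hypothetical data: ID 1 and ID 8 hold the same values, but table 1
# contains the row (1, "B") twice.
WITH_DUPLICATE = [(1, "A"), (1, "B"), (1, "B")]
TABLE_2_ROWS = [(8, "A"), (8, "B")]


def matching_pairs(rows_1, rows_2):
    """Run QUERY against two in-memory tables built from the given rows."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE table_1 (Field_1 INTEGER, Field_2 TEXT)")
        conn.execute("CREATE TABLE table_2 (Field_1 INTEGER, Field_2 TEXT)")
        conn.executemany("INSERT INTO table_1 VALUES (?, ?)", rows_1)
        conn.executemany("INSERT INTO table_2 VALUES (?, ?)", rows_2)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    # The join produces 3 rows for the pair (1, 8), but table 2 has only
    # 2 rows for ID 8, so the second HAVING condition rejects the pair.
    print(matching_pairs(WITH_DUPLICATE, TABLE_2_ROWS))
    # Removing the duplicate row first restores the match.
    print(matching_pairs(sorted(set(WITH_DUPLICATE)), TABLE_2_ROWS))
```

If duplicates can occur in the real tables, removing them before the comparison (for example with `SELECT DISTINCT` in a derived table) is one way to keep the counts meaningful.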

Let me try to explain this version of the query:

select t1.field1 as t1field1, t2.field1 as t2field1
from (select t1.*,
             count(*) over (partition by field1) as NumField2
      from table1 t1
     ) t1 full outer join
     (select t2.*,
             count(*) over (partition by field1) as NumField2
      from table2 t2
     ) t2
     on t1.field2 = t2.field2
where t1.NumField2 = t2.NumField2
group by t1.Field1, t2.Field1
having count(t1.field2) = max(t1.NumField2) and
       count(t2.field2) = max(t2.NumField2)

(which is here at SQLFiddle).

The idea is to compare the following counts for each pair of field1 values:

  1. The number of field2 values on each.
  2. The number of field2 values that they share.

All of these have to be equal.

Each subquery counts the number of field2 values for each field1 value. For the first rows of your data, this produces:
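
The count comparison can be sketched outside SQL as well. The following is a plain Python illustration, not the query itself: the helper names are hypothetical, it assumes no duplicate (field1, field2) rows, and it compares every ID pair, which is fine for the example data but would not scale to a million rows.

```python
from itertools import product

# Example data from the question: (field1, field2) rows.
TABLE_1 = [(1, "A"), (1, "B"), (2, "A"), (2, "D"), (2, "E"), (3, "G"), (3, "H"),
           (4, "E"), (4, "D"), (4, "C"), (5, "E"), (5, "D")]
TABLE_2 = [(8, "A"), (8, "B"), (9, "E"), (9, "D"), (9, "C"), (10, "F"),
           (11, "G"), (11, "H"), (12, "B"), (12, "D"), (13, "A"), (13, "B"),
           (14, "E"), (14, "A")]


def value_sets(rows):
    """Map each field1 value to the set of its field2 values."""
    sets = {}
    for field1, field2 in rows:
        sets.setdefault(field1, set()).add(field2)
    return sets


def pair_counts(rows_1, rows_2):
    """Yield (id1, id2, count on id1, count on id2, shared count) per pair."""
    sets_1, sets_2 = value_sets(rows_1), value_sets(rows_2)
    for (id1, vals_1), (id2, vals_2) in product(sets_1.items(), sets_2.items()):
        yield id1, id2, len(vals_1), len(vals_2), len(vals_1 & vals_2)


def matching_pairs(rows_1, rows_2):
    """Keep the pairs for which all three counts are equal."""
    return sorted((id1, id2)
                  for id1, id2, n1, n2, shared in pair_counts(rows_1, rows_2)
                  if n1 == n2 == shared)


if __name__ == "__main__":
    print(matching_pairs(TABLE_1, TABLE_2))
```

For example, IDs 2 and 9 each have three values but share only two (D and E), so that pair is rejected, while 1 and 8 have two values each and share both.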

1    A    2
1    B    2
2    A    3
2    D    3
2    E    3
. . .

And for the second table:

8    A    2
8    B    2
9    E    3
9    D    3
9    C    3

Next, the full outer join is applied, requiring a match on both the count and the field2 value. This multiplies the data, producing rows such as:

1    A    2    8    A    2
1    B    2    8    B    2
2    A    3    NULL NULL NULL
2    D    3    9    D    3
2    E    3    9    E    3
NULL NULL NULL 9    C    3

And so on for all the possible combinations. Note that the NULLs appear due to the full outer join.

Note that when you have a pair that matches, such as 1 and 8, there are no rows with NULL values. When you have a pair with the same counts but the values don't match, then you have NULL values. When you have a pair with different counts, they are filtered out by the where clause.

The filtering aggregation step applies these rules to get pairs that meet the first condition but not the second.

The having clause essentially removes any pair that has NULL values. When you count() a column, NULL values are not included. In that case, the count() on the column is less than the number of values expected (NumField2).
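
The difference between counting a column and counting rows can be seen in a small sketch. It uses Python's `sqlite3` module as a stand-in database (an assumption; the behaviour of `COUNT` with NULL is standard SQL), and the `joined` table with its column names is hypothetical, modelled on the rows shown above for the pair 2 and 9.

```python
import sqlite3


def null_aware_counts():
    """Return (COUNT(*), COUNT(t1_field2), COUNT(t2_field2)) for sample rows."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE joined (t1_field2 TEXT, t2_field2 TEXT)")
        # Rows as a full outer join would produce them for IDs 2 and 9:
        # D and E match, A exists only on the left, C only on the right.
        conn.executemany(
            "INSERT INTO joined VALUES (?, ?)",
            [("D", "D"), ("E", "E"), ("A", None), (None, "C")],
        )
        return conn.execute(
            "SELECT COUNT(*), COUNT(t1_field2), COUNT(t2_field2) FROM joined"
        ).fetchone()
    finally:
        conn.close()


if __name__ == "__main__":
    print(null_aware_counts())  # (4, 3, 3)
```

`COUNT(*)` counts all four rows, while each column count skips the row where that column is NULL.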
