
SQL query to find matches

I have written a query to find matches within the same database. It gives me the expected results, except that I don't understand a few parts of it. Below is the query:

select f.name, f.id, f.industry, d.name, d.id, d.industry
from product_table f, product_table d
where (f.name like '%' || d.name || '%') and
(f.industry like '%' || d.industry || '%')

I know that this construct is looking for matches between the two columns:

(..... like '%' || ..... || '%')

But what does each part of it do exactly, and what does it mean?

This query is executing a self-join (here, a cross self-join), in which we query two instances of the same table. In this case it looks like some form of data quality exercise, where we suspect we might have near-duplicate records. That is, we think we have several records for the same combination of product name and industry. The wildcards identify records where the value of one column is wholly embedded in another column: for instance, '%STACK%' matches 'META STACKOVERFLOW'.
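To see the self-join at work, here is a minimal sketch that runs the query from the question against a throwaway in-memory table. It assumes Python's built-in sqlite3 as a stand-in for the asker's database, and the three sample rows are made up for illustration. One caveat: SQLite's LIKE ignores ASCII case by default, which Oracle's does not, so the sample data sticks to upper case.

```python
import sqlite3

# Hypothetical sample data; SQLite stands in for the real database.
conn = sqlite3.connect(":memory:")
conn.execute("create table product_table (id integer, name text, industry text)")
conn.executemany(
    "insert into product_table values (?, ?, ?)",
    [
        (1, "STACK", "SOFTWARE"),
        (2, "META STACKOVERFLOW", "SOFTWARE"),
        (3, "WIDGET", "HARDWARE"),
    ],
)

# Cross self-join: every row (alias f) is compared with every row (alias d).
rows = conn.execute(
    """
    select f.id, f.name, d.id, d.name
    from product_table f, product_table d
    where (f.name like '%' || d.name || '%')
      and (f.industry like '%' || d.industry || '%')
    order by f.id, d.id
    """
).fetchall()

for row in rows:
    print(row)
```

Each row pairs with itself, and row 2 also pairs with row 1 because 'STACK' is embedded in 'META STACKOVERFLOW'. The reverse pairing (f = 1, d = 2) does not appear, since the longer name is not contained in the shorter one.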

The posted version has a potential flaw: if there are two records with an exact match, you will get two hits (one for F:D, one for D:F). You can finagle that by adding a filter on id:

select f.name, f.id, f.industry, 
       d.name, d.id, d.industry
from product_table f, product_table d
where (f.name like '%' || d.name || '%') 
and (f.industry like '%' || d.industry || '%') 
and  ( ( f.name = d.name 
        and f.industry = d.industry 
        and f.id < d.id )
    or f.name != d.name 
    or f.industry != d.industry 
    )
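As a rough check of what the id filter buys you, the sketch below runs both versions of the query on made-up data: two exact duplicates (ids 1 and 2) plus one near-duplicate (id 3). It again assumes Python's sqlite3 as a stand-in database, and the helper names `base_query` and `id_filter` are only for illustration.

```python
import sqlite3

# Hypothetical data: ids 1 and 2 are exact duplicates, id 3 is a near-duplicate.
conn = sqlite3.connect(":memory:")
conn.execute("create table product_table (id integer, name text, industry text)")
conn.executemany(
    "insert into product_table values (?, ?, ?)",
    [(1, "XYZ", "RETAIL"), (2, "XYZ", "RETAIL"), (3, "1XYZ1", "RETAIL")],
)

base_query = """
    select f.id, d.id
    from product_table f, product_table d
    where (f.name like '%' || d.name || '%')
      and (f.industry like '%' || d.industry || '%')
"""

# The extra filter from the answer: exact matches are kept one way round only.
id_filter = """
      and ( ( f.name = d.name
              and f.industry = d.industry
              and f.id < d.id )
            or f.name != d.name
            or f.industry != d.industry )
"""

order_by = " order by f.id, d.id"
unfiltered = conn.execute(base_query + order_by).fetchall()
filtered = conn.execute(base_query + id_filter + order_by).fetchall()

print(len(unfiltered))  # 7
print(filtered)         # [(1, 2), (3, 1), (3, 2)]
```

Note that the filter also drops the pairs where a row matches itself, because f.id < d.id is false when both aliases point at the same row.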

The double vertical bar (more commonly known as a pipe) is the concatenation operator. It is used for joining strings together. (Many programming languages use + but Oracle reserves that for arithmetic.)
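A quick way to see what the concatenation produces is to select the built pattern on its own. This is only a sketch, assuming Python's sqlite3 (which also uses || for concatenation) in place of Oracle; the value 'STACK' is bound as a parameter purely for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# First column: the pattern built by ||. Second column: the LIKE test using it.
pattern, matched = conn.execute(
    "select '%' || ? || '%', 'META STACKOVERFLOW' like '%' || ? || '%'",
    ("STACK", "STACK"),
).fetchone()

print(pattern)  # %STACK%
print(matched)  # 1 (SQLite reports true as the integer 1)
```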

Not so clear on why we put it before and after only the second column: f.name like '%' || d.name || '%'

In this case, the query is concatenating wildcards onto d.name. When d.name = 'XYZ', the pattern '%' || d.name || '%' becomes '%XYZ%', which matches these values of f.name:

  • '1XYZ1'
  • '11XYZ11'
  • '11XYZ'
  • 'XYZ1'
  • 'XYZ' <---- matching same record

We don't need to wrap f.name in wildcards because the query is a self-join, so every value of name will appear on the left-hand side of the filter at some point. When f.name = '1XYZ1', it matches the pattern '%' || d.name || '%' for these values of d.name:

  • '1XYZ1' <---- matching same record
  • 'XYZ1'
  • 'XYZ'

So you're going to get multiple hits already. Embedding both sides of the filter in wildcards would only generate more noisy duplicates.
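If you want to check for yourself which way round the matching runs, here is a small sketch using the five example names. It assumes Python's sqlite3 as a stand-in for Oracle and a cut-down table with only id and name; the variable names are illustrative. The value on the left of LIKE (f.name) is the string being searched, and the value wrapped in wildcards (d.name) is the fragment being looked for.

```python
import sqlite3

# Hypothetical cut-down table holding just the example names.
conn = sqlite3.connect(":memory:")
conn.execute("create table product_table (id integer, name text)")
names = ["XYZ", "1XYZ1", "11XYZ11", "11XYZ", "XYZ1"]
conn.executemany(
    "insert into product_table values (?, ?)",
    list(enumerate(names, start=1)),
)

# Every (f.name, d.name) pair where d.name is embedded in f.name.
pairs = conn.execute(
    """
    select f.name, d.name
    from product_table f, product_table d
    where f.name like '%' || d.name || '%'
    order by f.name, d.name
    """
).fetchall()

# Which f.name values contain the fragment 'XYZ'?
contains_xyz = [f for f, d in pairs if d == "XYZ"]
# Which d.name fragments are found inside '1XYZ1'?
inside_1xyz1 = [d for f, d in pairs if f == "1XYZ1"]

print(contains_xyz)  # ['11XYZ', '11XYZ11', '1XYZ1', 'XYZ', 'XYZ1']
print(inside_1xyz1)  # ['1XYZ1', 'XYZ', 'XYZ1']
```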
