How do I write a SQL query which returns records which have similar names?
I have a database of many thousands of companies. The issue I have is that some of these employers are duplicates; however, they don't have exactly the same name (otherwise this would be quite easy to solve).
So we have companies like 'Wine Ltd', 'wine', 'wine holdings ltd', 'wine limited'. These companies may not be the same; however, I want to create a table which shows all of these similar companies so I can make the decision myself (and don't have to go through all of the records).
I am using PostgreSQL.
I have already used a query which searches for the first word of a company name, e.g.
select * from company where name like 'FIRSTWORD%'
But obviously this only helps me with one employer at a time, and it will take me many hours.
You could probably use the pg_trgm extension:
select * from company a join company b on a.name % b.name and a.id < b.id
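A slightly fuller sketch of the same idea, in case it helps. It assumes the company table has the id and name columns used in the query above; the index name and the 0.4 threshold are only illustrative choices, not recommendations.

```sql
-- Enable the extension once per database (requires sufficient privileges).
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Optional trigram index that the planner may use for the % operator.
-- The index name is hypothetical.
CREATE INDEX IF NOT EXISTS company_name_trgm_idx
    ON company USING gist (name gist_trgm_ops);

-- The % operator compares against pg_trgm.similarity_threshold (default 0.3).
SET pg_trgm.similarity_threshold = 0.4;

-- List each candidate pair once (a.id < b.id), best matches first.
SELECT a.id   AS id_a,
       a.name AS name_a,
       b.id   AS id_b,
       b.name AS name_b,
       similarity(a.name, b.name) AS score
FROM company a
JOIN company b
  ON a.name % b.name
 AND a.id < b.id
ORDER BY score DESC;
```

Raising the threshold gives fewer, closer pairs to review; lowering it gives more candidates and more noise.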
You will need some way to mark a pairing as already evaluated and found to be actually different; otherwise you will just keep reviewing the same proposed pairings over and over again.
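One possible way to track reviewed pairings is a small bookkeeping table. This is only a sketch: the table company_pair_review and its columns are hypothetical, and it assumes company.id is an integer.

```sql
-- Hypothetical bookkeeping table: one row per pair that has been reviewed.
CREATE TABLE IF NOT EXISTS company_pair_review (
    id_a         integer NOT NULL,
    id_b         integer NOT NULL,
    is_duplicate boolean NOT NULL,   -- the decision made during review
    PRIMARY KEY (id_a, id_b),
    CHECK (id_a < id_b)              -- same ordering as the pairing query
);

-- Propose only the pairs that have not been reviewed yet.
SELECT a.id AS id_a, a.name AS name_a,
       b.id AS id_b, b.name AS name_b
FROM company a
JOIN company b
  ON a.name % b.name
 AND a.id < b.id
WHERE NOT EXISTS (
    SELECT 1
    FROM company_pair_review r
    WHERE r.id_a = a.id
      AND r.id_b = b.id
);
```

After deciding on a pair, insert a row into company_pair_review so the pair drops out of the next run.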
I'm not 100% sure I understand your question correctly. So if I'm on the wrong track, please add more details to your question and explain the exact logic to find out whether strings are "similar".
As the question currently stands and as I read your requirements, it seems like you consider strings similar if they start with the same string.
In your example, all of those "similar" companies start with the same 4 letters.
So let's assume this can be used as a general rule in the query you are looking for.
In this case, we can use STRING_AGG to build a comma-separated list of similar companies and GROUP BY their first 4 letters. To find the first 4 letters, we can use LEFT and LOWER (to ignore their case).
So the query could be something like this:
SELECT LEFT(LOWER(name),4) AS commonPart,
STRING_AGG (name, ',') AS companies
FROM company
GROUP BY LEFT(LOWER(name),4)
ORDER BY LEFT(LOWER(name),4);
Furthermore, if we want to exclude common prefixes that occur fewer than 2 times, fewer than 4 times, or any other number of times, we can add a HAVING clause.
Here is an example that only fetches such "similar" companies where a "group" includes at least 4 companies:
SELECT LEFT(LOWER(name),4) AS commonPart,
STRING_AGG (name, ',') AS companies
FROM company
GROUP BY LEFT(LOWER(name),4)
HAVING COUNT(*) > 3
ORDER BY LEFT(LOWER(name),4);
Here we can try out this idea with some sample data: db<>fiddle
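If the fiddle is not available, something like the following could be used to try the query locally. It is a sketch only: company_demo is a hypothetical temporary table, and the rows are the names from the question plus two made-up ones.

```sql
-- Hypothetical scratch table so the real company table is left untouched.
CREATE TEMP TABLE company_demo (
    id   serial PRIMARY KEY,
    name text NOT NULL
);

INSERT INTO company_demo (name) VALUES
    ('Wine Ltd'),
    ('wine'),
    ('wine holdings ltd'),
    ('wine limited'),
    ('Winery Co'),    -- made up: also lands in the 'wine' group (same first 4 letters)
    ('Cold Water');   -- made up: a group of one, filtered out by HAVING

-- Same grouping as above; ORDER BY inside STRING_AGG makes the list order stable.
SELECT LEFT(LOWER(name), 4)                  AS commonPart,
       STRING_AGG(name, ',' ORDER BY name)   AS companies
FROM company_demo
GROUP BY LEFT(LOWER(name), 4)
HAVING COUNT(*) > 3
ORDER BY LEFT(LOWER(name), 4);
```

The 'Winery Co' row shows the other side of the trade-off: a fixed prefix can also group names that are probably unrelated.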
It should be clear that such a query will not work perfectly because, for example, a company "cold water" would not be matched with "water cold". Covering all such similarities will be extremely difficult, if it is possible at all.
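To see the limitation on that exact pair, here is a small comparison sketch. It assumes the pg_trgm extension from the other answer can be installed; since pg_trgm builds its trigrams word by word, the reordered name should still score as similar there, while the prefix comparison fails.

```sql
-- Assumes the pg_trgm extension is available in this database.
CREATE EXTENSION IF NOT EXISTS pg_trgm;

SELECT LEFT(LOWER('cold water'), 4) = LEFT(LOWER('water cold'), 4) AS same_prefix,
       similarity('cold water', 'water cold')                     AS trigram_score;
```

Neither approach decides whether two companies really are the same; both only narrow down what has to be checked by hand.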