
How do I write a SQL query that returns records with similar names?

I have a database of many thousands of companies. The issue I have is that some of these companies are duplicates; however, they don't have exactly the same name (otherwise this would be quite easy to solve).

So we have companies like 'Wine Ltd', 'wine', 'wine holdings ltd', 'wine limited'. These companies may not be the same; however, I want to create a table which shows all of these similar companies so I can make the decision myself (and don't have to go through all of the records).

I am using PostgreSQL.

I have already used a query which searches for the first word of a company name, e.g.

select * from company where name like 'FIRSTWORD%'

But obviously this only helps me with one company at a time, and this will take me many hours.

You could probably use the pg_trgm extension.

select * from company a join company b on a.name % b.name and a.id < b.id
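Before that query can run, the extension has to be installed in the database. What follows is only a minimal sketch, assuming the table is company(id, name) as in the question and that you are allowed to create extensions. The 0.4 cutoff and the index name are arbitrary example values, not something from the question.

```sql
-- Install the trigram extension once per database (needs sufficient privileges).
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Optional: a trigram index may speed up the % operator on larger tables.
-- The index name is made up for this example.
CREATE INDEX IF NOT EXISTS company_name_trgm_idx
    ON company USING gin (name gin_trgm_ops);

-- List candidate duplicate pairs with their similarity score, best matches first.
SELECT a.id   AS id_a,
       a.name AS name_a,
       b.id   AS id_b,
       b.name AS name_b,
       similarity(a.name, b.name) AS score
FROM company a
JOIN company b
  ON a.id < b.id                          -- report each pair only once
 AND a.name % b.name                      -- true above pg_trgm.similarity_threshold (0.3 by default)
WHERE similarity(a.name, b.name) >= 0.4   -- assumed cutoff; tune it on your own data
ORDER BY score DESC;
```

Sorting by score puts the most likely duplicates at the top, so you can stop reviewing once the proposals stop looking plausible.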

You will need some way to mark a pairing as already evaluated and found to be actually different, otherwise you will just keep reviewing the same proposed pairings over and over again.
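One possible way to keep track of reviewed pairings is a small bookkeeping table. This is a sketch only: the table company_pair_review and its columns are hypothetical names, it assumes company.id is an integer, and it assumes pg_trgm is already installed.

```sql
-- Hypothetical bookkeeping table: one row per pair that has been reviewed.
CREATE TABLE IF NOT EXISTS company_pair_review (
    id_a         integer NOT NULL,   -- smaller company id of the pair
    id_b         integer NOT NULL,   -- larger company id of the pair
    same_company boolean NOT NULL,   -- the reviewer's verdict
    PRIMARY KEY (id_a, id_b),
    CHECK (id_a < id_b)              -- same ordering as the join condition below
);

-- Only propose pairs that have not been reviewed yet.
SELECT a.id   AS id_a,
       a.name AS name_a,
       b.id   AS id_b,
       b.name AS name_b
FROM company a
JOIN company b
  ON a.id < b.id
 AND a.name % b.name
WHERE NOT EXISTS (
    SELECT 1
    FROM company_pair_review r
    WHERE r.id_a = a.id
      AND r.id_b = b.id
);

-- After looking at a pair, record the decision (the ids here are placeholders).
INSERT INTO company_pair_review (id_a, id_b, same_company)
VALUES (1, 2, false);
```

Storing the verdict rather than just a "seen" flag means the same table can later drive the actual merge of confirmed duplicates.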

I'm not 100% sure whether I understand your question correctly. So if I'm on the wrong track, please add more details to your question and explain the exact logic for deciding whether strings are "similar".

As the question currently stands, and as I read your requirements, it seems you consider strings similar if they start with the same string.

In your example, all of those "similar" companies start with the same 4 letters (ignoring case).

So let's assume this can be used as a general rule in the query you are looking for.

In this case, we can use STRING_AGG to build a comma-separated list of similar companies and GROUP BY their first 4 letters. To find the first 4 letters, we can use LEFT and LOWER (to ignore their case).

So the query could be something like this:

SELECT LEFT(LOWER(name),4) AS commonPart, 
STRING_AGG (name, ',') AS companies
FROM company
GROUP BY LEFT(LOWER(name),4)
ORDER BY LEFT(LOWER(name),4);

If we furthermore want to exclude groups that occur fewer than 2 times, fewer than 4 times, or any other number of times, we can add a HAVING clause.

Here is an example that only fetches such "similar" companies where a "group" includes at least 4 companies:

SELECT LEFT(LOWER(name),4) AS commonPart, 
STRING_AGG (name, ',') AS companies
FROM company
GROUP BY LEFT(LOWER(name),4)
HAVING COUNT(*) > 3
ORDER BY LEFT(LOWER(name),4);

Here we can try out this idea with some sample data: db<>fiddle
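Since the question mentions searching by the first word, the same grouping idea can be sketched with the first word instead of a fixed 4 letters. This is a variant I am assuming might fit, not something the question asked for literally. It uses split_part to take everything before the first space and btrim to ignore stray leading or trailing spaces.

```sql
SELECT split_part(lower(btrim(name)), ' ', 1) AS first_word,
       string_agg(name, ', ' ORDER BY name)   AS companies,
       count(*)                               AS group_size
FROM company
GROUP BY split_part(lower(btrim(name)), ' ', 1)
HAVING count(*) > 1                 -- keep only words shared by at least 2 companies
ORDER BY group_size DESC, first_word;
```

With the names from the question, 'Wine Ltd', 'wine', 'wine holdings ltd' and 'wine limited' would all end up in the group 'wine', while a name such as 'Winery Ltd' would not, unlike with the 4-letter rule.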

It should be clear that such a query will not work perfectly because, for example, a company "cold water" would not be matched with "water cold". Covering all such similarities will be extremely difficult, if it is possible at all.
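The reordered-words case on its own can be approached by grouping on a normalized key in which the words of each name are lower-cased and sorted. The following is a rough sketch under that assumption; it only catches names that consist of exactly the same words, so it would still not match 'wine ltd' with 'wine limited'.

```sql
SELECT key_words,
       string_agg(name, ', ' ORDER BY name) AS companies
FROM (
    SELECT name,
           -- Split the name on whitespace and glue the words back together in sorted order.
           (SELECT string_agg(w, ' ' ORDER BY w)
            FROM regexp_split_to_table(lower(btrim(name)), '\s+') AS w) AS key_words
    FROM company
) AS normalized
GROUP BY key_words
HAVING count(*) > 1                 -- keep only keys shared by at least 2 companies
ORDER BY key_words;
```

Here "cold water" and "water cold" both get the key 'cold water' and therefore land in the same group.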


