
Processing natural language for SQL queries

Original post: 2022-07-30 15:07:58. Tags: c#, sql, postgresql, aws-lambda, nlp

I work in C# (Entity Framework) and PostgreSQL, but I'm not opposed to using Python or even JavaScript. I want to be able to process searches that produce relevant results. For example, let's say I have a row in a very large database where display_name is Mike's® Discount Auto, and I want users to be able to find it by searching in a variety of ways. I've been using LINQ and Levenshtein distance, but I can't seem to get it quite right. For the example above, I want the following searches to actually find Mike's® Discount Auto:

Mike's
Mikes
Mike's®
Mikes®
Miikes
Mikes discount auto
discount auto

And so on. Each of my strategies seems to work OK, but there are huge gaps. I use regex to remove non-alphanumeric characters and Levenshtein distance to catch misspellings, but even those two strategies won't work effectively if someone types in Mikes, because the Levenshtein distance is very high compared to something like Bobs discount auto. For the second example the distance is lower, but it is obviously not the correct match. Plus, the more things I add, the slower the search becomes. Right now, with a database of ~330,000 rows, it takes almost a full minute to go from the HTTP request -> Lambda -> database -> back to the client. That's not acceptable. My Lambda most definitely needs to be faster, but it's my code that is really slowing it down.

I'm looking for any resources on how to handle this effectively (e.g. books, websites, courses on Udemy).

1 Answer

Have you tried pg_trgm? It is not particularly intelligent (it doesn't understand parts of speech, synonyms, or conjugation, for example), but neither is Levenshtein, and pg_trgm is usually much, much faster because it can use an index.
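
To make the trigram idea concrete, here is a rough Python sketch of how a trigram similarity score can be computed. It only approximates what pg_trgm does (lowercase the text, split on non-alphanumeric characters, pad each word with two leading spaces and one trailing space, then compare the sets of three-character substrings). The names trigrams and similarity below are my own choices for illustration, and real pg_trgm behaviour can differ depending on the database locale. In Postgres itself you would rely on the extension's similarity() function or % operator backed by a GIN or GiST trigram index, not on application code like this.

```python
import re


def trigrams(text):
    """Approximate pg_trgm trigram extraction (assumption: ASCII letters/digits only)."""
    grams = set()
    # Non-alphanumeric characters (apostrophes, the ® sign, spaces) act as word breaks.
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "  # two leading spaces, one trailing space
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def similarity(a, b):
    """Shared trigrams divided by total distinct trigrams (0.0 to 1.0)."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


if __name__ == "__main__":
    # Hypothetical candidate rows, taken from the names used in the question.
    candidates = ["Mike's® Discount Auto", "Mikes Discount Auto", "Bobs discount auto"]
    for query in ["Mikes", "Miikes", "discount auto"]:
        ranked = sorted(candidates, key=lambda name: similarity(query, name), reverse=True)
        print(query, "->", ranked)
```

Note that in this sketch the apostrophe in Mike's splits the word into "mike" and "s", so stripping apostrophes before indexing (as the regex cleanup in the question already does) may give better scores for a query like Mikes.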

And neither one will have problems identifying "Mikes" as being more similar to 'Mikes Discount Auto' than to 'Bobs discount auto', so your example doesn't make sense.
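
As a quick sanity check of that claim for Levenshtein, here is a small Python sketch using the textbook dynamic-programming edit distance. The function name levenshtein and the lowercasing step are my own assumptions, not the asker's code. The point it illustrates is that the absolute distance from "mikes" to the full name is large, but the ordering of candidates still comes out right, so ranking by distance works better than comparing against a fixed cutoff.

```python
def levenshtein(a, b):
    """Classic edit distance: insertions, deletions and substitutions each cost 1."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    query = "Mikes".lower()
    # Names from the question, lowercased so case differences don't count as edits.
    for name in ["Mikes Discount Auto", "Bobs discount auto"]:
        print(name, levenshtein(query, name.lower()))
```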