
Postgres pg_trgm: how to compare similarity for an array of strings

I'm attempting to use pg_trgm for fuzzy string matching, and I know it can be used like this:

SELECT * FROM artists WHERE SIMILARITY(name, 'Claud Monay') > 0.4;

where the similarity is compared against an arbitrary scalar threshold. However, I've also seen trigram matching used with an array of strings like this:

SELECT * FROM artists WHERE 'Cadinsky' % ANY(STRING_TO_ARRAY(name, ' '));

This uses the % operator, which is shorthand for comparing the similarity against the default threshold of 0.3. I'm trying to find the proper syntax to use ANY(STRING_TO_ARRAY(...)) as in the second form, but with an arbitrary scalar threshold as in the first form.

This is most likely just a simple question of using the ANY syntax properly, but I can't work out what the correct form is.

There is no syntax for using ANY with three arguments (the string, the array of strings, and the similarity threshold). The way to do it is to set pg_trgm.similarity_threshold to the value you want instead of the default of 0.3, and then use % ANY.
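As a minimal sketch of that approach, assuming the pg_trgm extension is installed and using the artists table from the question (the 0.4 value is only an illustration):

```sql
-- Assumes: CREATE EXTENSION pg_trgm has been run, and the artists table from the question.
-- This changes the threshold used by every later % comparison in the session.
SET pg_trgm.similarity_threshold = 0.4;

SELECT *
FROM artists
WHERE 'Cadinsky' % ANY(STRING_TO_ARRAY(name, ' '));

-- Go back to the configured default afterwards.
RESET pg_trgm.similarity_threshold;
```

SHOW pg_trgm.similarity_threshold; can be used to check which value is currently in effect.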

If you want to use different thresholds in different parts of the same query, you are out of luck with the ANY construct.

You can always define your own function, but you will probably not be able to get it to use an index.

-- Returns the highest similarity between $1 and any element of the array $2.
create or replace function most_similar(text, text[]) returns double precision
language sql as $$
    select max(similarity($1, x)) from unnest($2) f(x)
$$;

SELECT * FROM artists WHERE most_similar('Cadinsky', STRING_TO_ARRAY(name, ' '))>0.4;

I am not a DB expert, nor am I good at SQL, but here is my solution.

I basically use the unnest() function. With it, I can iterate over the array, compute the similarity for each item, and compare it to the similarity input, which is a float.
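Applied directly to the query from the question, the same idea might look like the sketch below. It assumes the pg_trgm extension and the artists table from the question; the alias word and the 0.4 threshold are made up for illustration.

```sql
-- Assumes: pg_trgm is installed and artists(name) exists as in the question.
-- Keep an artist if at least one word of the name is similar enough.
SELECT *
FROM artists
WHERE EXISTS (
    SELECT 1
    FROM unnest(STRING_TO_ARRAY(name, ' ')) AS word
    WHERE SIMILARITY(word, 'Cadinsky') > 0.4  -- explicit, per-expression threshold
);
```

Because the threshold is an ordinary literal in the expression, different parts of a query can use different values. As with any function-based comparison, this form probably won't use a trigram index.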

As far as I know, set pg_trgm.similarity_threshold=0.6; changes the threshold for the whole session (every % comparison is affected), not for a single expression. The question specifically asks for an explicit threshold.
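If the concern is only that the setting leaks into later queries, one option is to confine it to a transaction with SET LOCAL. This is a sketch under the same assumptions (pg_trgm installed, the artists table from the question); it still applies one threshold to every % comparison inside the transaction, so it does not give a per-expression threshold.

```sql
BEGIN;

-- SET LOCAL lasts only until the end of the current transaction.
SET LOCAL pg_trgm.similarity_threshold = 0.6;

SELECT *
FROM artists
WHERE 'Cadinsky' % ANY(STRING_TO_ARRAY(name, ' '));

COMMIT;  -- the threshold reverts to its previous value here
```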

Also, if you create a function to do the job and the function is declared STABLE rather than VOLATILE, you cannot run set pg_trgm.similarity_threshold inside it. (At least that is what happened to me.)

Caution: I didn't compare the performance of my approach with that of the ANY approach.

Example code:

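-- The return type, LANGUAGE clause, and STABLE marker were missing from the original
-- snippet; SETOF your_table_name is assumed because the body selects whole rows.
CREATE OR REPLACE FUNCTION your_function_name (input text, similarity float)
RETURNS SETOF your_table_name
LANGUAGE sql STABLE
AS $function$
    SELECT * FROM your_table_name
    WHERE EXISTS
       (SELECT
           FROM unnest(ARRAY['item','anotherItem', 'third-ish']) element
           -- SIMILARITY(...) is the pg_trgm function; bare similarity is the parameter
           WHERE SIMILARITY (input, element) > similarity
       );
$function$;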
