Handling typos with to_tsvector and to_tsquery in PostgreSQL

I have a simple table with these fields (shown as an image in the original post).

The last two fields are for indexing: one has the tsvector data type and the other has the text data type.

I want to query on name or id. This is what I am doing:

SELECT * FROM foo WHERE foo.searchtext @@ to_tsquery('1234 & abcd');  

It works fine, but now I want the search to tolerate typos. For example, if the name is abcd and I type abbd, it should still return all possible matches. I have looked at pg_trgm, but it does not work with integers or tsvector.

I have also tried pg_trgm another way: I stored my index in another field, searchtextstring, of type text, and queried it like this:

SELECT *
  FROM foo
 WHERE searchtextstring % '123' AND searchtextstring % 'abbd';

but I don't think this is efficient, and it does not work for typos either.

So how can I handle typos with to_tsquery?

Thanks

Full text search only ignores differences in stemming and capitalization; it won't allow you to find matches based on similarity.
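To make that concrete, here is a minimal sketch that assumes the built-in 'simple' text search configuration (chosen only so that stemming does not interfere; the question does not say which configuration is in use). The column aliases are illustrative. The exact token matches, while the misspelled one does not:

```sql
-- 'simple' does no stemming, so tokens are compared literally
-- (after lower-casing).
SELECT to_tsvector('simple', 'abcd 1234') @@ to_tsquery('simple', 'abcd') AS exact_match,
       to_tsvector('simple', 'abcd 1234') @@ to_tsquery('simple', 'abbd') AS typo_match;
```

The first column should be true and the second false: to_tsquery has no notion of "close enough".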

pg_trgm is the way to go.

I use this sample table:

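-- pg_trgm provides the % operator and the gist_trgm_ops operator class
CREATE EXTENSION IF NOT EXISTS pg_trgm;

CREATE TABLE foo (id integer PRIMARY KEY, searchtextstring text);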

INSERT INTO foo VALUES (1, 'something 0987');
INSERT INTO foo VALUES (2, 'abbd 1224');

CREATE INDEX ON foo USING gist (searchtextstring gist_trgm_ops);

This table is so small that PostgreSQL will always use a sequential scan, so let's force PostgreSQL to use an index if possible (so that we can simulate a larger table):

SET enable_seqscan = off;

Now let's query:

EXPLAIN (COSTS off)
   SELECT * FROM foo WHERE searchtextstring % '1234'
                       AND searchtextstring % 'abcd';

                       QUERY PLAN                                        
--------------------------------------------------------
 Index Scan using foo_searchtextstring_idx on foo
   Index Cond: ((searchtextstring % '1234'::text)
            AND (searchtextstring % 'abcd'::text))
(2 rows)

The index is used quite well, with a single index scan!

But the query returns no rows:

SELECT * FROM foo WHERE searchtextstring % '1234'
                    AND searchtextstring % 'abcd';

 id | searchtextstring 
----+------------------
(0 rows)

That is not because “it is not working”, but because the words are not similar enough. Don't forget that a four-letter word contains only a handful of trigrams, so if you change one letter, the strings are not so similar any more. That's not surprising, right?
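The numbers behind this can be inspected directly with pg_trgm's show_trgm and similarity functions. This is a small sketch against the literal values from the sample table; the column aliases are illustrative only:

```sql
-- Requires the pg_trgm extension (CREATE EXTENSION pg_trgm;).
-- pg_trgm pads each word with two leading spaces and one trailing space
-- before cutting it into three-character pieces.
SELECT show_trgm('abcd')               AS query_trigrams,
       similarity('abbd 1224', 'abcd') AS sim_abcd,
       similarity('abbd 1224', '1234') AS sim_1234;
```

'abcd' yields five trigrams and 'abbd 1224' yields ten. Each search term shares only two trigrams with the stored string, so each similarity should come out at 2 / (5 + 10 - 2) = 2/13, roughly 0.15, which is below the default pg_trgm.similarity_threshold of 0.3.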

So we have to lower the similarity threshold to get a result:

SET pg_trgm.similarity_threshold = 0.1;

SELECT * FROM foo WHERE searchtextstring % '1234'
                    AND searchtextstring % 'abcd';

 id | searchtextstring 
----+------------------
  2 | abbd 1224
(1 row)
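A threshold as low as 0.1 will also let through weak matches on a larger table. One possible variation, offered as a sketch rather than as part of the original answer, is to scope the setting to a single transaction with SET LOCAL and to return the scores so that the best matches sort first. The aliases sim_1234 and sim_abcd are illustrative names:

```sql
BEGIN;

-- SET LOCAL reverts automatically at COMMIT or ROLLBACK,
-- so other queries in the session keep the default threshold.
SET LOCAL pg_trgm.similarity_threshold = 0.1;

SELECT id,
       searchtextstring,
       similarity(searchtextstring, '1234') AS sim_1234,
       similarity(searchtextstring, 'abcd') AS sim_abcd
  FROM foo
 WHERE searchtextstring % '1234'
   AND searchtextstring % 'abcd'
 ORDER BY similarity(searchtextstring, '1234')
        + similarity(searchtextstring, 'abcd') DESC;

COMMIT;
```

With the two sample rows this should still return only the row with id 2; the ordering only starts to matter once several rows pass the threshold.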

Notice: technical posts on this site are published under the CC BY-SA 4.0 license. If you repost this content, please cite this site's URL or the original address.
