“to_tsquery” on tsvector yields different results when using “simple” and “english”?
I've been enlisted to help on a project and I'm diving back into PostgreSQL after not working with it for several years. Lack of use aside, I've never run into tsvector fields before and now find myself facing a bug based on them. I read the documentation on the field type and its purpose, but I'm having a hard time digging up documentation on how 'simple' differs from 'english' as the first parameter to to_tsquery().
Example
> SELECT to_tsvector('mortgag') @@ to_tsquery('simple', 'mortgage')
?column?
----------
f
(1 row)
> SELECT to_tsvector('mortgag') @@ to_tsquery('english', 'mortgage')
?column?
----------
t
(1 row)
I would think they should both return true, but obviously the first does not - why?
The FTS utilizes dictionaries to normalize the text:
12.6. Dictionaries
Dictionaries are used to eliminate words that should not be considered in a search (stop words), and to normalize words so that different derived forms of the same word will match. A successfully normalized word is called a lexeme.
So dictionaries are used to throw out things that are too common or meaningless to consider in a search (stop words) and to normalize everything else so city and cities, for example, will match even though they're different words.
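To make the stop word and normalization behaviour concrete, here is a small sketch that runs the same sentence through both configurations. The sample sentence is made up for illustration, and the outputs in the comments assume a stock PostgreSQL installation with the built-in 'english' and 'simple' configurations left unmodified.

```sql
-- 'english': stop words (the, and, a) are dropped, and
-- "cities" and "city" are reduced to the same lexeme.
SELECT to_tsvector('english', 'The cities and a city');
-- 'citi':2,5

-- 'simple': every word is kept, only lower-cased.
SELECT to_tsvector('simple', 'The cities and a city');
-- 'a':4 'and':3 'cities':2 'city':5 'the':1
```

The numbers after each lexeme are word positions in the original text, which is why the positions of the discarded stop words are skipped in the first result.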
Let us look at some output from ts_debug and see what's going on with the dictionaries:
=> select * from ts_debug('english', 'mortgage');
alias | description | token | dictionaries | dictionary | lexemes
-----------+-----------------+----------+----------------+--------------+-----------
asciiword | Word, all ASCII | mortgage | {english_stem} | english_stem | {mortgag}
=> select * from ts_debug('simple', 'mortgage');
alias | description | token | dictionaries | dictionary | lexemes
-----------+-----------------+----------+--------------+------------+------------
asciiword | Word, all ASCII | mortgage | {simple} | simple | {mortgage}
Notice that simple uses the simple dictionary whereas english uses the english_stem dictionary.
The simple dictionary:
operates by converting the input token to lower case and checking it against a file of stop words. If it is found in the file then an empty array is returned, causing the token to be discarded. If not, the lower-cased form of the word is returned as the normalized lexeme.
The simple dictionary just throws out stop words (when it has been given a stop word file), downcases, and that's about it. We can see its simplicity ourselves:
=> select to_tsquery('simple', 'Mortgage'), to_tsquery('simple', 'Mortgages');
to_tsquery | to_tsquery
------------+-------------
'mortgage' | 'mortgages'
The simple dictionary is too simple to even handle simple plurals.
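If you want to poke at a single dictionary rather than a whole configuration, ts_lexize() takes a dictionary name and one token and returns the resulting lexemes. The sketch below assumes the built-in simple and english_stem dictionaries are unmodified; the tokens are just examples.

```sql
-- The simple dictionary only lower-cases, so the plural survives.
SELECT ts_lexize('simple', 'Mortgages');
-- {mortgages}

-- The Snowball-based dictionary lower-cases and stems.
SELECT ts_lexize('english_stem', 'Mortgages');
-- {mortgag}

-- A stop word comes back as an empty array, meaning "discard this token".
SELECT ts_lexize('english_stem', 'the');
-- {}
```

Note that ts_lexize() expects a single token, not free text, so it is a debugging aid rather than a substitute for to_tsvector() or to_tsquery().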
So what is this english_stem dictionary all about? The "stem" suffix is a giveaway: this dictionary applies a stemming algorithm to words to convert (for example) city and cities to the same thing. From the fine manual:
12.6.6. Snowball Dictionary
The Snowball dictionary template is based on a project by Martin Porter, inventor of the popular Porter's stemming algorithm for the English language. [...] Each algorithm understands how to reduce common variant forms of words to a base, or stem, spelling within its language.
And just below that we see the english_stem dictionary:
CREATE TEXT SEARCH DICTIONARY english_stem ( TEMPLATE = snowball, Language = english, StopWords = english );
So the english_stem dictionary stems words and we can see that happen:
=> select to_tsquery('english', 'Mortgage'), to_tsquery('english', 'Mortgages');
to_tsquery | to_tsquery
------------+------------
'mortgag' | 'mortgag'
Executive Summary: 'simple' implies simple-minded literal matching, 'english' applies stemming to (hopefully) produce better matching. The stemming turns mortgage into mortgag, which is exactly the lexeme stored in your tsvector, and that gives you your match.
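In practice this means the document side and the query side should be processed with the same configuration. The sketch below shows matching and mismatched pairs. It is an assumption on my part that the one-argument to_tsvector() in the question fell back to default_text_search_config and that this is set to english on your server; the SHOW statement lets you check.

```sql
-- Which configuration do the one-argument forms use?
SHOW default_text_search_config;

-- Same configuration on both sides: the lexemes line up.
SELECT to_tsvector('english', 'mortgage') @@ to_tsquery('english', 'mortgages');
-- t  ('mortgag' on both sides)

SELECT to_tsvector('simple', 'mortgage') @@ to_tsquery('simple', 'mortgage');
-- t  ('mortgage' on both sides)

-- Mixed configurations: literal 'mortgage' versus stemmed 'mortgag'.
SELECT to_tsvector('simple', 'mortgage') @@ to_tsquery('english', 'mortgage');
-- f
```

Passing the configuration explicitly on both sides also keeps the result from changing if someone later alters default_text_search_config.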