"to_tsquery" on tsvector yields different results when using "simple" and "english"?

I've been enlisted to help on a project and I'm diving back into PostgreSQL after not working with it for several years. Lack of use aside, I've never run into tsvector fields before and now find myself facing a bug based on them. I read the documentation on the field type and its purpose, but I'm having a hard time digging up documentation on how 'simple' differs from 'english' as the first parameter to to_tsquery().

Example

> SELECT to_tsvector('mortgag') @@ to_tsquery('simple', 'mortgage')
?column? 
----------
 f
(1 row)

> SELECT to_tsvector('mortgag') @@ to_tsquery('english', 'mortgage')
?column? 
----------
 t
(1 row)

I would think they should both return true, but obviously the first does not - why?

The FTS utilizes dictionaries to normalize the text:

12.6. Dictionaries

Dictionaries are used to eliminate words that should not be considered in a search (stop words), and to normalize words so that different derived forms of the same word will match. A successfully normalized word is called a lexeme.

So dictionaries are used to throw out things that are too common or meaningless to consider in a search (stop words) and to normalize everything else so that city and cities, for example, will match even though they're different words.

Let us look at some output from ts_debug and see what's going on with the dictionaries:

=> select * from ts_debug('english', 'mortgage');
   alias   |   description   |  token   |  dictionaries  |  dictionary  |  lexemes  
-----------+-----------------+----------+----------------+--------------+-----------
 asciiword | Word, all ASCII | mortgage | {english_stem} | english_stem | {mortgag}

=> select * from ts_debug('simple', 'mortgage');
   alias   |   description   |  token   | dictionaries | dictionary |  lexemes   
-----------+-----------------+----------+--------------+------------+------------
 asciiword | Word, all ASCII | mortgage | {simple}     | simple     | {mortgage}

Notice that simple uses the simple dictionary whereas english uses the english_stem dictionary.

The simple dictionary:

operates by converting the input token to lower case and checking it against a file of stop words. If it is found in the file then an empty array is returned, causing the token to be discarded. If not, the lower-cased form of the word is returned as the normalized lexeme.

The simple dictionary just throws out stop words and downcases everything else, and that's about it. We can see its simplicity ourselves:

=> select to_tsquery('simple', 'Mortgage'), to_tsquery('simple', 'Mortgages');
 to_tsquery | to_tsquery  
------------+-------------
 'mortgage' | 'mortgages'

The simple dictionary is too simple to even handle simple plurals.
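As a rough illustration of the behaviour quoted above, here is a minimal Python sketch of what the simple dictionary does to one token. It is a toy model, not PostgreSQL's implementation: the function name `simple_dictionary` and the tiny `STOP_WORDS` set are made up for this example, and a real configuration reads its stop words from whatever file (if any) it was set up with.

```python
# Toy model of the "simple" dictionary described above; not PostgreSQL's code.
# STOP_WORDS is a tiny hypothetical stand-in for a real stop word file.
STOP_WORDS = {"the", "a", "and"}


def simple_dictionary(token):
    """Return the list of lexemes for one token (empty for a stop word)."""
    lowered = token.lower()
    if lowered in STOP_WORDS:
        return []        # stop word: the token is discarded
    return [lowered]     # otherwise the lower-cased token is the lexeme


print(simple_dictionary("Mortgage"))   # ['mortgage']
print(simple_dictionary("Mortgages"))  # ['mortgages']
print(simple_dictionary("The"))        # []
```

Nothing in this model relates 'mortgage' to 'mortgages', which is the same limitation the to_tsquery('simple', ...) output shows.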

So what is this english_stem dictionary all about? The "stem" suffix is a giveaway: this dictionary applies a stemming algorithm to words to convert (for example) city and cities to the same thing. From the fine manual:

12.6.6. Snowball Dictionary

The Snowball dictionary template is based on a project by Martin Porter, inventor of the popular Porter's stemming algorithm for the English language. [...] Each algorithm understands how to reduce common variant forms of words to a base, or stem, spelling within its language.

And just below that we see the english_stem dictionary:

CREATE TEXT SEARCH DICTIONARY english_stem (
    TEMPLATE = snowball,
    Language = english,
    StopWords = english
);

So the english_stem dictionary stems words and we can see that happen:

=> select to_tsquery('english', 'Mortgage'), to_tsquery('english', 'Mortgages');
 to_tsquery | to_tsquery 
------------+------------
 'mortgag'  | 'mortgag'

Executive Summary: 'simple' implies simple-minded literal matching, 'english' applies stemming to (hopefully) produce better matching. The stemming turns mortgage into mortgag and that gives you your match.
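To tie this back to the two queries in the question, the following Python sketch models the match as a plain lexeme lookup. The query lexemes are copied from the ts_debug output earlier in this answer; the helper `matches` and the assumption that to_tsvector('mortgag') stores the single lexeme 'mortgag' are illustrative only, and the real `@@` operator does considerably more than this.

```python
# Lexemes copied from the ts_debug output shown earlier in this answer.
QUERY_LEXEME = {
    "simple": "mortgage",   # to_tsquery('simple', 'mortgage')
    "english": "mortgag",   # to_tsquery('english', 'mortgage')
}

# Assumption: to_tsvector('mortgag') stores the single lexeme 'mortgag'.
DOCUMENT_LEXEMES = {"mortgag"}


def matches(config):
    """Toy stand-in for `tsvector @@ tsquery` with a one-word query."""
    return QUERY_LEXEME[config] in DOCUMENT_LEXEMES


print(matches("simple"))   # False
print(matches("english"))  # True
```

The two results line up with the f and t rows in the question: the 'simple' query looks for the literal lexeme 'mortgage', which the vector does not contain.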
