How to prevent the PostgreSQL full text search parser from rewriting symbols to spaces?

My problem is that the PostgreSQL full text search parser treats symbols like '#' or '+' as space symbols (which is OK), so queries like 'C++', 'C#' or 'PL/SQL' are parsed like so:

 asciiword | Word, all ASCII | C     | {english_stem}        | english_stem | {c}
 blank     | Space symbols   | #     | {thesaurus_en,simple} | simple       | {#}
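
Output of this shape comes from `ts_debug`. A minimal way to reproduce it is sketched below, assuming the stock `english` configuration. The configuration in the question apparently also maps blank tokens to a `thesaurus_en` dictionary, so the `dictionaries` column may differ on a default install.

```sql
-- Show how the parser tokenizes 'C#' and which dictionary handles each token.
SELECT alias, description, token, dictionaries, dictionary, lexemes
FROM ts_debug('english', 'C#');
```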

I'm trying to find the best way to handle this kind of query. I've been trying to do it with the thesaurus dictionary, but it doesn't look like that could possibly work.

What I'm thinking of is something that rewrites "C#" to "CSHARP" while writing to the database (since I guess "C#" would be indexed as "C"), and something that does the same while searching.

I could possibly do it on my web application side, but that just doesn't seem right.

How would I handle that, or what PL/pgSQL triggers could I use for the approach I'm thinking of?

Well, you could write your own parser (in C), but that's probably more effort than you want to go to.

You could do something like:

to_tsvector('english', my_transformer(document_text)) 
...
to_tsquery('english', my_transformer(query_text))

You don't need to transform the actual literal document text, just the tsvector index and the query. You can do this in the index definition too (but my_transformer needs to be an immutable function).
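
A minimal sketch of that idea follows. Only `my_transformer` and `document_text` come from the snippet above; the `documents` table, the index name, and the particular replacement words (`cplusplus`, `csharp`, `plsql`) are assumptions made for illustration, not a definitive mapping.

```sql
-- Hypothetical table, used only for illustration.
CREATE TABLE documents (
    id            serial PRIMARY KEY,
    document_text text NOT NULL
);

-- Rewrite symbol-bearing terms into plain words the default parser keeps whole.
-- IMMUTABLE is required for use in an index expression.
-- \m and \M are PostgreSQL regex word-start and word-end anchors.
CREATE FUNCTION my_transformer(txt text)
RETURNS text
LANGUAGE sql
IMMUTABLE
AS $$
    SELECT regexp_replace(
             regexp_replace(
               regexp_replace(txt, '\mc\+\+', 'cplusplus', 'gi'),
               '\mc#', 'csharp', 'gi'),
             '\mpl/sql\M', 'plsql', 'gi');
$$;

-- Expression index over the transformed text; the stored text stays untouched.
CREATE INDEX documents_fts_idx ON documents
    USING gin (to_tsvector('english', my_transformer(document_text)));

-- The query must use the same expression for the index to be considered.
SELECT id
FROM documents
WHERE to_tsvector('english', my_transformer(document_text))
      @@ to_tsquery('english', my_transformer('C#'));
```

Because the same function is applied on both sides, it does not matter much which replacement words are chosen, as long as they are unlikely to appear in ordinary text. If the function body changes later, the index would need to be rebuilt.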

The question then becomes what the simplest/most efficient way to transform the incoming text is. If you're already using plperl/pltcl then you could probably do some clever regex replacement. If not, try several simpler regex replacements in plpgsql or even plain SQL. There are always fiddly corner cases with this sort of thing though, so make sure you test your replacements thoroughly.
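
As a rough illustration of the kind of corner-case check meant here, a single replacement can be run against a handful of inputs before it goes into a function. The pattern and the replacement word `cplusplus` are assumptions for the sake of the example.

```sql
-- Assumes standard_conforming_strings = on (the default), so backslashes
-- in the pattern literal reach the regex engine unchanged.
SELECT input,
       regexp_replace(input, '\mc\+\+', 'cplusplus', 'gi') AS replaced
FROM (VALUES
        ('C++'),
        ('c++ and more'),
        ('abc++'),   -- '\m' only matches at the start of a word, so this is left alone
        ('C++11')    -- becomes 'cplusplus11', one fused word
     ) AS t(input);
```

Whether a fused result like the last one is acceptable depends on the data; padding the replacement with spaces is one possible alternative.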

(Posted on behalf of the OP.)

For future reference, there's a great guide on creating a tsearch parser here: http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/HOWTO-parser-tsearch2.html

Anyway, the solution suggested by Richard works just fine and required much less effort.
