
Apache Lucene Multiple Tokenizers

I'm new to Lucene and have been trying to figure out a way to get an analyzer working.

I'd like my search string to be first split by a whitespace tokenizer, run through a KeywordRepeatFilter, and then the non-keywords to be split by a standard analyzer.

Ex. "This-is some text" -> "this" "is" "this-is" "some" "text"

The WhitespaceAnalyzer alone wasn't working for what I wanted, so I started to try this. Is there a way I can do this, or should I try something different?

You can only define one Tokenizer for an Analyzer. After the Tokenizer, further modifications to the TokenStream are made using TokenFilters. WordDelimiterFilter might be what you are looking for.

You can use WordDelimiterFilter to achieve what you want.
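
For concreteness, here is a minimal sketch of an Analyzer that chains a whitespace Tokenizer with a WordDelimiterFilter. It is written against the Lucene 4.4 API that the docs linked below describe; the class name HyphenSplittingAnalyzer and the trailing LowerCaseFilter are my own additions for illustration, not something from the original post:

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter;
import org.apache.lucene.util.Version;

// Hypothetical analyzer: whitespace tokenizer -> WordDelimiterFilter -> lowercase.
public class HyphenSplittingAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        Tokenizer source = new WhitespaceTokenizer(Version.LUCENE_44, reader);
        // GENERATE_WORD_PARTS emits "This" and "is" from "This-is";
        // PRESERVE_ORIGINAL additionally keeps "This-is" itself as a token.
        int flags = WordDelimiterFilter.GENERATE_WORD_PARTS
                | WordDelimiterFilter.PRESERVE_ORIGINAL;
        TokenStream result = new WordDelimiterFilter(source, flags, null);
        result = new LowerCaseFilter(Version.LUCENE_44, result);
        return new TokenStreamComponents(source, result);
    }
}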

This is from the Lucene docs:

WordDelimiterFilter splits words into subwords and performs optional transformations on subword groups. Words are split into subwords with the following rules: split on intra-word delimiters (by default, all non-alphanumeric characters).

The WordDelimiterFilter has a PRESERVE_ORIGINAL property which will split "This-is some text" into the following tokens:

this, is, this-is, some, text
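
To check the output, you can run a string through the analyzer and print each token. The snippet below assumes the HyphenSplittingAnalyzer sketched above (any Analyzer works in its place) and uses the Lucene 4.x Analyzer.tokenStream(String, String) overload:

import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class TokenDump {
    public static void main(String[] args) throws IOException {
        Analyzer analyzer = new HyphenSplittingAnalyzer();
        try (TokenStream ts = analyzer.tokenStream("content", "This-is some text")) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();                        // must be called before incrementToken()
            while (ts.incrementToken()) {
                System.out.println(term.toString());
            }
            ts.end();                          // finish consuming the stream
        }
        analyzer.close();
    }
}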

Read more here: https://lucene.apache.org/core/4_4_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/WordDelimiterFilter.html
