
Search stem and exact words in Lucene 4.4.0

I've stored a Lucene document with a single TextField that contains unstemmed words.

I need to implement a search program that allows users to search for both stemmed and exact words, but since I've stored the words without stemming, a stemmed search cannot be done. Is there a way to search both exact words and/or stemmed words in documents without storing two fields?

Thanks in advance.

Indexing two separate fields seems like the right approach to me.

Stemmed and unstemmed text require different analysis strategies, and so require you to provide a different Analyzer to the QueryParser. Lucene doesn't really support indexing text in the same field with different analyzers. That is by design. Furthermore, duplicating the text in the same field could result in some fairly strange scoring impacts (particularly heavier scoring on terms that are untouched by the stemmer).

There is no need to store the text in both of these fields; it only makes sense to index them separately.

By the way, you can apply a different analyzer to different fields by using a PerFieldAnalyzerWrapper. Like:

Map<String,Analyzer> analyzerList = new HashMap<String,Analyzer>();
analyzerList.put("stemmedText", new EnglishAnalyzer(Version.LUCENE_44));
analyzerList.put("unstemmedText", new StandardAnalyzer(Version.LUCENE_44));
PerFieldAnalyzerWrapper analyzer = new PerFieldAnalyzerWrapper(new StandardAnalyzer(Version.LUCENE_44), analyzerList);
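To round that out, here is a sketch of how the wrapper above might be used at index and search time. The `directory` variable and the sample text are assumptions for illustration; the field names match the map above, and the text is stored only once:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Version;

// Index the same text into both fields; store it on only one of them.
Document doc = new Document();
String text = "the quick brown foxes jumped";
doc.add(new TextField("stemmedText", text, Field.Store.NO));
doc.add(new TextField("unstemmedText", text, Field.Store.YES));

IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_44, analyzer);
IndexWriter writer = new IndexWriter(directory, config); // 'directory' assumed to exist
writer.addDocument(doc);
writer.close();

// At search time, point the QueryParser at whichever field matches the
// kind of search the user asked for; the same wrapper supplies the right
// analyzer for each field.
QueryParser stemParser = new QueryParser(Version.LUCENE_44, "stemmedText", analyzer);
Query stemQuery = stemParser.parse("jump"); // stemmed search

QueryParser exactParser = new QueryParser(Version.LUCENE_44, "unstemmedText", analyzer);
Query exactQuery = exactParser.parse("jumped"); // exact search
```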

I can see a couple of possibilities to accomplish it, though, if you really want to.

One would be to create your own stem filter, based on (or possibly extending) the one you already wish to use, and add the ability to keep the original tokens after stemming. Mind your position increments in this case; phrase queries and the like may be problematic.
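Lucene 4.3+ actually ships a filter pair that implements this idea, so you may not need to write the filter from scratch. A sketch of such an analyzer: KeywordRepeatFilter emits every token twice, marking one copy as a keyword so the stemmer leaves it alone; the duplicate keeps a position increment of 0, which is what keeps phrase queries lined up.

```java
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.miscellaneous.KeywordRepeatFilter;
import org.apache.lucene.analysis.miscellaneous.RemoveDuplicatesTokenFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

// Indexes each token both stemmed and unstemmed at the same position.
class StemAndExactAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        Tokenizer source = new StandardTokenizer(Version.LUCENE_44, reader);
        TokenStream stream = new LowerCaseFilter(Version.LUCENE_44, source);
        // Duplicate every token, flagging one copy as a keyword.
        stream = new KeywordRepeatFilter(stream);
        // The stemmer skips keyword-flagged tokens, so originals survive.
        stream = new PorterStemFilter(stream);
        // Drop the duplicate when stemming produced no change.
        stream = new RemoveDuplicatesTokenFilter(stream);
        return new TokenStreamComponents(source, stream);
    }
}
```

Note the scoring caveat from above still applies: terms the stemmer doesn't change end up with only one posting instead of two, so relative term frequencies get skewed.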

The other (probably worse) possibility would be to add the text to the field normally, then add it again to the same field, but this time after stemming it manually. Two fields added with the same name are effectively concatenated. You'd want to store the text in a separate field in this case. Expect wonky scoring.
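For completeness, that second approach would look something like the following. The `stem(...)` helper is hypothetical, standing in for whatever manual stemming you apply; it is not a Lucene API:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.TextField;

Document doc = new Document();
// First pass: the unstemmed text, indexed but not stored.
doc.add(new TextField("text", originalText, Field.Store.NO));
// Second pass, same field name: the manually stemmed text. Lucene
// concatenates the two at index time.
doc.add(new TextField("text", stem(originalText), Field.Store.NO)); // stem() is hypothetical
// Keep the retrievable copy in a separate, stored-only field.
doc.add(new StoredField("original", originalText));
```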

Again, though, both of these are bad ideas. I see no benefit whatsoever to either of these strategies over the much easier and more useful approach of just indexing two fields.
