
Searching TokenStream fields in Lucene

I am just starting out with Lucene, and I feel like I must have a fundamental misunderstanding of it, but from the samples and documentation I could not figure out this issue.

I cannot seem to get Lucene to return results for fields which are initialized with a TokenStream, whereas fields initialized with a string work fine. I am using Lucene.NET 2.9.2 RC2.

[Edit] I've also tried this with the latest Java version (3.0.3) and see the same behavior, so it is not some quirk of the port.

Here is a basic example:

Directory index = new RAMDirectory();
Document doc = new Document();
// Field built from a pre-constructed TokenStream (indexed, not stored)
doc.Add(new Field("fieldName", new StandardTokenizer(new StringReader("Field Value Goes Here"))));
IndexWriter iw = new IndexWriter(index, new StandardAnalyzer());
iw.AddDocument(doc);
iw.Commit();
iw.Close();
// Search for one of the tokens that should have been indexed
Query q = new QueryParser("fieldName", new StandardAnalyzer()).Parse("value");
IndexSearcher searcher = new IndexSearcher(index, true);
Console.WriteLine(searcher.Search(q).Length());

(I realize this uses APIs deprecated in 2.9, but that's just for brevity... pretend the arguments that specify the version are there and that I use one of the new Search overloads.)
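For the record, the version-aware form I mean would look roughly like this (a sketch assuming Lucene.NET 2.9's Version enum; exact member naming can differ slightly between the port and the Java original):

var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29);
Query q = new QueryParser(Lucene.Net.Util.Version.LUCENE_29, "fieldName", analyzer).Parse("value");
// Search(Query, int) returning TopDocs replaces the deprecated Search(Query)
TopDocs top = searcher.Search(q, 10);
Console.WriteLine(top.TotalHits);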

This returns no results.

However, if I replace the line that adds the field with

doc.Add(new Field("fieldName", "Field Value Goes Here", Field.Store.NO, Field.Index.ANALYZED));

then the query returns a hit, as I would expect. It also works if I use the TextReader version.
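The TextReader version I mean is roughly the following; with this overload the text is analyzed at index time by the IndexWriter's analyzer rather than by a tokenizer I construct myself:

doc.Add(new Field("fieldName", new StringReader("Field Value Goes Here")));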

Both fields are indexed and tokenized, with (I think) the same tokenizer/analyzer (I've also tried others), and neither is stored, so my intuition is that they should behave the same. What am I missing?

I have found the answer to be casing.

The token stream created by StandardAnalyzer includes a LowerCaseFilter, whereas constructing a StandardTokenizer directly applies no such filter. The field is therefore indexed with the original-case token "Value", which never matches the lower-cased term "value" that the StandardAnalyzer produces at query time.
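So the fix is to wrap the tokenizer the way StandardAnalyzer does before handing it to the Field. A minimal sketch (StandardAnalyzer's full chain also includes a StopFilter; I am only adding the filters relevant here):

TokenStream stream = new StandardTokenizer(new StringReader("Field Value Goes Here"));
stream = new StandardFilter(stream);   // same token normalization StandardAnalyzer applies
stream = new LowerCaseFilter(stream);  // the missing piece: lower-case the indexed tokens
doc.Add(new Field("fieldName", stream));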
