
Get stemmed word in Lucene

Original post: 2010-11-20 21:29:53. Tags: lucene, snowballanalyzer

In Lucene I use the SnowballAnalyzer for indexing and searching.

Once the index is built, I run queries against it. For example, I query the field 'body' for 'specialized'. Because of the stemming done by the SnowballAnalyzer, IndexSearcher returns documents containing 'specialize', 'specialized', etc.

Now that I have the top documents, I want to get a text snippet from the body field. This snippet should contain the stemmed version of the query word.
For example, one of the returned documents has the body field: "Unfortunately, in some states, blind people only have access to general rehabilitation agencies, which serve people with a variety of disabilities. In these cases, specialized services for visually impaired people are not always available." I then want to get the part 'In these cases, specialized services for visually' as the snippet. Additionally, I want to get the terms from this snippet. The code that will do it is below; the one place where I have a question is marked with a '?' character.

This is how I want to do it:

    IndexReader ir = IndexReader.open(fsDir);
    TermPositionVector tv = (TermPositionVector) ir.getTermFreqVector(hits.scoreDocs[i].doc, "body");

? (here: query) The query has to be the term. So if the real query was 'specialized', then the query should be 'specialize', which is what the Snowball analyzer normally produces. How can I get the term as analyzed by the analyzer, for a single word or for a phrase? The query can contain a phrase such as "specialized machines".

    int idx = tv.indexOf(query);
    int[] idxs = tv.getTermPositions(idx);
    for (String t : tv.getTerms()) {
        int iidx = tv.indexOf(t);
        int[] iidxs = tv.getTermPositions(iidx);
        for (int ni : idxs) {
            tmpValue = 0.0f;
            for (int nni : iidxs) {
                if (Math.abs(nni - ni) <= Settings.termWindowSize) {
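The windowing idea in the loop above (collect every term that occurs within termWindowSize positions of a hit for the query term) can be sketched without Lucene. This is only an illustration under stated assumptions: a plain Map<String, int[]> stands in for the TermPositionVector, the names TermWindow and termsNear are hypothetical, and the sample terms and positions are made up to match the forms used in the question rather than taken from real analyzer output.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Lucene: a Map<String, int[]> plays the
// role of the term position vector (term -> token positions in the field).
final class TermWindow {

    /**
     * Returns the terms that occur within `window` positions of any
     * occurrence of `queryTerm`, ordered by their position in the field.
     * Returns an empty list when `queryTerm` is not in the map at all.
     */
    static List<String> termsNear(Map<String, int[]> positions, String queryTerm, int window) {
        int[] queryPositions = positions.get(queryTerm);
        if (queryPositions == null) {
            return new ArrayList<>();
        }
        // Keyed by position so the result comes out in reading order.
        TreeMap<Integer, String> byPosition = new TreeMap<>();
        for (Map.Entry<String, int[]> entry : positions.entrySet()) {
            for (int p : entry.getValue()) {
                for (int q : queryPositions) {
                    if (Math.abs(p - q) <= window) {
                        byPosition.put(p, entry.getKey());
                        break; // one nearby hit is enough for this position
                    }
                }
            }
        }
        return new ArrayList<>(byPosition.values());
    }

    public static void main(String[] args) {
        // Made-up terms for "In these cases, specialized services for
        // visually impaired"; real stems depend on the analyzer in use.
        Map<String, int[]> positions = new LinkedHashMap<>();
        positions.put("in", new int[] {0});
        positions.put("these", new int[] {1});
        positions.put("case", new int[] {2});
        positions.put("specialize", new int[] {3});
        positions.put("service", new int[] {4});
        positions.put("for", new int[] {5});
        positions.put("visual", new int[] {6});
        positions.put("impair", new int[] {7});

        // prints [these, case, specialize, service, for]
        System.out.println(termsNear(positions, "specialize", 2));
    }
}
```

The lookup key has to be the indexed (stemmed) form of the term; passing the raw query word would find nothing in the map, which is the same problem the '?' in the question points at.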

Edit
I found a way to get the stemmed term:

    Query q = queryParser.parse("some text to be parsed");
    String parsedQuery = q.toString();

The Query object also has a toString(String fieldName) method.

1 answer

I believe you are mixing several questions. First, to see the stemmed version of your query, along with other useful information, you can use the IndexSearcher's explain() method. Please see my answer to this question.

The Lucene solution for getting snippets is the Highlighter. Another option is the FastVectorHighlighter. I believe you can customize both to get the stemmed term rather than the full one.