
Get stemmed word in Lucene

In Lucene I use the SnowballAnalyzer for indexing and searching.

Once the index is built, I run queries against it. For example, I query the field 'body' for 'specialized'. Because of the stemming done by the SnowballAnalyzer, IndexSearcher returns documents containing 'specialize', 'specialized', etc.

Now, having the top documents, I want to get a text snippet from the body field. This snippet should contain the stemmed version of the query word.
For example, one of the returned documents has the body field: "Unfortunately, in some states, blind people only have access to general rehabilitation agencies, which serve people with a variety of disabilities. In these cases, specialized services for visually impaired people are not always available." Then I wish to get the part 'In these cases, specialized services for visually' as the snippet. Additionally, I want to get the terms from this snippet. The code that will do it is below, with one place marked '?' where I have a question:

How I want to do it:

IndexReader ir = IndexReader.open(fsDir);
// Term vector of the "body" field for the i-th hit
TermPositionVector tv = (TermPositionVector)ir.getTermFreqVector(hits.scoreDocs[i].doc, "body");

? - here: query has to be the term. So if the real query was 'specialized', then query should be 'specialize', which is what the Snowball analyzer normally produces. How can I get the term as analyzed by the analyzer, for a single word or for a phrase? The query can contain a phrase such as "specialized machines".

int idx = tv.indexOf(query);
int[] idxs = tv.getTermPositions(idx);
for (String t : tv.getTerms()) {
    int iidx = tv.indexOf(t);
    int[] iidxs = tv.getTermPositions(iidx);
    for (int ni : idxs) {
        tmpValue = 0.0f;
        for (int nni : iidxs) {
            if (Math.abs(nni - ni) <= Settings.termWindowSize) {
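The loop in the question compares the positions of the query term with the positions of every other term and keeps those that fall within a window. Below is a minimal, Lucene-free sketch of that comparison. It assumes the term positions have already been read into a plain Map (a stand-in for TermPositionVector); the class name TermWindow, the method termsNear and the sample term strings are all hypothetical and are not verified analyzer output.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: finds terms that occur near a query term, given token positions. */
final class TermWindow {

    /**
     * Returns every term (other than the query term) that has at least one
     * position within windowSize tokens of some query-term position.
     */
    static List<String> termsNear(Map<String, int[]> positions, String queryTerm, int windowSize) {
        List<String> nearby = new ArrayList<>();
        int[] queryPositions = positions.get(queryTerm);
        if (queryPositions == null) {
            return nearby; // the (stemmed) query term does not occur in this document
        }
        for (Map.Entry<String, int[]> entry : positions.entrySet()) {
            if (entry.getKey().equals(queryTerm)) {
                continue; // skip the query term itself
            }
            if (isWithinWindow(queryPositions, entry.getValue(), windowSize)) {
                nearby.add(entry.getKey());
            }
        }
        return nearby;
    }

    /** True if any position in a is at most windowSize tokens away from any position in b. */
    private static boolean isWithinWindow(int[] a, int[] b, int windowSize) {
        for (int pa : a) {
            for (int pb : b) {
                if (Math.abs(pa - pb) <= windowSize) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Illustrative positions only; real values would come from the term vector.
        Map<String, int[]> positions = new LinkedHashMap<>();
        positions.put("case", new int[] {2});
        positions.put("specialize", new int[] {3});
        positions.put("service", new int[] {4});
        positions.put("available", new int[] {12});
        System.out.println(termsNear(positions, "specialize", 2));
    }
}
```

The key has to be the term exactly as it was indexed, which is why the question about obtaining the analyzed form of the query matters: looking up the unstemmed word would find no positions at all.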

Edit
I found a way to get the stemmed term:
Query q = queryParser.parse("some text to be parsed");
String parsedQuery = q.toString();
The Query object also has a toString(String fieldName) method.

I believe you are mixing several questions. First, to see the stemmed version of your query, along with other useful information, you can use the IndexSearcher's explain() method. Please see my answer to this question.

The Lucene solution for getting snippets is the Highlighter. Another option is the FastVectorHighlighter. I believe you can customize both to get the stemmed term rather than the full one.
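To make the idea of a snippet concrete, here is a rough, Lucene-free sketch that cuts a fixed window of words around the first word whose normalized form equals the stemmed query term. It is not how Highlighter or FastVectorHighlighter work internally. The class SnippetSketch, its methods, and especially toyStem (which only lowercases and drops a trailing "d") are hypothetical stand-ins; in practice the normalization would have to be the same analysis the index uses.

```java
import java.util.Arrays;
import java.util.function.UnaryOperator;

/** Hypothetical stand-in for a highlighter: cuts a word window around the first match. */
final class SnippetSketch {

    /**
     * Returns the words from `context` words before to `context` words after the first
     * word whose normalized form equals stemmedTerm, or null if no word matches.
     */
    static String snippet(String text, String stemmedTerm, int context,
                          UnaryOperator<String> normalize) {
        String[] words = text.trim().split("\\s+");
        for (int i = 0; i < words.length; i++) {
            // Strip surrounding punctuation before normalizing, e.g. "cases," -> "cases".
            String bare = words[i].replaceAll("^\\W+|\\W+$", "");
            if (normalize.apply(bare).equals(stemmedTerm)) {
                int from = Math.max(0, i - context);
                int to = Math.min(words.length, i + context + 1);
                return String.join(" ", Arrays.copyOfRange(words, from, to));
            }
        }
        return null;
    }

    /** Toy normalizer (assumption): lowercases and drops a trailing "d". NOT a real stemmer. */
    static String toyStem(String word) {
        String lower = word.toLowerCase();
        return lower.endsWith("d") ? lower.substring(0, lower.length() - 1) : lower;
    }

    public static void main(String[] args) {
        String body = "Unfortunately, in some states, blind people only have access to "
                + "general rehabilitation agencies, which serve people with a variety of "
                + "disabilities. In these cases, specialized services for visually impaired "
                + "people are not always available.";
        // Prints: In these cases, specialized services for visually
        System.out.println(snippet(body, "specialize", 3, SnippetSketch::toyStem));
    }
}
```

With a window of three words on each side, this reproduces the snippet asked for in the question. The real highlighters additionally handle fragment scoring, multiple matches and markup, which this sketch leaves out.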
