"Boosting" different instances of the same field in a Lucene document

I want to use a single field to index both the document's title and its body, in an effort to improve performance.

The idea was to do something like this:

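import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

// Two instances of the same "text" field; only the title instance is boosted.
Field title = new Field("text", "alpha bravo charlie", Field.Store.NO, Field.Index.ANALYZED);
title.setBoost(3f);
Field body = new Field("text", "delta echo foxtrot", Field.Store.NO, Field.Index.ANALYZED);
Document doc = new Document();
doc.add(title);
doc.add(body);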

Then I could run a single TermQuery instead of a BooleanQuery over two separate fields.

However, it turns out that a field's boost is the product of the boosts of all fields with the same name in the document. In my case, that means both field instances end up with a boost of 3.
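To make the arithmetic concrete, here is a worked version of the index-time norm as the Lucene 3.x Similarity documentation describes it. The notation is my own shorthand, and the numbers assume the default length normalization (1/sqrt(numTerms)), a document boost of 1, and six terms in the combined "text" field. The value Lucene actually stores is additionally rounded into a single byte, which this sketch ignores.

```latex
\mathrm{norm}(t, d) = \mathrm{docBoost}(d) \cdot \mathrm{lengthNorm}(\mathit{field}) \cdot \prod_{f \in d,\; \mathrm{name}(f) = \mathit{field}} \mathrm{boost}(f)

% Example from the question: title boost 3, body boost 1 (the default)
\prod_{f} \mathrm{boost}(f) = 3 \times 1 = 3
\qquad
\mathrm{norm} = 1 \cdot \frac{1}{\sqrt{6}} \cdot 3 \approx 1.22
```

Because there is one norm per field name per document, the factor of 3 applies to matches in the body text just as much as to matches in the title text.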

Is there a way to get what I want without resorting to two different fields? One option would be to add the title field to the document several times, which increases the term frequency. This works, but seems incredibly brain-dead.
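As a rough illustration of what the repeat-the-title trick does to scoring, here is a minimal sketch that does not call Lucene at all. It only reproduces two formulas from DefaultSimilarity as I understand them for Lucene 3.x: tf = sqrt(freq) and lengthNorm = 1/sqrt(numTerms). The class and method names (RepeatedFieldSketch, buildField, weight) are hypothetical, and idf, query normalization, boosts, and the lossy byte encoding of norms are all left out.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model, not Lucene code: estimates tf * lengthNorm for a term when the
// title is added to a single shared field several times.
class RepeatedFieldSketch {

    // DefaultSimilarity-style term-frequency factor (assumed): sqrt(freq).
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    // DefaultSimilarity-style length normalization (assumed): 1 / sqrt(numTerms).
    static double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    // Builds the shared field as a list of instances: N title copies plus one body.
    static List<String[]> buildField(int titleCopies) {
        List<String[]> instances = new ArrayList<>();
        for (int i = 0; i < titleCopies; i++) {
            instances.add("alpha bravo charlie".split(" "));
        }
        instances.add("delta echo foxtrot".split(" "));
        return instances;
    }

    // Counts how often the term occurs across all instances of the field.
    static int freq(List<String[]> instances, String term) {
        int count = 0;
        for (String[] tokens : instances) {
            for (String token : tokens) {
                if (token.equals(term)) {
                    count++;
                }
            }
        }
        return count;
    }

    // Total number of tokens in the field, summed over all instances.
    static int totalTerms(List<String[]> instances) {
        int total = 0;
        for (String[] tokens : instances) {
            total += tokens.length;
        }
        return total;
    }

    // tf * lengthNorm for one term; everything else in the score is ignored.
    static double weight(List<String[]> instances, String term) {
        return tf(freq(instances, term)) * lengthNorm(totalTerms(instances));
    }

    public static void main(String[] args) {
        for (int copies : new int[] {1, 3, 9}) {
            List<String[]> field = buildField(copies);
            double title = weight(field, "alpha");
            double body = weight(field, "delta");
            System.out.printf("copies=%d title=%.3f body=%.3f ratio=%.3f%n",
                    copies, title, body, title / body);
        }
    }
}
```

Under these assumptions the title-to-body ratio grows as the square root of the number of copies, so an effective relative boost of 3 would take nine copies of the title, not three. Every extra copy also lengthens the field, which lowers the length norm for all terms in it.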

I also know about payloads, but that seems like overkill for what I'm after.

Any ideas?

If you want to take a page out of Google's book (at least their old book), you may want to create separate indexes: one for document bodies, another for titles. I'm assuming there is a stored field that points to a true UID for each actual document.

The alternative is to write a custom implementation of [Similarity][1] to get the behavior you want. Unfortunately, I find that Lucene often needs this kind of customization when unusual problems arise.

[1]: http://lucene.apache.org/java/3_0_2/api/all/org/apache/lucene/search/Similarity.html#lengthNorm(java.lang.String, int)

You can index the title and body as separate fields, with the title field boosted by the desired value. Then you can use MultiFieldQueryParser to search across both fields.

Searching multiple fields does technically take longer, but even with this overhead Lucene tends to be extremely fast (on the order of a few tens or hundreds of milliseconds).
