简体   繁体   中英

“boosting” different instances of the same field in a lucene document

I want to use a single field to index the document's title and body, in an effort to improve performance.

The idea was to do something like this:

Field title = new Field("text", "alpha bravo charlie", Field.Store.NO, Field.Index.ANALYZED);
title.setBoost(3)
Field body = new Field("text", "delta echo foxtrot", Field.Store.NO, Field.Index.ANALYZED);
Document doc = new Document();
doc.add(title);
doc.add(body);

And then I could just do a single TermQuery instead of a BooleanQuery for two separate fields.

However, it turns out that a field boost is the multiple of all the boost of fields of the same name in the document. In my case, it means that both fields have a boost of 3.

Is there a way I can get what I want without resorting to using two different fields? One way would be to add the title field several times to the document, which increases the term frequency. This works, but seems incredibly brain-dead.

I also know about payloads , but that seems like an overkill for what I'm after.

Any ideas?

If you want to take a page out of Google's book (at least their old book), then you may want to create separate indexes: one for document bodies, another for titles. I'm assuming there is a field stored that points to a true UID for each actual document.

The alternative answer is to write custom implementations of [Similarity][1] to get the behavior you want. Unfortunately I find that Lucene often needs this customization unique problems arise.

[1]: http://lucene.apache.org/java/3_0_2/api/all/org/apache/lucene/search/Similarity.html#lengthNorm(java.lang.String , int)

You can index title and body separately with title field boosted by a desired value. Then, you can use MultiFieldQueryParser to search multiple fields.

While, technically, searching multiple fields takes longer time, typically even with this overhead, Lucene tends to be extremely fast (of the order of few tens or hundreds of milliseconds.)

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM