Why does index-time boosting of a field on a document have such a dramatic effect in Lucene 4?

At index time I boost the alias field of a small set of documents, setting the boost to 2.0f. I expected this to double the score such a document gets relative to another document, everything else being equal.

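import java.util.HashSet;
import java.util.Set;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexableField;

// MbDocument and ArtistIndexField are classes from the search server project itself
public class ArtistBoostDoc {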

    //Double the score of this doc if it comes up in search
    private static float ARTIST_DOC_BOOST = 2.0f;

    private static Set<String> artistGuIdSet = new HashSet<String>();

    static  {

        artistGuIdSet.add("24f1766e-9635-4d58-a4d4-9413f9f98a4c"); //Bach
        artistGuIdSet.add("1f9df192-a621-4f54-8850-2c5373b7eac9"); //Beethoven
        artistGuIdSet.add("b972f589-fb0e-474e-b64a-803b0364fa75"); //Mozart
        artistGuIdSet.add("ad79836d-9849-44df-8789-180bbc823f3c"); //Vivaldi
        artistGuIdSet.add("27870d47-bb98-42d1-bf2b-c7e972e6befc"); //Handel
        artistGuIdSet.add("8255db36-4902-4cf6-8612-0f2b4288bc9a"); //Johann Strauss II
        artistGuIdSet.add("eefd7c1e-abcf-4ccc-ba60-0fd435c9061f"); //Richard Wagner
        artistGuIdSet.add("4e60a56a-514a-4a19-a3cc-49927c96b3cb"); //Sir Edward Elgar
        artistGuIdSet.add("c130b0fb-5dce-449d-9f40-1437f889f7fe"); //Joseph Haydn
        artistGuIdSet.add("f91e3a88-24ee-4563-8963-fab73d2765ed"); //Franz Schubert
        artistGuIdSet.add("c70d12a2-24fe-4f83-a6e6-57d84f8efb51"); //Johannes Brahms
        artistGuIdSet.add("f1bedf1f-4445-4651-9c35-f4a3f3860a13"); //Giuseppe Verdi
    }

    public static void boost(String artistGuid, MbDocument doc) {

        boost(artistGuid,doc.getLuceneDocument());
    }

    public static void boost(String artistGuid, Document doc) {
        if(artistGuIdSet.contains(artistGuid)) {
            for(IndexableField indexablefield:doc.getFields())
            {
                if(indexablefield.name().equals(ArtistIndexField.ALIAS.getName()))
                {
                    Field field = (Field)indexablefield;
                    field.setBoost(ARTIST_DOC_BOOST);
                }
            }
        }
    }
}

But then when I run this query:

http://search.musicbrainz.org/?type=artist&query=Jean&explain=true

You can see that the first doc (which was boosted at index time) has a fieldNorm of 7.5161928E9 (note the exponent), compared to 1.0 for the next result. Whenever one of these boosted docs matches on its alias field it is always the first result, and once the results have been normalized it has a score of 100 and all other results have a score of zero.

If I remove the boosting then things work as expected (the trouble is that I need some kind of boost for these documents, and now I don't have one):

http://search.beta.musicbrainz.org/?type=artist&query=Jean&explain=true

Why is boosting the field to just 2.0 having such a dramatic effect?

That seems strange, yes. There isn't anything in your code that definitively causes this, but I have a strong suspicion.

I don't quite know what an MbDocument looks like, but I'm guessing it can have multiple fields with the same name. I would guess JS Bach, in fact, has about 30 fields with the name ArtistIndexField.ALIAS.getName() .

The gotcha here is how Lucene handles multiple fields with the same name within one document. Lucene appends them all into the same field, and multiplies their boosts together.

So rather than a representation like:

alias1^2
alias2^2
alias3^2
....

You end up with something like:

(alias1 alias2 alias3 ...)^(2^30)
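
To put a number on that, here is a minimal sketch of the arithmetic only. It does not call Lucene; the class and method names are made up for illustration, and the count of 30 aliases is the guess from above, not a measured value. It multiplies one boost per same-named field instance, which is the compounding described above.

```java
// Illustration only: mimics how per-instance boosts on same-named fields
// compound into one field-level boost. No Lucene classes are used.
class BoostCompounding {

    // Multiply together the boost of every field instance that shares one name.
    static float combinedBoost(int instances, float boostPerInstance) {
        float combined = 1.0f;
        for (int i = 0; i < instances; i++) {
            combined *= boostPerInstance;
        }
        return combined;
    }

    public static void main(String[] args) {
        // 30 is a guess at the number of alias fields, not a measured value.
        System.out.println("1 alias:    " + combinedBoost(1, 2.0f));
        System.out.println("30 aliases: " + combinedBoost(30, 2.0f));
    }
}
```

With 30 instances the product is 2^30 = 1,073,741,824, which is the same order of magnitude as the fieldNorm reported in the question. The stored norm also folds in a length factor and is encoded lossily, so the exact number of aliases cannot be read back from it.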

You'll either need to concatenate them all into a single field yourself (be sure that there are spaces between concatenated terms, lest they run together into one term when indexed), or make sure only one of the added fields for alias will be boosted when added to the document.
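
Both options can be sketched without Lucene on the classpath. The helper names below are hypothetical and the alias values are sample data; in the real indexer the joined string would go into a single alias field, or the per-field boosts would be passed to field.setBoost(...) as in the question's code.

```java
import java.util.Arrays;
import java.util.List;

// Illustration only: hypothetical helpers, not part of Lucene or the original indexer.
class AliasBoostPlan {

    // Option 1: join all aliases with a space so they can be indexed as ONE field
    // that is boosted once. The space keeps neighbouring aliases from fusing into one term.
    static String joinAliases(List<String> aliases) {
        return String.join(" ", aliases);
    }

    // Option 2: keep one field per alias, but boost only the first instance.
    static float[] boostsPerField(int aliasCount, float boost) {
        float[] boosts = new float[aliasCount];
        Arrays.fill(boosts, 1.0f);
        if (aliasCount > 0) {
            boosts[0] = boost;
        }
        return boosts;
    }

    // The effective boost is the product of the boosts of all same-named fields.
    static float product(float[] boosts) {
        float result = 1.0f;
        for (float b : boosts) {
            result *= b;
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> aliases = Arrays.asList("J. S. Bach", "Johann Sebastian Bach");
        System.out.println(joinAliases(aliases));
        System.out.println(product(boostsPerField(30, 2.0f)));
    }
}
```

Either way the alias field ends up with an effective boost of 2.0 regardless of how many aliases the artist has. Note that joining the aliases changes their term positions, so a phrase query could match across the boundary between two aliases.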

And I'm just going to provide this link, as a bit of friendly advice. Do with it what you see fit.

