
Why does index-time boosting of a field on a document have such a dramatic effect in Lucene 4?

At index time I boost the alias field of a small set of documents, setting the boost to 2.0f. I assumed this would be equivalent to doubling the score such a document gets relative to another document, everything else being equal.

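import java.util.HashSet;
import java.util.Set;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexableField;

// MbDocument and ArtistIndexField are project classes; their imports are omitted here.
public class ArtistBoostDoc {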

    //Double the score of this doc if it comes up in search
    private static float ARTIST_DOC_BOOST = 2.0f;

    private static Set<String> artistGuIdSet = new HashSet<String>();

    static  {

        artistGuIdSet.add("24f1766e-9635-4d58-a4d4-9413f9f98a4c"); //Bach
        artistGuIdSet.add("1f9df192-a621-4f54-8850-2c5373b7eac9"); //Beethoven
        artistGuIdSet.add("b972f589-fb0e-474e-b64a-803b0364fa75"); //Mozart
        artistGuIdSet.add("ad79836d-9849-44df-8789-180bbc823f3c"); //Vivaldi
        artistGuIdSet.add("27870d47-bb98-42d1-bf2b-c7e972e6befc"); //Handel
        artistGuIdSet.add("8255db36-4902-4cf6-8612-0f2b4288bc9a"); //Johann Strauss II
        artistGuIdSet.add("eefd7c1e-abcf-4ccc-ba60-0fd435c9061f"); //Richard Wagner
        artistGuIdSet.add("4e60a56a-514a-4a19-a3cc-49927c96b3cb"); //Sir Edward Elgar
        artistGuIdSet.add("c130b0fb-5dce-449d-9f40-1437f889f7fe"); //Joseph Haydn
        artistGuIdSet.add("f91e3a88-24ee-4563-8963-fab73d2765ed"); //Franz Schubert
        artistGuIdSet.add("c70d12a2-24fe-4f83-a6e6-57d84f8efb51"); //Johannes Brahms
        artistGuIdSet.add("f1bedf1f-4445-4651-9c35-f4a3f3860a13"); //Giuseppe Verdi
    }

    public static void boost(String artistGuid, MbDocument doc) {

        boost(artistGuid,doc.getLuceneDocument());
    }

    public static void boost(String artistGuid, Document doc) {
        if(artistGuIdSet.contains(artistGuid)) {
            for(IndexableField indexablefield:doc.getFields())
            {
                if(indexablefield.name().equals(ArtistIndexField.ALIAS.getName()))
                {
                    Field field = (Field)indexablefield;
                    field.setBoost(ARTIST_DOC_BOOST);
                }
            }
        }
    }
}

But then when I run this query:

http://search.musicbrainz.org/?type=artist&query=Jean&explain=true

You can see that the first doc (which was boosted at index time) has a fieldNorm of 7.5161928E9 (note the E), compared to 1.0 for the next result. Basically, whenever one of these boosted docs matches on its alias field it is always the first result, and once results have been normalized it has a score of 100 while all other results score zero.

If I remove the boosting then things work as expected (the trouble is that I need some kind of boost for these documents, and now I don't have one):

http://search.beta.musicbrainz.org/?type=artist&query=Jean&explain=true

Why does boosting the field by just 2.0 have such a dramatic effect?

That seems strange, yes. Nothing in the code you posted definitively causes this, but I have a strong suspicion.

I don't quite know what an MbDocument looks like, but I'm guessing it can have multiple fields with the same name. I would guess J.S. Bach, in fact, has about 30 fields with the name ArtistIndexField.ALIAS.getName().

The gotcha here is how Lucene handles multiple fields with the same name in one document. Lucene appends them all into the same field and multiplies their boosts together.

So rather than a representation like:

alias1^2
alias2^2
alias3^2
....

You end up with something like:

(alias1 alias2 alias3 ...)^(2^30)
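
To make the arithmetic concrete, here is a minimal sketch in plain Java (no Lucene dependency) that models the multiplication described above. The class and method names are made up for illustration, and the count of 30 alias fields is only my guess from earlier, not something I have checked against the index.

```java
public class BoostProduct {

    // Models how boosts combine when several same-named field instances
    // are folded into one field: the per-instance boosts are multiplied.
    // This is an illustrative model, not Lucene's actual code.
    static float combinedBoost(float perFieldBoost, int fieldCount) {
        float product = 1.0f;
        for (int i = 0; i < fieldCount; i++) {
            product *= perFieldBoost;
        }
        return product;
    }

    public static void main(String[] args) {
        // One boosted alias field: the boost is the intended 2.0.
        System.out.println(combinedBoost(2.0f, 1));
        // Thirty boosted alias fields (assumed count): 2^30, about 1.07 billion.
        System.out.println(combinedBoost(2.0f, 30));
    }
}
```

As a side note, the reported fieldNorm of 7.5161928E9 equals 1.75 × 2^32. That looks to me like a saturated value from Lucene's lossy one-byte norm encoding rather than an exact product, but treat that reading as an assumption.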

You'll either need to concatenate them all into a single field yourself (be sure there are spaces between the concatenated terms, lest they run together into one term when indexed), or make sure only one of the alias fields added to the document is boosted.
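
A rough sketch of the first option, again in plain Java so it runs without Lucene. AliasJoiner and joinAliases are hypothetical names, and the alias strings are just sample data. The joined string would then be added to the document as a single alias field, with the 2.0f boost set once on that one field.

```java
import java.util.Arrays;
import java.util.List;

public class AliasJoiner {

    // Joins all aliases into one string, separated by single spaces so
    // that adjacent aliases do not run together into one term when the
    // field is analyzed.
    static String joinAliases(List<String> aliases) {
        return String.join(" ", aliases);
    }

    public static void main(String[] args) {
        List<String> aliases = Arrays.asList("J. S. Bach", "Johann Sebastian Bach");
        // The result is meant to be indexed as ONE alias field, boosted once.
        System.out.println(joinAliases(aliases));
    }
}
```

With a single field there is only one boost to apply, so the multiplication across field instances never happens.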

And I'm just going to provide this link, as a bit of friendly advice. Do with it what you see fit.

