Scoring of Solr multivalued field

If I have a document with a multivalued field in Solr, are the multiple values scored independently or just concatenated and scored as one big field? I'm hoping they're scored independently. Here's an example of what I mean:

I have a document with a field for a person's name, where there may be multiple names for the same person. The names are all different (very different in some cases), but they all refer to the same person/document.

Person 1: David Bowie, David Robert Jones, Ziggy Stardust, Thin White Duke

Person 2: David Letterman

Person 3: David Hasselhoff, David Michael Hasselhoff

If I were to search for "David" I'd like all of these to have about the same chance of a match. If each name is scored independently, that would seem to be the case. If they are just stored and searched as a single field, David Bowie would be punished for having many more tokens than the others. How does Solr handle this scenario?

You can just run your query q=field_name:David with debugQuery=on and see what happens.
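For example, a minimal sketch in Python using the requests library (the host, port, and core name are assumptions; the field name matches the documents below; adjust everything for your setup):

import requests

# Query the field for "David", include the score in the field list,
# sort by score, and turn on debugQuery so Solr explains each score.
resp = requests.get(
    "http://localhost:8983/solr/collection1/select",  # assumed host and core
    params={
        "q": "text_ws:David",
        "fl": "*,score",
        "sort": "score desc",
        "debugQuery": "on",
    },
)
print(resp.text)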

These are the results (including the score via fl=*,score), sorted by score desc:

<doc>
    <float name="score">0.4451987</float>
    <str name="id">2</str>
    <arr name="text_ws">
        <str>David Letterman</str>
    </arr>
</doc>
<doc>
    <float name="score">0.44072422</float>
    <str name="id">3</str>
    <arr name="text_ws">
        <str>David Hasselhoff</str>
        <str>David Michael Hasselhoff</str>
    </arr>
</doc>
<doc>
    <float name="score">0.314803</float>
    <str name="id">1</str>
    <arr name="text_ws">
        <str>David Bowie</str>
        <str>David Robert Jones</str>
        <str>Ziggy Stardust</str>
        <str>Thin White Duke</str>
    </arr>
</doc>

And this is the explanation:

<lst name="explain">
    <str name="2">
        0.4451987 = (MATCH) fieldWeight(text_ws:David in 1), product of: 1.0 = tf(termFreq(text_ws:David)=1) 0.71231794 = idf(docFreq=3, maxDocs=3) 0.625 = fieldNorm(field=text_ws, doc=1)
    </str>
    <str name="3">
        0.44072422 = (MATCH) fieldWeight(text_ws:David in 2), product of: 1.4142135 = tf(termFreq(text_ws:David)=2) 0.71231794 = idf(docFreq=3, maxDocs=3) 0.4375 = fieldNorm(field=text_ws, doc=2)
    </str>
    <str name="1">
        0.314803 = (MATCH) fieldWeight(text_ws:David in 0), product of: 1.4142135 = tf(termFreq(text_ws:David)=2) 0.71231794 = idf(docFreq=3, maxDocs=3) 0.3125 = fieldNorm(field=text_ws, doc=0)
    </str>
</lst>

The scoring factors here are:

  • termFreq: how often the term appears in the document
  • idf: inverse document frequency, based on how many documents across the index contain the term (rarer terms weigh more)
  • fieldNorm: the importance of the match, depending on index-time boosting and field length

In your example the fieldNorm makes the difference. One document has a lower tf (1 instead of 1.4142135, since the term appears just once), but that match is more important because of the shorter field length.

The fact that your field is multiValued doesn't change the scoring. I guess it would be the same with a single-valued field holding the same content. Solr works in terms of field length and terms, so, yes, David Bowie is punished for having many more tokens than the others. :)
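You can check the numbers above by hand. With Lucene's classic similarity, fieldWeight = tf * idf * fieldNorm, where tf = sqrt(termFreq), idf = 1 + ln(maxDocs / (docFreq + 1)), and fieldNorm is roughly 1/sqrt(number of terms in the field, counted across all values), quantized to a single byte at index time (which is why 1/sqrt(2) ≈ 0.7071 shows up as 0.625). A minimal sketch in Python reproducing the explain output:

import math

def tf(term_freq):
    # Classic Lucene tf: square root of the raw term frequency
    return math.sqrt(term_freq)

def idf(doc_freq, num_docs):
    # Classic Lucene idf: 1 + ln(numDocs / (docFreq + 1))
    return 1 + math.log(num_docs / (doc_freq + 1))

# id: (termFreq of "David", encoded fieldNorm taken from the explain output).
# The norm reflects the total term count across all values of the field:
# 2 terms for doc 2, 5 for doc 3, and 10 for doc 1.
docs = {"2": (1, 0.625), "3": (2, 0.4375), "1": (2, 0.3125)}

for doc_id, (term_freq, field_norm) in docs.items():
    score = tf(term_freq) * idf(3, 3) * field_norm
    print(doc_id, round(score, 7))
# prints: 2 0.4451987, 3 0.4407242, 1 0.314803

Note how doc 1's norm of 0.3125 comes from all ten tokens across its four values (1/sqrt(10) ≈ 0.3162 before quantization): for length normalization, a multivalued field behaves like one long field.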

UPDATE
I actually think David Bowie deserves his opportunity. As explained above, the fieldNorm makes the difference. Add the attribute omitNorms="true" to your text_ws field in schema.xml and reindex.
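For reference, a sketch of what that field definition could look like in schema.xml (the type name and the other attributes are assumptions based on the field name used above; omitNorms="true" is the relevant change):

<field name="text_ws" type="text_ws" indexed="true" stored="true"
       multiValued="true" omitNorms="true"/>

With that in place, the same query will give you the following result: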

<doc>
    <float name="score">1.0073696</float>
    <str name="id">1</str>
    <arr name="text">
        <str>David Bowie</str>
        <str>David Robert Jones</str>
        <str>Ziggy Stardust</str>
        <str>Thin White Duke</str>
    </arr>
</doc>
<doc>
    <float name="score">1.0073696</float>
    <str name="id">3</str>
    <arr name="text">
        <str>David Hasselhoff</str>
        <str>David Michael Hasselhoff</str>
    </arr>
</doc>
<doc>
    <float name="score">0.71231794</float>
    <str name="id">2</str>
    <arr name="text">
        <str>David Letterman</str>
    </arr>
</doc>

As you can see, now the termFreq wins and the fieldNorm is not taken into account at all. That's why the two documents with two occurrences of David are on top with the same score, despite their different lengths, and the shorter document with just one match is last with the lowest score. Here's the explanation with debugQuery=on:

<lst name="explain">
   <str name="1">
      1.0073696 = (MATCH) fieldWeight(text:David in 0), product of: 1.4142135 = tf(termFreq(text:David)=2) 0.71231794 = idf(docFreq=3, maxDocs=3) 1.0 = fieldNorm(field=text, doc=0)
   </str>
   <str name="3">
      1.0073696 = (MATCH) fieldWeight(text:David in 2), product of: 1.4142135 = tf(termFreq(text:David)=2) 0.71231794 = idf(docFreq=3, maxDocs=3) 1.0 = fieldNorm(field=text, doc=2)
   </str>
   <str name="2">
      0.71231794 = (MATCH) fieldWeight(text:David in 1), product of: 1.0 = tf(termFreq(text:David)=1) 0.71231794 = idf(docFreq=3, maxDocs=3) 1.0 = fieldNorm(field=text, doc=1)
   </str>
</lst>

You could use Lucene's SweetSpotSimilarity to define the plateau of lengths that should all have a norm of 1.0. This could help in your situation, since when you are searching for things like names the lengthNorm doesn't do any good.
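In Solr 4.0 and later you can set a similarity per field type in schema.xml. A sketch using SweetSpotSimilarityFactory (the field type and analyzer shown are assumptions carried over from the examples above, and the plateau bounds are just illustrative; double-check the parameter names against your Solr version's reference guide):

<fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
  <!-- All field lengths from lengthNormMin to lengthNormMax get a norm of 1.0,
       so names of anywhere from 1 to 10 tokens are not penalized for length -->
  <similarity class="solr.SweetSpotSimilarityFactory">
    <int name="lengthNormMin">1</int>
    <int name="lengthNormMax">10</int>
    <float name="lengthNormSteepness">0.5</float>
  </similarity>
</fieldType>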
