
Scoring of a Solr multivalued field

If I have a document with a multivalued field in Solr, are the multiple values scored independently or just concatenated and scored as one big field? I'm hoping they're scored independently. Here's an example of what I mean:

I have a document with a field for a person's name, where there may be multiple names for the same person. The names are all different (very different in some cases) but they all are the same person/document.

Person 1: David Bowie, David Robert Jones, Ziggy Stardust, Thin White Duke

Person 2: David Letterman

Person 3: David Hasselhoff, David Michael Hasselhoff

If I were to search for "David" I'd like for all of these to have about the same chance of a match. If each name is scored independently that would seem to be the case. If they are just stored and searched as a single field, David Bowie would be punished for having many more tokens than the others. How does Solr handle this scenario?

You can just run your query q=field_name:David with debugQuery=on and see what happens.
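
For reference, here is a minimal sketch of how that request could be put together. The host, port and core name are hypothetical placeholders, not something from the question; only the parameters q, fl, sort and debugQuery matter here.

```python
from urllib.parse import urlencode

# Hypothetical Solr base URL and core name; adjust to your own setup.
SOLR_SELECT_URL = "http://localhost:8983/solr/mycore/select"

def build_debug_query_url(field, term, base_url=SOLR_SELECT_URL):
    """Build a select URL that returns scores plus the scoring explanation."""
    params = {
        "q": f"{field}:{term}",
        "fl": "*,score",      # all stored fields plus the computed score
        "sort": "score desc", # best matches first
        "debugQuery": "on",   # adds the per-document explain section
    }
    return f"{base_url}?{urlencode(params)}"

url = build_debug_query_url("text_ws", "David")
print(url)
```

Opening the printed URL in a browser (or fetching it with any HTTP client) should return the documents together with the explain section.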

These are the results (I included the score through fl=*,score), sorted by score desc:

<doc>
    <float name="score">0.4451987</float>
    <str name="id">2</str>
    <arr name="text_ws">
        <str>David Letterman</str>
    </arr>
</doc>
<doc>
    <float name="score">0.44072422</float>
    <str name="id">3</str>
    <arr name="text_ws">
        <str>David Hasselhoff</str>
        <str>David Michael Hasselhoff</str>
    </arr>
</doc>
<doc>
    <float name="score">0.314803</float>
    <str name="id">1</str>
    <arr name="text_ws">
        <str>David Bowie</str>
        <str>David Robert Jones</str>
        <str>Ziggy Stardust</str>
        <str>Thin White Duke</str>
    </arr>
</doc>

And this is the explanation:

<lst name="explain">
    <str name="2">
        0.4451987 = (MATCH) fieldWeight(text_ws:David in 1), product of: 1.0 = tf(termFreq(text_ws:David)=1) 0.71231794 = idf(docFreq=3, maxDocs=3) 0.625 = fieldNorm(field=text_ws, doc=1)
    </str>
    <str name="3">
        0.44072422 = (MATCH) fieldWeight(text_ws:David in 2), product of: 1.4142135 = tf(termFreq(text_ws:David)=2) 0.71231794 = idf(docFreq=3, maxDocs=3) 0.4375 = fieldNorm(field=text_ws, doc=2)
    </str>
    <str name="1">
        0.314803 = (MATCH) fieldWeight(text_ws:David in 0), product of: 1.4142135 = tf(termFreq(text_ws:David)=2) 0.71231794 = idf(docFreq=3, maxDocs=3) 0.3125 = fieldNorm(field=text_ws, doc=0)
    </str>
</lst>

The scoring factors here are:

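  • termFreq : how often the term appears in the field of the document (tf is its square root)
  • idf : inverse document frequency; the rarer the term is across the index, the higher the value
  • fieldNorm : normalization factor for the field, depending on index-time boosting and field length; shorter fields get a higher value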

In your example the fieldNorm makes the difference. One document has a lower tf (1.0 instead of 1.4142135) since the term appears just once, but that match counts for more because the field is shorter.

The fact that your field is multiValued doesn't change the scoring. I guess it would be the same with a single-valued field with the same content. Solr works in terms of field length and terms, so, yes, David Bowie is punished for having many more tokens than the others. :)
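
To make the arithmetic concrete, here is a rough sketch that recomputes the three scores from the explain output. It assumes the classic Lucene TF-IDF formulas (tf = sqrt(termFreq), idf = 1 + ln(maxDocs / (docFreq + 1)), length norm = 1 / sqrt(number of tokens)). The truncation step in field_norm is my own approximation of the lossy one-byte norm storage, not code taken from Lucene; it happens to reproduce the fieldNorm values shown above.

```python
import math

def tf(term_freq):
    # Classic TF-IDF: square root of the raw term frequency.
    return math.sqrt(term_freq)

def idf(doc_freq, max_docs):
    # Classic TF-IDF: 1 + ln(maxDocs / (docFreq + 1)).
    return 1.0 + math.log(max_docs / (doc_freq + 1))

def field_norm(num_tokens):
    # Length norm 1/sqrt(numTokens), rounded down to a coarse grid to
    # imitate the one-byte norm storage (an approximation, see above).
    raw = 1.0 / math.sqrt(num_tokens)
    mantissa, exponent = math.frexp(raw)      # raw == mantissa * 2**exponent
    truncated = math.floor(mantissa * 8) / 8  # keep only three mantissa bits
    return math.ldexp(truncated, exponent)

docs = {
    "2": ["David Letterman"],
    "3": ["David Hasselhoff", "David Michael Hasselhoff"],
    "1": ["David Bowie", "David Robert Jones", "Ziggy Stardust", "Thin White Duke"],
}

def score(values, term, doc_freq, max_docs):
    # All values of the multivalued field are counted as one token stream.
    tokens = [tok for value in values for tok in value.split()]
    return tf(tokens.count(term)) * idf(doc_freq, max_docs) * field_norm(len(tokens))

scores = {doc_id: score(values, "David", doc_freq=3, max_docs=3)
          for doc_id, values in docs.items()}
for doc_id, value in scores.items():
    print(doc_id, round(value, 7))
```

The printed values come out very close to the 0.4451987, 0.44072422 and 0.314803 above, and the only factor that depends on the total number of tokens across all values is field_norm.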

UPDATE
I actually think David Bowie deserves his opportunity. As explained above, the fieldNorm makes the difference. Add the attribute omitNorms="true" to your text_ws field in schema.xml and reindex. The same query (run here against a field named text) will give you the following result:

<doc>
    <float name="score">1.0073696</float>
    <str name="id">1</str>
    <arr name="text">
        <str>David Bowie</str>
        <str>David Robert Jones</str>
        <str>Ziggy Stardust</str>
        <str>Thin White Duke</str>
    </arr>
</doc>
<doc>
    <float name="score">1.0073696</float>
    <str name="id">3</str>
    <arr name="text">
        <str>David Hasselhoff</str>
        <str>David Michael Hasselhoff</str>
    </arr>
</doc>
<doc>
    <float name="score">0.71231794</float>
    <str name="id">2</str>
    <arr name="text">
        <str>David Letterman</str>
    </arr>
</doc>

As you can see, now the termFreq wins and the fieldNorm is not taken into account at all. That's why the two documents with two occurrences of David are on top with the same score, despite their different lengths, and the shorter document with just one match comes last with the lowest score. Here's the explanation with debugQuery=on:

<lst name="explain">
   <str name="1">
      1.0073696 = (MATCH) fieldWeight(text:David in 0), product of: 1.4142135 = tf(termFreq(text:David)=2) 0.71231794 = idf(docFreq=3, maxDocs=3) 1.0 = fieldNorm(field=text, doc=0)
   </str>
   <str name="3">
      1.0073696 = (MATCH) fieldWeight(text:David in 2), product of: 1.4142135 = tf(termFreq(text:David)=2) 0.71231794 = idf(docFreq=3, maxDocs=3) 1.0 = fieldNorm(field=text, doc=2)
   </str>
   <str name="2">
      0.71231794 = (MATCH) fieldWeight(text:David in 1), product of: 1.0 = tf(termFreq(text:David)=1) 0.71231794 = idf(docFreq=3, maxDocs=3) 1.0 = fieldNorm(field=text, doc=1)
   </str>
</lst>
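
As a quick check of those numbers, here is a small sketch under the same classic TF-IDF assumptions (tf = sqrt(termFreq), idf = 1 + ln(maxDocs / (docFreq + 1))), with fieldNorm fixed at 1.0 as the explain output shows once norms are omitted. The function name is just an illustration.

```python
import math

def score_without_norms(term_freq, doc_freq, max_docs):
    tf = math.sqrt(term_freq)
    idf = 1.0 + math.log(max_docs / (doc_freq + 1))
    field_norm = 1.0  # constant for every document when norms are omitted
    return tf * idf * field_norm

# termFreq of "David" per document id, taken from the explain output.
term_freqs = {"1": 2, "3": 2, "2": 1}
scores = {doc_id: score_without_norms(freq, doc_freq=3, max_docs=3)
          for doc_id, freq in term_freqs.items()}
for doc_id, value in scores.items():
    print(doc_id, round(value, 7))
```

With the norm out of the picture, the score depends only on how many times the term occurs, so documents 1 and 3 tie regardless of their length.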

You could use Lucene's SweetSpotSimilarity to define a plateau of lengths that should all get a norm of 1.0. This could help in your situation: as long as you are searching for things like names, lengthNorm doesn't do any good.
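
To illustrate the plateau idea, here is a sketch of a length norm that stays at 1.0 between a minimum and maximum length and falls off outside that range. It is modeled on the formula that, as far as I know, SweetSpotSimilarity uses; the function name, the plateau bounds (1 to 10 tokens) and the steepness value are assumptions chosen for this example.

```python
import math

def sweet_spot_length_norm(num_tokens, min_len, max_len, steepness=0.5):
    # Zero anywhere inside [min_len, max_len]; grows with the distance outside.
    distance = (abs(num_tokens - min_len)
                + abs(num_tokens - max_len)
                - (max_len - min_len))
    return 1.0 / math.sqrt(steepness * distance + 1.0)

# Token counts of the three example documents, plus a much longer field.
for length in (2, 5, 10, 40):
    print(length, sweet_spot_length_norm(length, min_len=1, max_len=10))
```

With a plateau from 1 to 10 tokens, all three people in the question get the same norm, while a much longer field is still penalized.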
