How to make Elasticsearch sort/prefer hits with exactly matching strings first

I'm using default analyzers and indexing. So let's say I have this simple mapping:

"question": {
    "properties": {
        "title": {
            "type": "string"
        },
        "answer": {
            "properties": {
                "text": {
                    "type": "string"
                }
            }
        }
    }
}

(That was just an example; sorry if it has typos.)

Now, I perform the following search.

GET _search
{
    "query": {
        "query_string": {
            "query": "yes correct",
            "fields": ["answer.text"]
        }
    }
}

The results rank a text value like "yes correct." (doc id 1) above plain "yes correct" (without the period, doc id 181). Both hits have the same score, but the hits array lists the one with the smaller doc id first. I understand that the default ordering falls back to doc id for ties, so how do I exclude that one attribute and still use the rest of the default options?

I'm not setting any custom analyzers, so everything is using default values for Elasticsearch 2.0.

This is probably a use case for Dis Max Query

A query that generates the union of documents produced by its subqueries, and that scores each document with the maximum score for that document as produced by any subquery, plus a tie breaking increment for any additional matching subqueries.
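
In other words, the combined score is roughly (a sketch of the documented dis_max behaviour, ignoring the outer boost):

final_score = max(score of the matching subqueries)
              + tie_breaker * sum(scores of the other matching subqueries)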

So, following that, you need the exact match on the answer to produce the highest score, and for that you'll have to use a custom analyzer. Your mapping would look like this:

PUT /test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_keyword": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [
            "asciifolding",
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "question": {
      "properties": {
        "title": {
          "type": "string"
        },
        "answer": {
          "type": "object",
          "properties": {
            "text": {
              "type": "string",
              "analyzer": "my_keyword",
              "fields": {
                "stemmed": {
                  "type": "string",
                  "analyzer": "standard"
                }
              }
            }
          }
        }
      }
    }
  }
}
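
To sanity-check what each analyzer produces, you can run the _analyze API against the new index (a quick sketch using the query-string form, which works on 2.x):

GET /test/_analyze?analyzer=my_keyword&text=yes%20correct.

GET /test/_analyze?analyzer=standard&text=yes%20correct.

The first call should return the single lowercased token "yes correct." (dot kept), while the standard analyzer returns the two tokens "yes" and "correct" with the dot stripped. That difference is exactly what the dis_max query below exploits.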

Your test data:

PUT /test/question/1
{
  "title": "title nr1",
  "answer": [
    {
      "text": "yes correct."
    }
  ]
}

PUT /test/question/2
{
  "title": "title nr2",
  "answer": [
    {
      "text": "yes correct"
    }
  ]
}

Now, when you query for "yes correct." with a query like this:

POST /test/_search
{
  "query": {
    "dis_max": {
      "tie_breaker": 0.7,
      "boost": 1.2,
      "queries": [
        {
          "match": {
            "answer.text": {
              "query": "yes correct.",
              "type": "phrase"
            }
          }
        },
        {
          "match": {
            "answer.text.stemmed": {
              "query": "yes correct.",
              "operator": "and"
            }
          }
        }
      ]
    }
  }
}

You get this output:

{
   "took": 2,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 2,
      "max_score": 0.37919715,
      "hits": [
         {
            "_index": "test",
            "_type": "question",
            "_id": "1",
            "_score": 0.37919715,
            "_source": {
               "title": "title nr1",
               "answer": [
                  {
                     "text": "yes correct."
                  }
               ]
            }
         },
         {
            "_index": "test",
            "_type": "question",
            "_id": "2",
            "_score": 0.11261705,
            "_source": {
               "title": "title nr2",
               "answer": [
                  {
                     "text": "yes correct"
                  }
               ]
            }
         }
      ]
   }
}

If you run the very same query without the trailing dot, i.e. for "yes correct", you get this result:

{
   "took": 2,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 2,
      "max_score": 0.37919715,
      "hits": [
         {
            "_index": "test",
            "_type": "question",
            "_id": "2",
            "_score": 0.37919715,
            "_source": {
               "title": "title nr2",
               "answer": [
                  {
                     "text": "yes correct"
                  }
               ]
            }
         },
         {
            "_index": "test",
            "_type": "question",
            "_id": "1",
            "_score": 0.11261705,
            "_source": {
               "title": "title nr1",
               "answer": [
                  {
                     "text": "yes correct."
                  }
               ]
            }
         }
      ]
   }
}

Hopefully this is what you're looking for.

By the way, I'd recommend always using the Match query when performing text search (an example equivalent to the original query_string search follows below). Taken from the documentation:

Comparison to query_string / field


The match family of queries does not go through a "query parsing" process. It does not support field name prefixes, wildcard characters, or other "advanced" features. For this reason, chances of it failing are very small / non existent, and it provides an excellent behavior when it comes to just analyze and run that text as a query behavior (which is usually what a text search box does). Also, the phrase_prefix type can provide a great "as you type" behavior to automatically load search results.
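
For example, the original query_string search from the question could be rewritten as a plain match query like this (just a sketch; the field name comes from the question's mapping):

GET _search
{
  "query": {
    "match": {
      "answer.text": "yes correct"
    }
  }
}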

Elasticsearch, or rather Lucene, scoring does not take the relative positioning of the tokens into account. It uses three different criteria (see the sketch after this list):

  1. Term frequency - how often the search term occurs in the document.
  2. Inverse document frequency - how many documents in the entire index contain the search term. The more common the term, the less weight it carries in the search.
  3. Field length normalization - the number of tokens present in the target field; matches in shorter fields count for more.
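
Roughly, the classic Lucene TF/IDF similarity combines these per-term factors as follows (a simplified sketch of the "practical scoring function"; the exact defaults are Lucene internals):

score(q, d) = queryNorm(q) * coord(q, d)
              * sum over terms t in q of ( tf(t in d) * idf(t)^2 * boost(t) * norm(t, d) )

where, by default, tf(t in d) = sqrt(term frequency), idf(t) = 1 + ln(numDocs / (docFreq + 1)), and norm(t, d) is approximately 1 / sqrt(number of tokens in the field).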

You can learn more about it here.
