Elasticsearch multi-field fuzzy search does not return exact matches first

Question

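I am running a fuzzy Elasticsearch query against the "text" and "keywords" fields. I have two documents in Elasticsearch, one with the text "testPhone 5" and another with "testPhone 4s". When I run a fuzzy query for "testPhone 5", both documents get exactly the same score. Why does this happen?

Extra info: I am indexing the documents with the "uax_url_email" tokenizer and a "lowercase" filter.

Here is the query I am running: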

{
    query : {
        bool: {
            // match one or the other fuzzy query
            should: [
                {
                    fuzzy: {
                        text: {
                            min_similarity: 0.4,
                            value: 'testphone 5',
                            prefix_length: 0,
                            boost: 5,
                        }
                    }
                },
                {
                    fuzzy: {
                        keywords: {
                            min_similarity: 0.4,
                            value: 'testphone 5',
                            prefix_length: 0,
                            boost: 1,
                        }
                    }
                }
            ]
        }
    },
    sort: [ 
        '_score'
    ],
    explain: true
}

Here are the results:

{ max_score: 0.47213298,
  total: 2,
  hits:
  [ { _index: 'test',
     _shard: 0,
     _id: '51fbf95f82e89ae8c300002c',
     _node: '0Mtfzbe1RDinU71Ordx-Ag',
     _source:
    { next: { id: '51fbf95f82e89ae8c3000027' },
      cards: [ '51fbf95f82e89ae8c3000027', [length]: 1 ],
      other: false,
      _id: '51fbf95f82e89ae8c300002c',
      category: '51fbf95f82e89ae8c300002b',
      image: 'https://s3.amazonaws.com/sold_category_icons/Smartphones.png',
      text: 'testPhone 5',
      keywords: [ [length]: 0 ],
      __v: 0 },
   _type: 'productgroup',
   _explanation:
    { details:
       [ { details:
            [ { details:
                 [ { details:
                      [ { details:
                           [ { value: 3.8888888, description: 'boost' },
                             { value: 1.5108256,
                               description: 'idf(docFreq=2, maxDocs=5)' },
                             { value: 0.17020021,
                               description: 'queryNorm' },
                             [length]: 3 ],
                          value: 0.99999994,
                          description: 'queryWeight, product of:' },
                        { details:
                           [ { details:
                                [ { value: 1, description: 'termFreq=1.0' },
                                  [length]: 1 ],
                               value: 1,
                               description: 'tf(freq=1.0), with freq of:' },
                             { value: 1.5108256,
                               description: 'idf(docFreq=2, maxDocs=5)' },
                             { value: 0.625,
                               description: 'fieldNorm(doc=0)' },
                             [length]: 3 ],
                          value: 0.944266,
                          description: 'fieldWeight in 0, product of:' },
                        [length]: 2 ],
                     value: 0.94426596,
                     description: 'score(doc=0,freq=1.0 = termFreq=1.0\n), product of:' },
                   [length]: 1 ],
                value: 0.94426596,
                description: 'weight(text:testphone^3.8888888 in 0) [PerFieldSimilarity], result of:' },
              [length]: 1 ],
           value: 0.94426596,
           description: 'sum of:' },
         { value: 0.5, description: 'coord(1/2)' },
         [length]: 2 ],
      value: 0.47213298,
      description: 'product of:' },
   _score: 0.47213298 },
 { _index: 'test',
   _shard: 4,
   _id: '51fbf95f82e89ae8c300002d',
   _node: '0Mtfzbe1RDinU71Ordx-Ag',
   _source:
    { next: { id: '51fbf95f82e89ae8c3000027' },
      cards: [ '51fbf95f82e89ae8c3000029', [length]: 1 ],
      other: false,
      _id: '51fbf95f82e89ae8c300002d',
      category: '51fbf95f82e89ae8c300002b',
      image: 'https://s3.amazonaws.com/sold_category_icons/Smartphones.png',
      text: 'testPhone 4s',
      keywords: [ 'apple', [length]: 1 ],
      __v: 0 },
   _type: 'productgroup',
   _explanation:
    { details:
       [ { details:
            [ { details:
                 [ { details:
                      [ { details:
                           [ { value: 3.8888888, description: 'boost' },
                             { value: 1.5108256,
                               description: 'idf(docFreq=2, maxDocs=5)' },
                             { value: 0.17020021,
                               description: 'queryNorm' },
                             [length]: 3 ],
                          value: 0.99999994,
                          description: 'queryWeight, product of:' },
                        { details:
                           [ { details:
                                [ { value: 1, description: 'termFreq=1.0' },
                                  [length]: 1 ],
                               value: 1,
                               description: 'tf(freq=1.0), with freq of:' },
                             { value: 1.5108256,
                               description: 'idf(docFreq=2, maxDocs=5)' },
                             { value: 0.625,
                               description: 'fieldNorm(doc=0)' },
                             [length]: 3 ],
                          value: 0.944266,
                          description: 'fieldWeight in 0, product of:' },
                        [length]: 2 ],
                     value: 0.94426596,
                     description: 'score(doc=0,freq=1.0 = termFreq=1.0\n), product of:' },
                   [length]: 1 ],
                value: 0.94426596,
                description: 'weight(text:testphone^3.8888888 in 0) [PerFieldSimilarity], result of:' },
              [length]: 1 ],
           value: 0.94426596,
           description: 'sum of:' },
         { value: 0.5, description: 'coord(1/2)' },
         [length]: 2 ],
      value: 0.47213298,
      description: 'product of:' },
   _score: 0.47213298 },
 [length]: 2 ] }

Answer 1

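Fuzzy queries are not analyzed, but the field is. Your search for "testphone 5" with a min_similarity of 0.4 therefore resolves to the analyzed term "testphone" for both documents, and that term is used to further filter the results: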

description: 'weight(text:testphone^3.8888888 in 0) [PerFieldSimilarity], result of:' },

See also @imotov's excellent answer: ElasticSearch's Fuzzy Query

You can use the _analyze API to see exactly how your string is tokenized:

http://www.elasticsearch.org/guide/zh-CN/elasticsearch/reference/current/indices-analyze.html

For example:

http://localhost:9200/prefix_test/_analyze?field=text&text=testphone+5

will return:

{
   "tokens": [
      {
         "token": "testphone",
         "start_offset": 0,
         "end_offset": 9,
         "type": "<ALPHANUM>",
         "position": 1
      },
      {
         "token": "5",
         "start_offset": 10,
         "end_offset": 11,
         "type": "<NUM>",
         "position": 2
      }
   ]
}

So even if you index the value "testphone sammsung", a fuzzy query for "testphone samsunk" would not match anything, whereas a fuzzy query for "samsunk" would.
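The behaviour described above can be reproduced outside Elasticsearch with a small sketch. It assumes a simplified analyzer (lowercase, split on whitespace) in place of the real uax_url_email tokenizer, and a cap of 2 edits per term; the helper names `levenshtein`, `analyze` and `fuzzyMatches` are made up for illustration and are not part of any Elasticsearch API.

```javascript
// Plain Levenshtein edit distance (insert, delete, substitute).
function levenshtein(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Simplified stand-in for the index-time analyzer (assumption):
// lowercase the text and split it on whitespace.
function analyze(text) {
  return text.toLowerCase().split(/\s+/).filter(Boolean);
}

// The fuzzy term is NOT analyzed: it is compared as one string
// against each indexed term of the field.
function fuzzyMatches(queryTerm, fieldValue, maxEdits = 2) {
  return analyze(fieldValue)
    .map(term => ({ term, distance: levenshtein(queryTerm, term) }))
    .filter(match => match.distance <= maxEdits);
}

// Both documents match through the same indexed term, "testphone".
console.log(fuzzyMatches('testphone 5', 'testPhone 5'));
console.log(fuzzyMatches('testphone 5', 'testPhone 4s'));

// The whole phrase is too far from any single indexed term...
console.log(fuzzyMatches('testphone samsunk', 'testphone sammsung'));
// ...but the single word is within two edits of "sammsung".
console.log(fuzzyMatches('samsunk', 'testphone sammsung'));
```

In both documents the only term within reach is "testphone", at distance 2, so nothing in the match distinguishes "testPhone 5" from "testPhone 4s". If the similarity is computed as 1 - distance / min(lengths), which is an assumption about the Lucene version in use, the query boost of 5 scales to 5 × (1 - 2/9) ≈ 3.8889, which agrees with the boost shown in the explain output.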

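You might get better results by not analyzing the field (or by using the keyword analyzer on it).

If you want to analyze a single field in more than one way, you can use the multi_field construct.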

http://www.elasticsearch.org/guide/zh-CN/elasticsearch/reference/current/mapping-multi-field-type.html
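As a rough sketch of what such a mapping could look like, the index body below keeps the tokenized "text" field and adds a second, whole-value sub-field. The analyzer names `text_tokens` and `text_whole` and the sub-field name `exact` are hypothetical; the type name `productgroup` comes from the results in the question. The multi_field type belongs to older Elasticsearch releases, so check the mapping documentation for the version you run.

```javascript
// Hypothetical index settings and mapping (names are illustrative only).
const indexBody = {
  settings: {
    analysis: {
      analyzer: {
        // Tokenized analysis, as described in the question.
        text_tokens: { type: 'custom', tokenizer: 'uax_url_email', filter: ['lowercase'] },
        // Whole value kept as a single lowercased term.
        text_whole: { type: 'custom', tokenizer: 'keyword', filter: ['lowercase'] }
      }
    }
  },
  mappings: {
    productgroup: {
      properties: {
        text: {
          type: 'multi_field',
          fields: {
            // Default sub-field, addressed as "text".
            text: { type: 'string', analyzer: 'text_tokens' },
            // Extra sub-field, addressed as "text.exact".
            exact: { type: 'string', analyzer: 'text_whole' }
          }
        }
      }
    }
  }
};

console.log(JSON.stringify(indexBody, null, 2));
```

With a mapping along these lines, a fuzzy query against "text.exact" would compare "testphone 5" with the whole indexed value rather than with the single token "testphone".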

Answer 2

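I ran into this problem recently. I can't tell you exactly why it happens, but I can tell you how I solved it:

I ran two queries against the same field: an exact-match query, and then the very same query on the same field with fuzzy matching enabled and a lower boost.

This ensures that my exact matches always rank higher than the fuzzy ones.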

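P.S. I think the scores are equal because both documents match thanks to the fuzziness, and ES doesn't care that only one of them is an exact match as long as both match. That is purely a theory on my part, though, since I'm not very familiar with the scoring algorithm.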
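The workaround in this answer could be sketched roughly as follows. The answer does not say which query type was used; this sketch assumes two match clauses inside a bool query, and the function name `exactThenFuzzy` and the boost values 10 and 1 are illustrative choices. The meaning of the fuzziness value depends on the Elasticsearch version, so treat 0.4 as a placeholder taken from the question.

```javascript
// Build a request body with an exact clause (high boost) and the same
// clause with fuzziness enabled (low boost). Hypothetical helper.
function exactThenFuzzy(field, text, fuzziness) {
  return {
    query: {
      bool: {
        should: [
          // Exact clause: no fuzziness, higher boost.
          { match: { [field]: { query: text, boost: 10 } } },
          // Same clause with fuzziness enabled, lower boost.
          { match: { [field]: { query: text, fuzziness: fuzziness, boost: 1 } } }
        ]
      }
    }
  };
}

const body = exactThenFuzzy('text', 'testphone 5', 0.4);
console.log(JSON.stringify(body, null, 2));
```

Because a match query analyzes its input, the exact clause looks for both "testphone" and "5", so the "testPhone 5" document picks up extra score from the high-boost clause while near matches are still returned by the fuzzy one.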

2 solutions

Solution 1
2 2014-01-22 19:12:24

Solution 2
0 2013-08-05 09:45:56