Elasticsearch fuzzy query - max edits doesn't work as expected

I recently added the fuzzy operator and fuzzy query settings to our search query string to cover user mistyping (e.g. "zamestnanost" vs. "zamestnani"):

POST /my_index/_search
{
   "query": {
      "query_string": {
         "query": "+(content:zamestnanost~)",
         "fuzzy_prefix_length": 3,
         "fuzzy_min_sim": 0.5, 
         "fuzzy_max_expansions": 50
      }
   }
}

As I understand the fuzzy query settings, fuzzy_min_sim = 0.5 should allow length(query) * 0.5 edits of the original query (in this case 6 edits).

However, it doesn't match even "closer" words (tokens) like

  • "zamestnani"
  • "zamestnany"

I have a strange feeling that it still matches only indexed words that are at most 2 edits away from the original query string (which is the default edit count in a fuzzy query).

I also ran an explain on my query, and I think the results support this hypothesis. The _explanation looks like this:

"_explanation": {
   "value": 0.057083897,
   "description": "sum of:",
   "details": [
      {
         "value": 0.023866946,
         "description": "weight(content:zamestnano^0.8 in 0) [PerFieldSimilarity], result of:",
         "details": [
            {
               "value": 0.023866946,
               "description": "score(doc=0,freq=4.0), product of:",
               "details": [
                  {
                     "value": 0.66062796,
                     "description": "queryWeight, product of:",
                     "details": [
                        {
                           "value": 0.8,
                           "description": "boost"
                        },
                        {
                           "value": 4.624341,
                           "description": "idf(docFreq=1, maxDocs=75)"
                        },
                        {
                           "value": 0.17857353,
                           "description": "queryNorm"
                        }
                     ]
                  },
                  {
                     "value": 0.036127664,
                     "description": "fieldWeight in 0, product of:",
                     "details": [
                        {
                           "value": 2,
                           "description": "tf(freq=4.0), with freq of:",
                           "details": [
                              {
                                 "value": 4,
                                 "description": "termFreq=4.0"
                              }
                           ]
                        },
                        {
                           "value": 4.624341,
                           "description": "idf(docFreq=1, maxDocs=75)"
                        },
                        {
                           "value": 0.00390625,
                           "description": "fieldNorm(doc=0)"
                        }
                     ]
                  }
               ]
            }
         ]
      },
      {
         "value": 0.03321695,
         "description": "weight(content:zamestnanos^0.9090909 in 0) [PerFieldSimilarity], result of:",
         "details": [
            {
               "value": 0.03321695,
               "description": "score(doc=0,freq=6.0), product of:",
               "details": [
                  {
                     "value": 0.7507135,
                     "description": "queryWeight, product of:",
                     "details": [
                        {
                           "value": 0.9090909,
                           "description": "boost"
                        },
                        {
                           "value": 4.624341,
                           "description": "idf(docFreq=1, maxDocs=75)"
                        },
                        {
                           "value": 0.17857353,
                           "description": "queryNorm"
                        }
                     ]
                  },
                  {
                     "value": 0.044247173,
                     "description": "fieldWeight in 0, product of:",
                     "details": [
                        {
                           "value": 2.4494898,
                           "description": "tf(freq=6.0), with freq of:",
                           "details": [
                              {
                                 "value": 6,
                                 "description": "termFreq=6.0"
                              }
                           ]
                        },
                        {
                           "value": 4.624341,
                           "description": "idf(docFreq=1, maxDocs=75)"
                        },
                        {
                           "value": 0.00390625,
                           "description": "fieldNorm(doc=0)"
                        }
                     ]
                  }
               ]
            }
         ]
      }
   ]
}

Only the terms "zamestnano" and "zamestnanos" are generated by the fuzzy query edits.

Do I understand the fuzzy query settings correctly? Could you please point out my mistake?

Thanks a lot for every idea!

From the documentation:

0.0..1.0

[1.7.0] Deprecated in 1.7.0. Support for similarity will be removed in Elasticsearch 2.0. Converted into an edit distance using the formula: length(term) * (1.0 - fuzziness), e.g. a fuzziness of 0.6 with a term of length 10 would result in an edit distance of 4. Note: in all APIs except for the Fuzzy Like This Query, the maximum allowed edit distance is 2.
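The quoted rule can be sketched in a few lines of Python. This is only an illustration of the documented formula plus the cap of 2, not Elasticsearch's actual code; the function name similarity_to_edits, the MAX_EDITS constant, and the truncation to an integer are assumptions made for the example.

```python
# Assumption: the hard cap on edit distance mentioned in the documentation note.
MAX_EDITS = 2


def similarity_to_edits(term: str, min_sim: float, cap: int = MAX_EDITS) -> int:
    """Apply length(term) * (1.0 - fuzziness), then clamp the result to the cap."""
    # Assumption: the fractional part is truncated when converting to whole edits.
    uncapped = int(len(term) * (1.0 - min_sim))
    return min(uncapped, cap)


# The query from the question: the formula alone gives 6, the cap reduces it to 2.
print(similarity_to_edits("zamestnanost", 0.5))    # 2

# The documentation example (length 10, fuzziness 0.6) with the cap lifted.
print(similarity_to_edits("a" * 10, 0.6, cap=10))  # 4
```

So a fuzzy_min_sim of 0.5 on a 12-character term would allow 6 edits by the formula alone, but the cap brings it down to 2.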

The easiest way to double-check this is to use the validate API:

GET _validate/query?explain&index=my_index
{
  "query": {
    "query_string": {
      "query": "+(content:zamestnanost~)",
      "fuzzy_prefix_length": 3,
      "fuzzy_min_sim": 0.5,
      "fuzzy_max_expansions": 50
    }
  }
}

This gives the following result:

   "explanations": [
      {
         "index": "test",
         "valid": true,
         "explanation": "+content:zamestnanost~2"
      }
   ]

which shows the actual edit distance ES will use in the query: zamestnanost~2.
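To see why a limit of 2 edits excludes the words from the question, one can compute the edit distances by hand. The sketch below uses a plain Levenshtein distance (insertions, deletions, substitutions); the levenshtein helper is written for this example and is not part of Elasticsearch, which may also count a transposition as a single edit. That difference does not affect these particular words.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca from a
                current[j - 1] + 1,      # insert cb into a
                previous[j - 1] + cost,  # substitute ca with cb (or keep it)
            ))
        previous = current
    return previous[-1]


query = "zamestnanost"
for term in ["zamestnanos", "zamestnano", "zamestnani", "zamestnany"]:
    distance = levenshtein(query, term)
    verdict = "within 2 edits" if distance <= 2 else "too far"
    print(f"{term}: {distance} ({verdict})")
```

Under this sketch, "zamestnanos" and "zamestnano" are 1 and 2 edits away from the query, while "zamestnani" and "zamestnany" each need 3. That is consistent with the explain output in the question.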
