
Elasticsearch query_string wildcard does not consider length

I have some records in Elasticsearch that share the same prefix, such as: word, worda, wordab, wordabc, wordabcd.

I am using query_string with a wildcard:

{
  "query": {
    "bool": {
      "must": [
        {
          "query_string": {
            "query": "word*"
          }
        }
      ]
    }
  }
}

All hits have the same score ("_score" : 1.0), so the order is arbitrary. Is it possible to get a score that reflects how much of the word the term actually matches? For instance, word matches the term 100%, worda matches 80%, and so on.

You get a score of 1 for every matched document because wildcard and prefix queries are multi-term queries: before they can be executed, Elasticsearch has to rewrite them into the actual terms they match.

There are several ways to do this rewrite. The default one is called constant_score, which assigns the same constant score (1.0) to every matching document.
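The effect can be pictured with a toy model. The sketch below is not how Lucene implements the rewrite; index_terms is a made-up term dictionary and both function names are hypothetical. It only illustrates why every hit ends up with the same score once the pattern has been expanded into concrete terms.

```python
# Toy model of a multi-term rewrite with constant scoring.
# NOT Lucene's implementation: index_terms is a made-up term dictionary.
index_terms = ["ward", "word", "worda", "wordab", "wordabc", "wordabcd", "work"]


def rewrite_prefix(prefix, terms):
    """Expand a prefix pattern into the concrete terms it matches."""
    return [t for t in terms if t.startswith(prefix)]


def constant_score_hits(prefix, terms, boost=1.0):
    """Give every expanded term the same score, regardless of its length."""
    return {term: boost for term in rewrite_prefix(prefix, terms)}


hits = constant_score_hits("word", index_terms)
print(hits)
```

Because the score is attached after the expansion and does not depend on the matched term, word and wordabcd are indistinguishable to the sorter.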

Some of the other rewrite methods produce unequal scores, but that scoring relies on the TF-IDF distribution of the terms (e.g. how often worda occurs in the matched document and how many documents in the whole index contain worda), not on how much of the word the pattern covers. As a starting point you could try top_terms_1000 and tweak it later.
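As a rough sketch, the rewrite method can be set through the rewrite parameter of query_string. The helper below only builds the request body as a Python dict; the function name wildcard_query is hypothetical, and sending the body to a cluster is left out on purpose.

```python
import json


def wildcard_query(pattern, rewrite="top_terms_1000"):
    """Build a search body for a query_string wildcard with an explicit rewrite."""
    return {
        "query": {
            "bool": {
                "must": [
                    {"query_string": {"query": pattern, "rewrite": rewrite}}
                ]
            }
        }
    }


body = wildcard_query("word*")
print(json.dumps(body, indent=2))
```

The resulting scores will still reflect term statistics rather than prefix coverage, so this is only a first step.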

Unfortunately, there is no perfect out-of-the-box way to achieve the expected behaviour.

One possible way to mimic it is to adapt the Edge NGram tokenizer so that it produces the following tokens from wordabc:

w, wo, wor, word, ...
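The token list above can be reproduced with a few lines of Python, shown next to a possible index definition. This is a minimal sketch under assumptions: the field name name, the names prefix_tokenizer and prefix_analyzer, and the gram limits are all made up for illustration, and the typeless mappings layout assumes a recent Elasticsearch version.

```python
def edge_ngrams(text, min_gram=1, max_gram=10):
    """Return the prefixes of text, mimicking what an edge n-gram tokenizer emits."""
    longest = min(len(text), max_gram)
    return [text[:n] for n in range(min_gram, longest + 1)]


# Hypothetical index body: index-time analysis emits prefixes, while the
# search-time analyzer leaves the query term whole.
index_body = {
    "settings": {
        "analysis": {
            "tokenizer": {
                "prefix_tokenizer": {
                    "type": "edge_ngram",
                    "min_gram": 1,
                    "max_gram": 10,
                    "token_chars": ["letter", "digit"],
                }
            },
            "analyzer": {
                "prefix_analyzer": {
                    "type": "custom",
                    "tokenizer": "prefix_tokenizer",
                    "filter": ["lowercase"],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            "name": {
                "type": "text",
                "analyzer": "prefix_analyzer",
                "search_analyzer": "standard",
            }
        }
    },
}

print(edge_ngrams("wordabc"))
```

With such a field, a plain match query for word hits the indexed token word directly, and shorter values such as word contain fewer tokens than wordabcd, which field-length normalization may reward.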

In this case querying could produce a more meaningful score. For the exact expected outcome (the percentage of the match) you would need to create a custom query and scoring mechanism.
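One simple reading of "percentage of the match" is the length of the search term divided by the length of the matched word, which gives 100% for word and 80% for worda as in the question. The sketch below applies that formula on the client side to re-rank hits after they come back; it is an assumption about the intended metric, not a built-in Elasticsearch feature, and both function names are hypothetical.

```python
def match_percent(term, word):
    """Share of word covered by the prefix term, as a percentage."""
    if not term or not word.startswith(term):
        return 0.0
    return 100.0 * len(term) / len(word)


def rerank(term, words):
    """Sort matched words so that the closest prefix match comes first."""
    scored = ((w, match_percent(term, w)) for w in words)
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranked = rerank("word", ["wordab", "word", "wordabcd", "worda"])
for word, percent in ranked:
    print(f"{word}: {percent:.1f}%")
```

Re-ranking on the client only works for the hits that were actually returned, so the page size has to be large enough to include every candidate.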

