
Score by closest match in Elasticsearch

I have an Elasticsearch::Model on an ActiveRecord::Base model that looks like this:

class ArtistGroup < ActiveRecord::Base
  include Elasticsearch::Model
  include Elasticsearch::Model::Callbacks

  FT_REDIS_KEY = "agft"
  has_many :artists

  settings index: { number_of_shards: 5 } do
    mappings dynamic: 'false' do
      indexes :normalized_name, analyzer: 'english'
      indexes :name, analyzer: 'english'
    end
  end

  def as_indexed_json(options={})
    as_json(only: ['normalized_name', 'id', 'name'])
  end
....

When I search with .search('haim'), I want the document with name: "Haim" to be returned first, before others like "Danielle Haim of Haim". How can I control the ES query so that it scores by closest match?

By default, Elasticsearch returns results sorted by relevance (i.e., the score of each document).

This score is calculated from a set of basic rules combined with some query-specific rules.

The standard similarity algorithm used in Elasticsearch is known as term frequency/inverse document frequency, or TF/IDF (since version 5.0 the default is BM25, which refines the same factors). It takes the following factors into account:

  • Term frequency: How often does the term appear in the field? The more often, the more relevant. A field containing five mentions of the same term is more likely to be relevant than a field containing just one mention.
  • Inverse document frequency: How often does each term appear in the index? The more often, the less relevant. Terms that appear in many documents have a lower weight than more uncommon terms.
  • Field-length norm: How long is the field? The longer it is, the less likely it is that words in the field will be relevant. A term appearing in a short title field carries more weight than the same term appearing in a long content field.
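The three factors above can be sketched in plain Ruby. This is only an illustration using the classic Lucene formulas (tf = sqrt(frequency), idf = 1 + ln(numDocs / (docFreq + 1)), norm = 1 / sqrt(numTerms)), not what Elasticsearch runs internally. The method names and the token lists are hypothetical, and the sketch assumes the english analyzer lowercases the names and drops the stopword "of".

```ruby
# Simplified sketch of the three factors above, using the classic Lucene
# formulas. Assumes fields are already tokenized; ignores boosts, query
# normalization, and Lucene's lossy norm encoding.

# Term frequency: sqrt of the number of times the term occurs in the field.
def term_frequency(term, tokens)
  Math.sqrt(tokens.count(term))
end

# Inverse document frequency: 1 + ln(num_docs / (doc_freq + 1)).
def inverse_document_frequency(term, all_docs)
  doc_freq = all_docs.count { |tokens| tokens.include?(term) }
  1.0 + Math.log(all_docs.size.to_f / (doc_freq + 1))
end

# Field-length norm: 1 / sqrt(number of terms in the field).
def field_length_norm(tokens)
  1.0 / Math.sqrt(tokens.size)
end

# Weight of one term in one field: tf * idf * norm.
def field_weight(term, tokens, all_docs)
  term_frequency(term, tokens) *
    inverse_document_frequency(term, all_docs) *
    field_length_norm(tokens)
end

# Hypothetical token lists for the two names in the question
# (lowercased, with the stopword "of" removed).
DOCS = {
  'Haim'                  => %w[haim],
  'Danielle Haim of Haim' => %w[danielle haim haim]
}.freeze

SCORES = DOCS.transform_values { |tokens| field_weight('haim', tokens, DOCS.values) }

# Print the names from highest to lowest weight.
SCORES.sort_by { |_, score| -score }.each do |name, score|
  puts format('%-22s %.3f', name, score)
end
```

In this sketch the one-term field comes out ahead, because its length norm (1.0) outweighs the higher term frequency of the longer name. Real scores can differ: inverse document frequency is computed per shard by default, and the index in the question has five shards.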

Individual queries may combine the TF/IDF score with other factors, such as term proximity in phrase queries or term similarity in fuzzy queries.
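As a rough sketch of the phrase-proximity case, a bool query can require a regular match and additionally reward documents in which the terms appear as a phrase. The helper name `phrase_boosted_query` and the query text are hypothetical. `bool`, `match`, `match_phrase` and `explain` are part of the Elasticsearch query DSL; whether this changes the ranking for a given index depends on the data.

```ruby
require 'json'

# Hypothetical helper (not part of elasticsearch-model): builds a search body
# whose `must` clause selects documents and whose `should` clause adds to the
# score of documents that contain the terms as a phrase.
def phrase_boosted_query(text, field: 'name')
  {
    query: {
      bool: {
        must:   [{ match: { field => text } }],
        should: [{ match_phrase: { field => text } }]
      }
    },
    # Ask Elasticsearch to return a per-hit breakdown of the score.
    explain: true
  }
end

definition = phrase_boosted_query('danielle haim')
puts JSON.pretty_generate(definition)

# With the model from the question, the hash could be passed as the search
# body (requires a running cluster, so it is left commented out here):
#   ArtistGroup.search(definition).results
```

With `explain: true`, each hit should carry an `_explanation` entry showing how term frequency, inverse document frequency and the field norm contributed to its score, which is a practical way to see why one document outranks another.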

For a complete description of relevance, please refer to: http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/sorting.html
