
Elasticsearch - search wildcards (contains in strings) and tf-idf scores

How can I run a wildcard search and still get tf-idf scores? For example, when I search like this:

GET /test_es/_search?explain=true   // returns idf/tf scores
{
  "explain":true,
  "query": {
    "query_string": {
      "query": "bar^5",
      "fields"  : ["field"]
    }
  }
}

it returns the idf and tf scores, but not when I search with wildcards (contains):

GET /test_es/_search?explain=true   // does NOT return idf/tf scores
{
   "explain":true,
  "query": {
    "query_string": {
      "query": "b*",
      "fields"  : ["field"]
    }
  }
}

How can I search with wildcards (a contains match within the string) and still get the tf-idf scores?

For example, I have 3 documents, "foo", "foo bar", and "foo baz", and I search like this:

GET /foo2/_search?explain=true
{
   "explain":true,
  "query": {
    "query_string": {
      "query": "fo *",
      "fields"  : ["field"]
    }
  }
}

Elasticsearch result:

"hits" : [
  {
    "_shard" : "[foo2][0]",
    "_node" : "z8bjI0T1T8Oq6Z2OwFyIKw",
    "_index" : "foo2",
    "_type" : "_doc",
    "_id" : "3",
    "_score" : 1.0,
    "_source" : {
      "field" : "foo bar"
    },
    "_explanation" : {
      "value" : 1.0,
      "description" : "sum of:",
      "details" : [
        {
          "value" : 1.0,
          "description" : "*:*",
          "details" : [ ]
        }
      ]
    }
  },
  {
    "_shard" : "[foo2][0]",
    "_node" : "z8bjI0T1T8Oq6Z2OwFyIKw",
    "_index" : "foo2",
    "_type" : "_doc",
    "_id" : "2",
    "_score" : 1.0,
    "_source" : {
      "field" : "foo"
    },
    "_explanation" : {
      "value" : 1.0,
      "description" : "sum of:",
      "details" : [
        {
          "value" : 1.0,
          "description" : "*:*",
          "details" : [ ]
        }
      ]
    }
  },
  {
    "_shard" : "[foo2][0]",
    "_node" : "z8bjI0T1T8Oq6Z2OwFyIKw",
    "_index" : "foo2",
    "_type" : "_doc",
    "_id" : "1",
    "_score" : 1.0,
    "_source" : {
      "field" : "foo baz"
    },
    "_explanation" : {
      "value" : 1.0,
      "description" : "sum of:",
      "details" : [
        {
          "value" : 1.0,
          "description" : "*:*",
          "details" : [ ]
        }
      ]
    }
  }
]

But I expect "foo" to be the first result with the highest score, because it matches 100%. Am I wrong?

Since you have not mentioned anything about the data you are using, I have indexed the following data:

Index data:

{
    "message": "A fox is a wild animal."
}
{
    "message": "That fox must have killed the hen."
}
{
    "message": "the quick brown fox jumps over the lazy dog"
}

Search Query:

GET /{{index-name}}/_search?explain=true
{
  "query": {
    "query_string": {
      "fields": [
        "message"
      ],
      "query": "quick^2 fox*"
    }
  }
}

You can add more fields to the "fields" array.

The query above matches all the documents containing fox, but since a boost is applied to quick, the document containing both quick and fox gets a higher score than the other documents.

This query will return the tf-idf score. The boost operator is used to make one term more relevant than another.

To learn more about this, refer to the official documentation on "Boosting" in dsl-query-string.

To learn more about the tf-idf algorithm, you can refer to this blog.
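As a rough illustration of what a tf-idf score expresses, here is a minimal Python sketch over the three sample messages. It uses a textbook tf-idf formula with a smoothed idf, which is an assumption made for illustration: Elasticsearch's default similarity is BM25, so the numbers in an explain output will not match these. The helper names (tf, idf, tf_idf) are hypothetical.

```python
import math

# The three sample messages from this answer, lowercased.
docs = [
    "a fox is a wild animal.",
    "that fox must have killed the hen.",
    "the quick brown fox jumps over the lazy dog",
]
# Naive tokenization: drop the periods and split on whitespace.
tokenized = [d.replace(".", "").split() for d in docs]


def tf(term, tokens):
    # Term frequency: share of the document's tokens equal to the term.
    return tokens.count(term) / len(tokens)


def idf(term, corpus):
    # Smoothed inverse document frequency (textbook variant, not Lucene's).
    df = sum(1 for tokens in corpus if term in tokens)
    return math.log((1 + len(corpus)) / (1 + df)) + 1


def tf_idf(term, tokens, corpus):
    return tf(term, tokens) * idf(term, corpus)


for term in ("fox", "quick"):
    scores = [round(tf_idf(term, t, tokenized), 3) for t in tokenized]
    print(term, scores)
```

Because fox occurs in every document, its idf is the lowest possible, while quick occurs in only one document and therefore weighs more. That is the intuition behind the third document ranking first for "quick^2 fox*".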

If you want to search across multiple fields, you can boost the scores of a certain field.

Refer to this and this to learn more.
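A minimal sketch of per-field boosting, assuming a second field named title exists next to message (title is hypothetical here, as is the helper name boosted_query_string). It only builds the request body, using the field^boost syntax that query_string accepts in its fields array, and prints it as JSON so it can be pasted into a search request.

```python
import json


def boosted_query_string(query, field_boosts):
    # Turn {"title": 2, "message": 1} into ["title^2", "message"].
    fields = [
        f"{name}^{boost}" if boost != 1 else name
        for name, boost in field_boosts.items()
    ]
    return {"query": {"query_string": {"query": query, "fields": fields}}}


# "title" is an assumed field; matches in it count twice as much as in "message".
body = boosted_query_string("quick fox", {"title": 2, "message": 1})
print(json.dumps(body, indent=2))
```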

Update 1:

Index Data:

{
  "title": "foo bar"
}
{
  "title": "foo baz"
}
{
  "title": "foo"
}

Search Query:

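{
  "query": {
    "query_string": {
      "query": "foo *"
    }
  }
}

Note the space between foo and *: the query string is then parsed as two clauses, the term foo (scored normally) and a bare * that matches every document.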

Search Result (foo matches exactly, so it gets the highest score):

"hits": [
         {
            "_index": "foo2",
            "_type": "_doc",
            "_id": "1",
            "_score": 1.9808292,
            "_source": {
               "title": "foo"
            }
         },
         {
            "_index": "foo2",
            "_type": "_doc",
            "_id": "2",
            "_score": 1.1234324,
            "_source": {
               "title": "foo bar"
            }
         },
         {
            "_index": "foo2",
            "_type": "_doc",
            "_id": "3",
            "_score": 1.1234324,
            "_source": {
               "title": "foo baz"
            }
         }
      ]

Update 2:

Wildcard queries fall under term-level queries and by default use a constant-score rewrite (the constant_score_boolean method when few terms match), so every matching document gets the same score and no tf-idf details are reported.

By changing the value of the rewrite parameter you can affect both search performance and relevance. It has several options for scoring, and you can choose one according to your requirements.
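As a sketch of that idea, the snippet below builds a wildcard query whose rewrite is set to scoring_boolean, one of the documented rewrite values that computes a relevance score per matching term instead of a constant score. The field name field and the pattern b* come from the question; the helper name wildcard_with_scoring is hypothetical. I have not benchmarked this, and scoring rewrites expand the pattern into a bool query, so a pattern that matches many terms may run into the bool clause limit.

```python
import json


def wildcard_with_scoring(field, pattern, rewrite="scoring_boolean"):
    # Request body for a wildcard query with an explicit rewrite method.
    return {
        "explain": True,
        "query": {
            "wildcard": {
                field: {"value": pattern, "rewrite": rewrite}
            }
        },
    }


body = wildcard_with_scoring("field", "b*")
print(json.dumps(body, indent=2))
```

If the pattern can match many terms, a top_terms_N rewrite (for example top_terms_10) may be a safer choice, since it keeps only the N highest-scoring terms.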

But depending on your use case, you may also use edge n-grams (the mapping below uses the edge_ngram tokenizer). Edge n-grams are useful for search-as-you-type queries. To learn more about this and the mapping used below, refer to this official documentation.

Index Mapping:

{
  "settings": {
    "analysis": {
      "analyzer": {
        "autocomplete": {
          "tokenizer": "autocomplete",
          "filter": [
            "lowercase"
          ]
        },
        "autocomplete_search": {
          "tokenizer": "lowercase"
        }
      },
      "tokenizer": {
        "autocomplete": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 10,
          "token_chars": [
            "letter"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "autocomplete",
        "search_analyzer": "autocomplete_search"
      }
    }
  }
}
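To see roughly which tokens the autocomplete analyzer in the mapping above would index, here is a small pure-Python approximation. It is not the Lucene implementation, only a sketch of the same settings (min_gram 2, max_gram 10, token_chars letter, then lowercase), and the function name edge_ngrams is hypothetical.

```python
import re


def edge_ngrams(text, min_gram=2, max_gram=10):
    tokens = []
    # token_chars ["letter"]: anything that is not a letter splits the text.
    for word in re.findall(r"[^\W\d_]+", text):
        word = word.lower()  # the "lowercase" filter
        # Emit prefixes of length min_gram up to max_gram (or the word length).
        for n in range(min_gram, min(max_gram, len(word)) + 1):
            tokens.append(word[:n])
    return tokens


for title in ("foo", "foo bar", "foo baz"):
    print(title, "->", edge_ngrams(title))
# foo -> ['fo', 'foo']
# foo bar -> ['fo', 'foo', 'ba', 'bar']
# foo baz -> ['fo', 'foo', 'ba', 'baz']
```

Since the prefix fo is stored as a regular token for every title, a plain match query for fo is scored like any other term match, and no wildcard is needed at search time.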

Index sample data:

{ "title":"foo" }
{ "title":"foo bar" }
{ "title":"foo baz" }

Search Query:

{
  "query": {
    "match": {
      "title": {
        "query": "fo"
      }
    }
  }
}

Search Result (the document foo gets the maximum score):

"hits": [
            {
                "_index": "foo6",
                "_type": "_doc",
                "_id": "1",
                "_score": 0.15965709,
                "_source": {
                    "title": "foo"
                }
            },
            {
                "_index": "foo6",
                "_type": "_doc",
                "_id": "2",
                "_score": 0.12343237,
                "_source": {
                    "title": "foo bar"
                }
            },
            {
                "_index": "foo6",
                "_type": "_doc",
                "_id": "3",
                "_score": 0.12343237,
                "_source": {
                    "title": "foo baz"
                }
            }
        ]

To learn more about the basics of using n-grams in Elasticsearch, you can refer to this.
