How to get the sum of scores of similar tags from text in Elasticsearch

I am using Elasticsearch (version 6.8) to find the tags most similar to a piece of text. Instead of the score produced by Elasticsearch's default scoring formula, I want the sum of the similarity scores of the matching tags.

For example, I create my_test_index and insert three documents:

POST my_test_index/_doc/17
{
  "id": 17,
  "tags": ["devops", "server", "hardware"]
}

POST my_test_index/_doc/20
{
  "id": 20,
  "tags": ["software", "application", "developer", "develop"]
}

POST my_test_index/_doc/21
{
  "id": 21,
  "tags": ["electronic", "electric"]
}

I did not define a mapping; the default (dynamic) mapping is shown below:

{
  "my_test_index" : {
    "aliases" : { },
    "mappings" : {
      "_doc" : {
        "properties" : {
          "id" : {
            "type" : "long"
          },
          "tags" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "type" : "keyword",
                "ignore_above" : 256
              }
            }
          }
        }
      }
    },
    "settings" : {
      "index" : {
        "creation_date" : "1585820383702",
        "number_of_shards" : "5",
        "number_of_replicas" : "1",
        "uuid" : "05SgLog6S-GTSShTatrvQw",
        "version" : {
          "created" : "6080199"
        },
        "provided_name" : "my_test_index"
      }
    }
  }
}

Then I run the following query:

GET my_test_index/_search
{
  "query": {
    "more_like_this": {
      "fields": [
        "tags"
      ],
      "like": [
        "i like electric devices and develop some softwares."
      ],
      "min_term_freq": 1,
      "min_doc_freq": 1
    }
  }
}

And I get this response:

{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 0.2876821,
    "hits" : [
      {
        "_index" : "my_test_index",
        "_type" : "_doc",
        "_id" : "21",
        "_score" : 0.2876821,
        "_source" : {
          "id" : 21,
          "tags" : [
            "electronic",
            "electric"
          ]
        }
      },
      {
        "_index" : "my_test_index",
        "_type" : "_doc",
        "_id" : "20",
        "_score" : 0.2876821,
        "_source" : {
          "id" : 20,
          "tags" : [
            "software",
            "application",
            "developer",
            "develop"
          ]
        }
      }
    ]
  }
}
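One way to read this response is to check which tags occur verbatim as words in the query text. The sketch below is only an approximation: the regex tokenizer is a rough stand-in for the standard analyzer, and the helper names (tokenize, exact_overlap) are made up for illustration. It suggests that only "electric" and "develop" match exactly, while "softwares" does not match "software" because no stemming is applied.

```python
import re

def tokenize(text):
    """Lowercase and split on non-letters (rough stand-in for the standard analyzer)."""
    return re.findall(r"[a-z]+", text.lower())

def exact_overlap(text, tags):
    """Return the tags that appear verbatim as a word in the text."""
    words = set(tokenize(text))
    return sorted(tag for tag in tags if tag.lower() in words)

# The three documents from the question, keyed by id.
docs = {
    17: ["devops", "server", "hardware"],
    20: ["software", "application", "developer", "develop"],
    21: ["electronic", "electric"],
}
like_text = "i like electric devices and develop some softwares."

overlaps = {doc_id: exact_overlap(like_text, tags) for doc_id, tags in docs.items()}
for doc_id, matched in overlaps.items():
    print(doc_id, matched)
```

Under these assumptions, document 17 has no exact overlap, which is consistent with it being absent from the hits.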

If I set explain: true, the result is:

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 0.2876821,
    "hits" : [
      {
        "_shard" : "[my_test_index][1]",
        "_node" : "maQL1REnQHaff51ekrqMxA",
        "_index" : "my_test_index",
        "_type" : "_doc",
        "_id" : "21",
        "_score" : 0.2876821,
        "_source" : {
          "id" : 21,
          "tags" : [
            "electronic",
            "electric"
          ]
        },
        "_explanation" : {
          "value" : 0.2876821,
          "description" : "weight(tags:electric in 0) [PerFieldSimilarity], result of:",
          "details" : [
            {
              "value" : 0.2876821,
              "description" : "score(doc=0,freq=1.0 = termFreq=1.0\n), product of:",
              "details" : [
                {
                  "value" : 0.2876821,
                  "description" : "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
                  "details" : [
                    {
                      "value" : 1.0,
                      "description" : "docFreq",
                      "details" : [ ]
                    },
                    {
                      "value" : 1.0,
                      "description" : "docCount",
                      "details" : [ ]
                    }
                  ]
                },
                {
                  "value" : 1.0,
                  "description" : "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
                  "details" : [
                    {
                      "value" : 1.0,
                      "description" : "termFreq=1.0",
                      "details" : [ ]
                    },
                    {
                      "value" : 1.2,
                      "description" : "parameter k1",
                      "details" : [ ]
                    },
                    {
                      "value" : 0.75,
                      "description" : "parameter b",
                      "details" : [ ]
                    },
                    {
                      "value" : 2.0,
                      "description" : "avgFieldLength",
                      "details" : [ ]
                    },
                    {
                      "value" : 2.0,
                      "description" : "fieldLength",
                      "details" : [ ]
                    }
                  ]
                }
              ]
            }
          ]
        }
      },
      {
        "_shard" : "[my_test_index][2]",
        "_node" : "maQL1REnQHaff51ekrqMxA",
        "_index" : "my_test_index",
        "_type" : "_doc",
        "_id" : "20",
        "_score" : 0.2876821,
        "_source" : {
          "id" : 20,
          "tags" : [
            "software",
            "application",
            "developer",
            "develop"
          ]
        },
        "_explanation" : {
          "value" : 0.2876821,
          "description" : "weight(tags:develop in 0) [PerFieldSimilarity], result of:",
          "details" : [
            {
              "value" : 0.2876821,
              "description" : "score(doc=0,freq=1.0 = termFreq=1.0\n), product of:",
              "details" : [
                {
                  "value" : 0.2876821,
                  "description" : "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
                  "details" : [
                    {
                      "value" : 1.0,
                      "description" : "docFreq",
                      "details" : [ ]
                    },
                    {
                      "value" : 1.0,
                      "description" : "docCount",
                      "details" : [ ]
                    }
                  ]
                },
                {
                  "value" : 1.0,
                  "description" : "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
                  "details" : [
                    {
                      "value" : 1.0,
                      "description" : "termFreq=1.0",
                      "details" : [ ]
                    },
                    {
                      "value" : 1.2,
                      "description" : "parameter k1",
                      "details" : [ ]
                    },
                    {
                      "value" : 0.75,
                      "description" : "parameter b",
                      "details" : [ ]
                    },
                    {
                      "value" : 4.0,
                      "description" : "avgFieldLength",
                      "details" : [ ]
                    },
                    {
                      "value" : 4.0,
                      "description" : "fieldLength",
                      "details" : [ ]
                    }
                  ]
                }
              ]
            }
          ]
        }
      }
    ]
  }
}
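The arithmetic in the explanation above can be checked by hand. The following sketch simply re-implements the two formulas quoted in the explain output (idf and tfNorm) with the values it reports; the function names are illustrative and not part of any Elasticsearch API.

```python
import math

def bm25_idf(doc_freq, doc_count):
    """idf as quoted in the explain output."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

def bm25_tf_norm(freq, field_length, avg_field_length, k1=1.2, b=0.75):
    """tfNorm as quoted in the explain output."""
    return (freq * (k1 + 1)) / (
        freq + k1 * (1 - b + b * field_length / avg_field_length)
    )

# Values taken from the explanation for _id 21 (term "electric").
score_21 = bm25_idf(doc_freq=1, doc_count=1) * bm25_tf_norm(1.0, 2.0, 2.0)

# Values taken from the explanation for _id 20 (term "develop").
score_20 = bm25_idf(doc_freq=1, doc_count=1) * bm25_tf_norm(1.0, 4.0, 4.0)

print(round(score_21, 7), round(score_20, 7))
```

Both explanations report docCount = 1 even though the index holds three documents, and the two hits come from different shards ([1] and [2]). That suggests the statistics here are computed per shard, so each matching term gets the same idf, and tfNorm is 1.0 in both cases because fieldLength equals avgFieldLength. This appears to be why the two scores are identical.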

However, this is not the result I want. I want to compute the sum of the similarity scores of the tags, like this: the text contains the word "electric", which is equal to the "electric" tag and so gets a score of 1.0, and is similar to the "electronic" tag and so gets a score of ~0.7. Likewise, the word "develop" in the text is equal to the "develop" tag (score 1.0) and similar to the "developer" tag (score ~0.8), and the word "softwares" is similar to the "software" tag (score ~0.9), and so on.

So I expect this result: the summed score for _id:20 is ~2.7, for _id:21 it is ~1.7, and so on.
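To make the desired scoring concrete, here is a minimal client-side sketch of "sum of per-tag similarity scores". It is not an Elasticsearch feature and not necessarily the similarity measure intended above: it assumes difflib.SequenceMatcher from the Python standard library as the string similarity, a hypothetical cutoff of 0.8 below which a tag is ignored, and illustrative function names. The individual values it produces will differ from the ~0.7/~0.8/~0.9 figures above, but the ranking has the same shape.

```python
import re
from difflib import SequenceMatcher

def tokenize(text):
    """Lowercase and split on non-letters (a simplification, not an ES analyzer)."""
    return re.findall(r"[a-z]+", text.lower())

def similarity(a, b):
    """String similarity in [0, 1]; SequenceMatcher is an assumed stand-in."""
    return SequenceMatcher(None, a, b).ratio()

def summed_tag_score(text, tags, threshold=0.8):
    """Sum, over all tags, the best similarity to any word in the text.

    Tags whose best similarity is below `threshold` (a hypothetical cutoff)
    contribute nothing.
    """
    words = tokenize(text)
    total = 0.0
    for tag in tags:
        best = max((similarity(tag.lower(), w) for w in words), default=0.0)
        if best >= threshold:
            total += best
    return total

# The three documents from the question, keyed by id.
docs = {
    17: ["devops", "server", "hardware"],
    20: ["software", "application", "developer", "develop"],
    21: ["electronic", "electric"],
}
like_text = "i like electric devices and develop some softwares."

scores = {doc_id: summed_tag_score(like_text, tags) for doc_id, tags in docs.items()}
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
for doc_id, score in ranked:
    print(doc_id, round(score, 3))
```

With these assumptions, document 20 ranks above document 21, and document 17 scores zero. The threshold matters: a lower cutoff would let loosely similar pairs such as "devops" and "develop" contribute.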

I was hoping someone could provide an example of how to do this, or at least point me in the right direction.

Thanks.

I think you are not using the text type for the tags field in your mapping, which causes ids 20 and 21 to get the same score. I defined tags as text in my mapping and got a higher score for id 21, as expected.

Below is my solution.

Index definition (note that tags is mapped as text)

{
    "mappings": {
        "properties": {
            "id": {
                "type": "integer"
            },
            "tags" : {
                "type" : "text"
            }
        }
    }
}

I indexed the sample documents you provided and used the same search query.

Search query

{
  "query": {
    "more_like_this": {
      "fields": [
        "tags"
      ],
      "like": [
        "i like electric devices and develop some softwares."
      ],
      "min_term_freq": 1,
      "min_doc_freq": 1
    }
  }
}

Search result (note the _score values)

 "hits": [
         {
            "_index": "so_array",
            "_type": "_doc",
            "_id": "3",
            "_score": 1.135697,
            "_source": {
               "id": 21,
               "tags": [
                  "electronic",
                  "electric"
               ]
            }
         },
         {
            "_index": "so_array",
            "_type": "_doc",
            "_id": "2",
            "_score": 0.86312973,
            "_source": {
               "id": 20,
               "tags": [
                  "software",
                  "application",
                  "developer",
                  "develop"
               ]
            }
         }
      ]
