How to get the sum of the scores of similar tags from text in Elasticsearch
I am trying to use Elasticsearch (version 6.8) to find the tags most similar to a piece of text, and I expect to get the sum of the scores of the similar tags instead of Elasticsearch's default score calculation (formula).
For example, I create my_test_index and insert three documents:
POST my_test_index/_doc/17
{
"id": 17,
"tags": ["devops", "server", "hardware"]
}
POST my_test_index/_doc/20
{
"id": 20,
"tags": ["software", "application", "developer", "develop"]
}
POST my_test_index/_doc/21
{
"id": 21,
"tags": ["electronic", "electric"]
}
There is no explicit mapping; the default one is as below:
{
"my_test_index" : {
"aliases" : { },
"mappings" : {
"_doc" : {
"properties" : {
"id" : {
"type" : "long"
},
"tags" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
}
}
},
"settings" : {
"index" : {
"creation_date" : "1585820383702",
"number_of_shards" : "5",
"number_of_replicas" : "1",
"uuid" : "05SgLog6S-GTSShTatrvQw",
"version" : {
"created" : "6080199"
},
"provided_name" : "my_test_index"
}
}
}
}
So, I send the query below:
GET my_test_index/_search
{
"query": {
"more_like_this": {
"fields": [
"tags"
],
"like": [
"i like electric devices and develop some softwares."
],
"min_term_freq": 1,
"min_doc_freq": 1
}
}
}
And get this response:
{
"took" : 4,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 2,
"max_score" : 0.2876821,
"hits" : [
{
"_index" : "my_test_index",
"_type" : "_doc",
"_id" : "21",
"_score" : 0.2876821,
"_source" : {
"id" : 21,
"tags" : [
"electronic",
"electric"
]
}
},
{
"_index" : "my_test_index",
"_type" : "_doc",
"_id" : "20",
"_score" : 0.2876821,
"_source" : {
"id" : 20,
"tags" : [
"software",
"application",
"developer",
"develop"
]
}
}
]
}
}
If I set explain: true, the result is:
{
"took" : 3,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 2,
"max_score" : 0.2876821,
"hits" : [
{
"_shard" : "[my_test_index][1]",
"_node" : "maQL1REnQHaff51ekrqMxA",
"_index" : "my_test_index",
"_type" : "_doc",
"_id" : "21",
"_score" : 0.2876821,
"_source" : {
"id" : 21,
"tags" : [
"electronic",
"electric"
]
},
"_explanation" : {
"value" : 0.2876821,
"description" : "weight(tags:electric in 0) [PerFieldSimilarity], result of:",
"details" : [
{
"value" : 0.2876821,
"description" : "score(doc=0,freq=1.0 = termFreq=1.0\n), product of:",
"details" : [
{
"value" : 0.2876821,
"description" : "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
"details" : [
{
"value" : 1.0,
"description" : "docFreq",
"details" : [ ]
},
{
"value" : 1.0,
"description" : "docCount",
"details" : [ ]
}
]
},
{
"value" : 1.0,
"description" : "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
"details" : [
{
"value" : 1.0,
"description" : "termFreq=1.0",
"details" : [ ]
},
{
"value" : 1.2,
"description" : "parameter k1",
"details" : [ ]
},
{
"value" : 0.75,
"description" : "parameter b",
"details" : [ ]
},
{
"value" : 2.0,
"description" : "avgFieldLength",
"details" : [ ]
},
{
"value" : 2.0,
"description" : "fieldLength",
"details" : [ ]
}
]
}
]
}
]
}
},
{
"_shard" : "[my_test_index][2]",
"_node" : "maQL1REnQHaff51ekrqMxA",
"_index" : "my_test_index",
"_type" : "_doc",
"_id" : "20",
"_score" : 0.2876821,
"_source" : {
"id" : 20,
"tags" : [
"software",
"application",
"developer",
"develop"
]
},
"_explanation" : {
"value" : 0.2876821,
"description" : "weight(tags:develop in 0) [PerFieldSimilarity], result of:",
"details" : [
{
"value" : 0.2876821,
"description" : "score(doc=0,freq=1.0 = termFreq=1.0\n), product of:",
"details" : [
{
"value" : 0.2876821,
"description" : "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
"details" : [
{
"value" : 1.0,
"description" : "docFreq",
"details" : [ ]
},
{
"value" : 1.0,
"description" : "docCount",
"details" : [ ]
}
]
},
{
"value" : 1.0,
"description" : "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
"details" : [
{
"value" : 1.0,
"description" : "termFreq=1.0",
"details" : [ ]
},
{
"value" : 1.2,
"description" : "parameter k1",
"details" : [ ]
},
{
"value" : 0.75,
"description" : "parameter b",
"details" : [ ]
},
{
"value" : 4.0,
"description" : "avgFieldLength",
"details" : [ ]
},
{
"value" : 4.0,
"description" : "fieldLength",
"details" : [ ]
}
]
}
]
}
]
}
}
]
}
}
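The two identical scores can be checked by hand from the explain output above. The following is a minimal sketch that plugs the reported values (docFreq, docCount, k1, b, fieldLength, avgFieldLength) into the two formulas quoted in the "description" fields. The function names bm25_idf and bm25_tf_norm are made up for this illustration and are not part of any Elasticsearch API.

```python
import math


def bm25_idf(doc_freq, doc_count):
    # idf formula as quoted in the explain output
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def bm25_tf_norm(freq, k1, b, field_length, avg_field_length):
    # tfNorm formula as quoted in the explain output
    return (freq * (k1 + 1)) / (
        freq + k1 * (1 - b + b * field_length / avg_field_length)
    )


# Values copied from the explain output: both hits report docFreq = docCount = 1.
idf = bm25_idf(doc_freq=1.0, doc_count=1.0)

# _id 21: fieldLength = avgFieldLength = 2
score_21 = idf * bm25_tf_norm(1.0, 1.2, 0.75, 2.0, 2.0)
# _id 20: fieldLength = avgFieldLength = 4
score_20 = idf * bm25_tf_norm(1.0, 1.2, 0.75, 4.0, 4.0)

print(score_21, score_20)
```

Because fieldLength equals avgFieldLength in both explanations, tfNorm is 1.0 for both hits, and since both also report docFreq = 1 and docCount = 1, the idf term alone determines the two equal scores.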
But this is not the result I want. I want to calculate the sum of the scores of similar tags, like this: the word "electric" in the text is equal to the "electric" tag, so that tag gets a score of 1.0, and it is similar to the "electronic" tag, which gets ~0.7. Likewise, the word "develop" in the text is equal to the "develop" tag (1.0) and similar to the "developer" tag (~0.8), the word "softwares" is similar to the "software" tag (~0.9), and so on.
So, I expect this result: the summed score of _id:20 is ~2.7, of _id:21 is ~1.7, and so on.
I was hoping someone could provide an example of how to do this, or at least point me in the right direction.
Thanks.
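To make the requested scoring concrete, here is a minimal client-side sketch of "sum of per-tag similarity". It does not run inside Elasticsearch and is not how more_like_this scores documents. It assumes Python's difflib.SequenceMatcher as the string-similarity measure and a cutoff of 0.8, both chosen only for illustration, so the numbers will not match the ~0.7/~0.8/~0.9 figures in the question. The function name sum_similar_tag_scores is hypothetical.

```python
import re
from difflib import SequenceMatcher


def sum_similar_tag_scores(text, tags, threshold=0.8):
    """Sum, over all tags, the best similarity between the tag and any word
    of the text. Tags whose best similarity is below the threshold add 0."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    total = 0.0
    for tag in tags:
        best = max(
            (SequenceMatcher(None, tag.lower(), word).ratio() for word in words),
            default=0.0,
        )
        if best >= threshold:  # an exact match contributes 1.0
            total += best
    return total


# The three documents from the question, keyed by their id.
docs = {
    17: ["devops", "server", "hardware"],
    20: ["software", "application", "developer", "develop"],
    21: ["electronic", "electric"],
}
text = "i like electric devices and develop some softwares."

scores = {doc_id: sum_similar_tag_scores(text, tags) for doc_id, tags in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores, ranking)
```

With this measure, document 20 ranks above document 21, and document 17 contributes nothing, which is the ordering the question asks for. The similarity function and threshold are the parts to swap out if a different notion of "similar" is needed.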
I think you are not using the text field type for the tags field in your mapping, which is causing both ids 20 and 21 to have the same score. I defined it as text in my mapping and got a higher score for id 21, which is expected.
Below is my solution. Index mapping:
{
"mappings": {
"properties": {
"id": {
"type": "integer"
},
"tags" : {
"type" : "text" --> note this
}
}
}
}
I indexed the sample docs you provided and used the same search query:
{
"query": {
"more_like_this": {
"fields": [
"tags"
],
"like": [
"i like electric devices and develop some softwares."
],
"min_term_freq": 1,
"min_doc_freq": 1
}
}
}
Search result:
"hits": [
{
"_index": "so_array",
"_type": "_doc",
"_id": "3",
"_score": 1.135697, --> note score
"_source": {
"id": 21,
"tags": [
"electronic",
"electric"
]
}
},
{
"_index": "so_array",
"_type": "_doc",
"_id": "2",
"_score": 0.86312973, --> note score
"_source": {
"id": 20,
"tags": [
"software",
"application",
"developer",
"develop"
]
}
}
]