Elasticsearch fixed score based on content similarity

I am working on a tool to identify similar documents and mark them as duplicates.

To do so, I am using Elasticsearch to compare the documents' content, so that Elasticsearch takes care of handling synonyms and possible typos. However, I have not managed to come up with a query that reaches my goal.

So far I have come up with this query:

{
 "query":{
    "filtered":{
       "query":{
          "more_like_this":{
             "fields":[
                "description"
             ],
             "like_text":"Lorem ipsum dolor sit amet, consectetur adipiscing elit.",
             "min_term_freq":1,
             "max_query_terms":999,
             "min_doc_freq":1
          }
       }
    }
 },
 "from":0,
 "size":999,
 "search_type": "dfs_query_then_fetch",
 "sort":[
    {
       "_score":{
          "order":"desc"
       }
    }
 ]
}

But the score it gives me seems quite random. I would like a score of 100 for contents that are exactly equal and 0 for contents that are completely different.

I see where you are going, but out of the box the score is only meaningful within that particular query, because it is all based on term frequencies and position. So the score is fine for ranking the results of that query, but it cannot be compared from one query to another. I would therefore simply wrap that in a constant score query.
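As a rough illustration of that wrapping, here is a minimal Python sketch that builds the search body as a plain dict. The helper name `constant_score_mlt` is made up for this example, and it assumes an Elasticsearch version where `constant_score` accepts a `filter` clause containing a query (older 1.x releases also took a `query` clause, and newer releases replace `like_text` with `like`). Every matching document then gets the same score, equal to `boost`.

```python
import json


def constant_score_mlt(text, field="description", boost=1.0):
    """Build a search body that wraps more_like_this in constant_score.

    Hypothetical helper: the name and defaults are not from the original post.
    """
    mlt = {
        "more_like_this": {
            "fields": [field],
            "like_text": text,  # same parameters as the query in the question
            "min_term_freq": 1,
            "max_query_terms": 999,
            "min_doc_freq": 1,
        }
    }
    # Assumption: constant_score takes the wrapped query under "filter".
    return {"query": {"constant_score": {"filter": mlt, "boost": boost}}}


body = constant_score_mlt("Lorem ipsum dolor sit amet, consectetur adipiscing elit.")
print(json.dumps(body, indent=2))
```

Note that this only makes the score fixed; it does not by itself say how similar a match is, since every hit receives the same value.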

If you are willing to put each term in its own query, I can provide an example that might solve this with multiple constant score queries in a bool query inside another bool query.
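The answer does not include that example, so the following is only a guess at the general idea, simplified to a single `bool` query with one `constant_score` clause per term. The helpers `tokenize`, `per_term_query` and `to_percent` are hypothetical names. The sketch assumes that the score of a `bool` query with only `should` clauses is the sum of the boosts of the matching clauses; older releases apply extra factors such as coordination and query normalization, so the raw score may need checking on your version.

```python
import re


def tokenize(text):
    """Crude tokenizer: lowercase words, duplicates dropped, order kept."""
    seen = []
    for token in re.findall(r"\w+", text.lower()):
        if token not in seen:
            seen.append(token)
    return seen


def per_term_query(text, field="description"):
    """Return (search body, term count) with one constant_score per term."""
    terms = tokenize(text)
    should = [
        # match (not term) so the field's analyzer still runs on each word
        {"constant_score": {"filter": {"match": {field: t}}, "boost": 1}}
        for t in terms
    ]
    return {"query": {"bool": {"should": should}}}, len(terms)


def to_percent(raw_score, term_count):
    """Scale a raw score to 0..100, assuming it counts the matched terms."""
    if term_count == 0:
        return 0.0
    return min(100.0, 100.0 * raw_score / term_count)


body, term_count = per_term_query(
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit."
)
print(term_count, to_percent(4, term_count))
```

Under that assumption, a document matching every term would come out at 100 and one matching half of them at 50. The measure is one-directional: a document that contains all the query terms plus a lot of unrelated text would still reach 100.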
