
Diversified results on Elasticsearch search

I've written a complex query that uses popularity to improve the results for social media documents in Elasticsearch. The query works well: the top results are always relevant to the query and contain interesting items.

However, it has a problem: for some queries the first results are all from the same user.

I would like to downscore a document if the same user has already been retrieved in a higher-ranked document. This way I expect to get more diversity in the results.

Note that I don't want them to be removed, as in some cases it may still be interesting to find more documents from the same user, but I would like them to appear in a lower position.

Can anybody suggest a way to make this work?


As suggested in some comments, I am adding a simplified version of my query:

query = {"function_score": {
  "functions": [
    {"gauss": {"createdAt":
        {"origin": "now", "scale": "30d", "offset": "7d", "decay": 0.9}
    }},
    {"gauss": {"shares.last.twitter_retweets_log":
        {"origin": 4.52, "scale": 2.61, "decay": 0.9}
    }}
  ],
  "query": {"bool": {"must": [
    {"exists": {"field": "images"}},
    {"multi_match": {"query": "foo boo", "fields": ["text", "link.title"]}}
  ]}},
  "score_mode": "multiply"
}};
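To make the effect of the two `gauss` functions concrete, here is a minimal Python sketch of the Gaussian decay formula as I understand it from the Elasticsearch `function_score` documentation. The helper name `gauss_decay` and the use of plain numbers (days since creation rather than dates) are assumptions for illustration; Elasticsearch evaluates this internally.

```python
import math

def gauss_decay(value, origin, scale, offset=0.0, decay=0.5):
    """Gaussian decay score in (0, 1]; hypothetical helper, not an ES API."""
    # Distance beyond the offset; anything inside the offset keeps a score of 1.
    distance = max(0.0, abs(value - origin) - offset)
    # sigma^2 is chosen so that the score equals `decay` at distance == scale.
    sigma_sq = -scale ** 2 / (2.0 * math.log(decay))
    return math.exp(-distance ** 2 / (2.0 * sigma_sq))

# createdAt function: a document 37 days old (7d offset + 30d scale).
recency = gauss_decay(value=37, origin=0, scale=30, offset=7, decay=0.9)

# Retweet function: a value exactly one scale away from the origin.
popularity = gauss_decay(value=4.52 + 2.61, origin=4.52, scale=2.61, decay=0.9)

# "score_mode": "multiply" multiplies the function scores together.
combined = recency * popularity
print(round(recency, 3), round(popularity, 3), round(combined, 3))
```

Neither function knows which user a document belongs to, which is why several documents from one user can all end up at the top. How the combined value is merged with the text relevance score depends on `boost_mode`, which the query leaves at its default.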

PS: I found some documents that may be relevant, as they talk about diversity, but I'm not sure how to apply them.

You can couple the sampler aggregation with the top_hits aggregation to get diversified results.

{
    "query": {
        "match": {
            "query": "iphone"
        }
    },
    "size":0,
    "aggs": {
        "sample": {
            "sampler": {
                "shard_size": 200,
                "field" : "user.id"                
            },
            "aggs": {
                "diversifiedMatches": {
                    "top_hits": {
                        "size":10
                    }
                }
            }
        }
    }
}
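With `"size": 0` the regular hits list is empty, so the documents have to be read from the aggregation part of the response. The following Python sketch shows one way to do that, assuming the response has already been parsed into a dict and that each hit exposes the user as `_source["user"]["id"]` (an assumption about the mapping; adjust to your documents). The helper names and the sample response are hypothetical. As far as I know, in Elasticsearch 5.0 and later the `field` option lives on a separate `diversified_sampler` aggregation, so the request itself may need adjusting on newer versions.

```python
from collections import Counter

def diversified_hits(response):
    """Return the top_hits documents nested under the sampler aggregation."""
    return response["aggregations"]["sample"]["diversifiedMatches"]["hits"]["hits"]

def users_per_result(hits):
    """Count how many returned documents each user contributed."""
    return Counter(hit["_source"]["user"]["id"] for hit in hits)

# Hand-written response fragment for illustration only, not real output.
example_response = {
    "hits": {"hits": []},
    "aggregations": {"sample": {"doc_count": 3, "diversifiedMatches": {"hits": {"hits": [
        {"_id": "1", "_score": 2.4, "_source": {"user": {"id": "alice"}}},
        {"_id": "2", "_score": 2.1, "_source": {"user": {"id": "bob"}}},
        {"_id": "3", "_score": 1.7, "_source": {"user": {"id": "alice"}}},
    ]}}}},
}

hits = diversified_hits(example_response)
counts = users_per_result(hits)
print([hit["_id"] for hit in hits], dict(counts))
```

Counting users this way is a quick check of how much repetition is left after the per-shard results are merged.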

There are some caveats, e.g.:

1) Deduplication is per-shard, not global

2) The diversification field must be a single-value field

3) No support for pagination

4) No support for sorting on anything other than score

Addressing the above issues would be hard: it would require expensive and complex coordination internally, plus more guidance from the client about when and where "duplicate" results can be reintroduced (page 2? page 3? how many?), etc.
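Because the sampler drops repeated users from the sample rather than demoting them, one alternative (not part of the approach above) is to fetch a larger window of ordinary hits and re-rank it on the client. The sketch below is a minimal Python illustration, assuming each hit carries `_score` and `_source["user"]["id"]`. The function name and the `penalty` factor are hypothetical and would need tuning.

```python
from collections import defaultdict

def demote_repeated_users(hits, penalty=0.5):
    """Re-rank hits so later documents from an already-seen user score lower."""
    seen = defaultdict(int)  # user id -> number of higher-ranked documents
    rescored = []
    for hit in sorted(hits, key=lambda h: h["_score"], reverse=True):
        user = hit["_source"]["user"]["id"]
        # Each earlier document from the same user multiplies the score by `penalty`.
        adjusted = hit["_score"] * (penalty ** seen[user])
        seen[user] += 1
        rescored.append((adjusted, hit))
    # Stable sort on the adjusted score only; no document is removed.
    rescored.sort(key=lambda pair: pair[0], reverse=True)
    return [hit for _, hit in rescored]

# Hand-written hits for illustration only.
sample_hits = [
    {"_id": "a1", "_score": 10.0, "_source": {"user": {"id": "a"}}},
    {"_id": "a2", "_score": 9.0, "_source": {"user": {"id": "a"}}},
    {"_id": "a3", "_score": 8.0, "_source": {"user": {"id": "a"}}},
    {"_id": "b1", "_score": 6.0, "_source": {"user": {"id": "b"}}},
]

reranked_ids = [hit["_id"] for hit in demote_repeated_users(sample_hits)]
print(reranked_ids)
```

This keeps every document and only changes the order, but it diversifies only within the window that was fetched, so the pagination caveat still applies unless the client re-ranks a window large enough to cover the pages it serves.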
