Getting ElasticSearch to score number of total nested hits across results (idf?) higher than tf of single hit?

Forgive me if I'm mangling the terminology, but I am having problems getting ES to score results in a way that makes sense for my app.

I am indexing thousands of Users with several simple fields, as well as potentially hundreds of child objects nested in the index for each user (i.e. the Book --> Pages data model). The JSON sent to the index looks like this:

user_id: 1
  full_name: First User
  username: firstymcfirsterton
  posts: 
   id: 2
    title: Puppies are Awesome
    tags:
     - dog house
     - dog supplies
     - dogs
     - doggies
     - hot dogs
     - dog lovers

user_id: 2
  full_name: Second User
  username: seconddude
  posts: 
   id: 3
    title: Dogs are the best
    tags:
     - dog supperiority
     - dog
   id: 4
    title: Why dogs eat?
    tags: 
     - dog diet
     - canines
   id: 5
    title: Who let the dogs out?
    tags:
     - dogs
     - terrible music

The tags are of type "tags", use the "keyword" analyzer, and are boosted by 10. Titles are not boosted.

When I search for "dog", the first user gets a higher score than the second user. I assume this has to do with the tf-idf of the first user being higher. However, in my app, a user with more posts that have a hit for the term should ideally come first.

I tried sorting by the number of posts, but this gives junk results if the user has a lot of posts. Basically, I want to sort by the number of unique post hits, so that a user who has more posts with hits rises to the top.

How would I go about doing this? Any ideas?

First of all, I agree with @karmi and @Zach that it's important to figure out what you mean by matching posts. For simplicity's sake, I will assume that a matching post has the word "dog" somewhere in it, and that we are not using the keyword analyzer, so that matching on tags and boosting becomes more interesting.

If I understood your question correctly, you want to order users by the number of relevant posts. This means that you need to search posts to find the relevant ones and then use this information in your user query. That is possible only if posts are indexed separately, which means posts have to be either child documents or nested fields.

Assuming that posts are child documents, we could prototype your data like this:

curl -XPOST 'http://localhost:9200/test-idx' -d '{
    "settings" : {
        "number_of_shards" : 1,
        "number_of_replicas" : 0
    },
    "mappings" : {
      "user" : {
        "_source" : { "enabled" : true },
        "properties" : {
            "full_name": { "type": "string" },
            "username": { "type": "string" }
        }
      },
      "post" : {
        "_parent" : {
          "type" : "user"
        },
        "properties" : {
            "title": { "type": "string"},
            "tags": { "type": "string", "boost": 10}
        }
      }
    }
}' && echo

curl -XPUT 'http://localhost:9200/test-idx/user/1' -d '{
    "full_name": "First User",
    "username": "firstymcfirsterton"
}'  && echo
curl -XPUT 'http://localhost:9200/test-idx/user/2' -d '{
    "full_name": "Second User",
    "username": "seconddude"
}'  && echo

#Posts of the first user
curl -XPUT 'http://localhost:9200/test-idx/post/1?parent=1' -d '{
    "title": "Puppies are Awesome",
    "tags": ["dog house", "dog supplies", "dogs", "doggies", "hot dogs", "dog lovers", "dog"]
}'  && echo
curl -XPUT 'http://localhost:9200/test-idx/post/2?parent=1' -d '{
    "title": "Cats are Awesome too",
    "tags": ["cat", "cat supplies", "cats"]
}'  && echo
curl -XPUT 'http://localhost:9200/test-idx/post/3?parent=1' -d '{
    "title": "One fine day with a woof and a purr",
    "tags": ["catdog", "cartoons"]
}'  && echo

#Posts of the second user
curl -XPUT 'http://localhost:9200/test-idx/post/4?parent=2' -d '{
    "title": "Dogs are the best",
    "tags": ["dog supperiority", "dog"]
}'  && echo
curl -XPUT 'http://localhost:9200/test-idx/post/5?parent=2' -d '{
    "title": "Why dogs eat?",
    "tags": ["dog diet", "canines"]
}'  && echo
curl -XPUT 'http://localhost:9200/test-idx/post/6?parent=2' -d '{
    "title": "Who let the dogs out?",
    "tags": ["dogs", "terrible music"]
}'  && echo

curl -XPOST 'http://localhost:9200/test-idx/_refresh' && echo

We can query this data using the Top Children Query. (In the case of nested fields, we could achieve similar results using the Nested Query.)

curl 'http://localhost:9200/test-idx/user/_search?pretty=true' -d '{
  "query": {
    "top_children" : {
        "type": "post",
        "query" : {
            "bool" : {
                "should": [
                    { "text" : { "title" : "dog" } },
                    { "text" : { "tags" : "dog" } }
                ]
            }
        },
        "score" : "sum"
    }
  }
}' && echo

This query will return the first user first because of the enormous boost factor that comes from the matched tags. So it might not look like what you want, but there are a couple of simple ways to fix it. First, we can reduce the boost factor for the tags field: 10 is a really large factor, especially for a field that can be repeated several times. Alternatively, we can modify the query to disregard the scores of child hits completely and use the number of top matched child documents as the score instead:

curl 'http://localhost:9200/test-idx/user/_search?pretty=true' -d '{
  "query": {
    "top_children" : {
        "type": "post",
        "query" : {
            "constant_score" : {
                "query" : {            
                    "bool" : {
                        "should": [
                            { "text" : { "title" : "dog" } },
                            { "text" : { "tags" : "dog" } }
                        ]
                    }
                }
            }
        },
        "score" : "sum"
    }
  }
}' && echo
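
To see why summing constant scores amounts to counting matching posts, here is a rough Python sketch that mimics the idea outside of ElasticSearch. It is only an illustration under stated assumptions: the `tokens` function is a crude stand-in for an analyzer (lowercase, split on non-alphanumerics, no stemming), the helper names are made up for this example, and it ignores the fact that top_children only considers the top child hits. The post data is the prototype data indexed above.

```python
import re
from collections import Counter

# Prototype posts from the curl commands above; "user" is the parent id.
POSTS = [
    {"id": 1, "user": 1, "title": "Puppies are Awesome",
     "tags": ["dog house", "dog supplies", "dogs", "doggies",
              "hot dogs", "dog lovers", "dog"]},
    {"id": 2, "user": 1, "title": "Cats are Awesome too",
     "tags": ["cat", "cat supplies", "cats"]},
    {"id": 3, "user": 1, "title": "One fine day with a woof and a purr",
     "tags": ["catdog", "cartoons"]},
    {"id": 4, "user": 2, "title": "Dogs are the best",
     "tags": ["dog supperiority", "dog"]},
    {"id": 5, "user": 2, "title": "Why dogs eat?",
     "tags": ["dog diet", "canines"]},
    {"id": 6, "user": 2, "title": "Who let the dogs out?",
     "tags": ["dogs", "terrible music"]},
]


def tokens(text):
    """Assumed analyzer stand-in: lowercase, split on non-alphanumerics."""
    return re.findall(r"[a-z0-9]+", text.lower())


def post_matches(post, term):
    """True if the term appears as a token in the title or in any tag."""
    fields = [post["title"], *post["tags"]]
    return any(term in tokens(value) for value in fields)


def rank_users(posts, term):
    """Each matching post contributes a constant 1 to its parent user;
    users are returned ordered by that sum, highest first."""
    counts = Counter(p["user"] for p in posts if post_matches(p, term))
    return counts.most_common()


print(rank_users(POSTS, "dog"))
```

With this simplified matching, the second user has two matching posts (4 and 5) and the first user has one (post 1), so the second user ranks first no matter how many times "dog" is repeated in a single post's tags. A real analyzer may tokenize or stem differently, so the exact counts in ElasticSearch can differ.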
