How to count fields with the same value in elastic.js (Elasticsearch)?

I have a list of communities, and I need to create an aggregation query that counts all data entries that have the same title.

[
  {
    "_id": "56161cb3cbdad2e3b437fdc3",
    "_type": "Comunity",
    "name": "public",
    "data": [
      {
        "title": "sonder",
        "creationDate": "2015-08-22T03:43:28 -03:00",
        "quantity": 0
      },
      {
        "title": "vule",
        "creationDate": "2014-05-17T12:35:01 -03:00",
        "quantity": 0
      },
      {
        "title": "omer",
        "creationDate": "2015-01-31T04:54:19 -02:00",
        "quantity": 3
      },
      {
        "title": "sonder",
        "creationDate": "2014-05-22T05:09:36 -03:00",
        "quantity": 3
      }
    ]
  },
  {
    "_id": "56161cb3dae30517fc133cd9",
    "_type": "Comunity",
    "name": "static",
    "data": [
      {
        "title": "vule",
        "creationDate": "2014-07-01T06:32:06 -03:00",
        "quantity": 5
      },
      {
        "title": "vule",
        "creationDate": "2014-01-10T12:40:28 -02:00",
        "quantity": 1
      },
      {
        "title": "vule",
        "creationDate": "2014-01-09T09:33:11 -02:00",
        "quantity": 3
      }
    ]
  },
  {
    "_id": "56161cb32f62b522355ca3c8",
    "_type": "Comunity",
    "name": "public",
    "data": [
      {
        "title": "vule",
        "creationDate": "2014-02-03T09:55:28 -02:00",
        "quantity": 2
      },
      {
        "title": "vule",
        "creationDate": "2015-01-23T09:14:22 -02:00",
        "quantity": 0
      }
    ]
  }
]

So the desired result should be:

[
  {
    "title": "vule",
    "total": 6
  },
  {
    "title": "omer",
    "total": 1
  },
  {
    "title": "sonder",
    "total": 2
  }
]

I wrote some aggregation queries, but they still don't work. How can I get the desired result?

PS: I tried to create a nested aggregation:

ejs.Request().size(0).agg(
    ejs.NestedAggregation('comunities')
        .path('data')
        .agg(
            ejs.FilterAggregation('sonder')
                .filter(ejs.TermsFilter('data.title', 'sonder'))
                .agg(
                    ejs.ValueCountAggregation('counts')
                        .field('data.title')
                )
        )
);

You need to use a terms aggregation.

Depending on your mapping, there are two ways of doing that:

1. Your data field is stored as a subdocument

You need to run a simple terms aggregation, which in raw JSON looks like this:

POST /test/test/_search
{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "aggs": {
    "Grouping": {
      "terms": {
        "field": "data.title",
        "size": 0
      }
    }
  }
}

2. Your data field is stored as a nested document

You have to wrap the terms aggregation in a nested aggregation:

POST /test/test/_search
{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "aggs": {
    "Nest": {
      "nested": {
        "path": "data"
      },
      "aggs": {
        "Grouping": {
          "terms": {
            "field": "data.title",
            "size": 0
          }
        }
      }
    }
  }
}
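
The two mappings do not count the same thing. With an object mapping, the titles of one community are flattened into a single multi-valued field, so the terms aggregation counts parent documents. With a nested mapping, every data entry is indexed as its own hidden document and is counted separately. The sketch below is a rough simulation in plain JavaScript, not Elasticsearch itself; the helper names are hypothetical and the sample data is reduced to the titles from the question.

```javascript
// Titles from the question's three community documents.
const communities = [
  { name: "public", data: ["sonder", "vule", "omer", "sonder"] },
  { name: "static", data: ["vule", "vule", "vule"] },
  { name: "public", data: ["vule", "vule"] },
];

// Object mapping: each parent document counts at most once per title.
function countParentDocs(docs) {
  const counts = {};
  for (const doc of docs) {
    for (const title of new Set(doc.data)) {
      counts[title] = (counts[title] || 0) + 1;
    }
  }
  return counts;
}

// Nested mapping: every entry counts as a separate (hidden) document.
function countNestedEntries(docs) {
  const counts = {};
  for (const doc of docs) {
    for (const title of doc.data) {
      counts[title] = (counts[title] || 0) + 1;
    }
  }
  return counts;
}

console.log(countParentDocs(communities));    // vule: 3, sonder: 1, omer: 1
console.log(countNestedEntries(communities)); // vule: 6, sonder: 2, omer: 1
```

Only the nested variant yields the per-entry total of 6 for "vule" that the question asks for.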

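The nested query will output this (the plain terms query omits the Nest wrapper and, because it counts parent documents rather than individual data entries, reports lower counts for titles that repeat within a document):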

{
   "took": 125,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 3,
      "max_score": 0,
      "hits": []
   },
   "aggregations": {
      "Nest": {
         "doc_count": 9,
         "Grouping": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
               {
                  "key": "vule",
                  "doc_count": 6            -- The Total count you're looking for
               },
               {
                  "key": "sonder",
                  "doc_count": 2
               },
               {
                  "key": "omer",
                  "doc_count": 1
               }
            ]
         }
      }
   }
}
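
To get from this response to the list of title/total pairs asked for in the question, the buckets only need to be renamed. A minimal sketch in plain JavaScript follows; `toTitleTotals` is a hypothetical helper, and the response object is a trimmed copy of the output above.

```javascript
// Trimmed copy of the aggregation response.
const response = {
  aggregations: {
    Nest: {
      doc_count: 9,
      Grouping: {
        buckets: [
          { key: "vule", doc_count: 6 },
          { key: "sonder", doc_count: 2 },
          { key: "omer", doc_count: 1 },
        ],
      },
    },
  },
};

// Hypothetical helper: maps terms buckets to { title, total } pairs.
// Handles both the nested variant (wrapped in "Nest") and the plain one.
function toTitleTotals(res) {
  const grouping = res.aggregations.Nest
    ? res.aggregations.Nest.Grouping
    : res.aggregations.Grouping;
  return grouping.buckets.map((b) => ({ title: b.key, total: b.doc_count }));
}

const totals = toTitleTotals(response);
console.log(totals);
```

The aggregation names "Nest" and "Grouping" are the ones chosen in the queries above; if you rename them in the request, the lookup has to change accordingly.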

Unfortunately, this is just a raw query, but I imagine it can be translated into elastic.js quite easily.

On top of that, if you're going to run aggregations, don't forget to map the fields you aggregate on as not_analyzed; otherwise the aggregation will count individual tokens rather than whole values, as described in the documentation.
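
To illustrate why this matters, here is a rough simulation in plain JavaScript. The `roughTokens` function is only a crude stand-in for an analyzer (lowercase, split on non-alphanumeric characters), and the multi-word titles are hypothetical, since the titles in the question are all single words.

```javascript
// Crude stand-in for an analyzer: lowercase, split on non-alphanumerics.
function roughTokens(value) {
  return value.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
}

// Counts terms per value; each value plays the role of one document.
// analyzed = true counts individual tokens, false counts whole values.
function countTerms(values, analyzed) {
  const counts = {};
  for (const value of values) {
    const terms = analyzed ? new Set(roughTokens(value)) : [value];
    for (const term of terms) {
      counts[term] = (counts[term] || 0) + 1;
    }
  }
  return counts;
}

const titles = ["New York", "New Jersey", "New York"]; // hypothetical titles

console.log(countTerms(titles, false)); // { 'New York': 2, 'New Jersey': 1 }
console.log(countTerms(titles, true));  // { new: 3, york: 2, jersey: 1 }
```

With the analyzed field, the buckets no longer correspond to titles at all, which is why the mapping below keeps data.title as not_analyzed.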

I, myself, would store these documents as nested ones.

Example:

Mappings:

PUT /test
{
  "mappings": {
    "test": {
      "properties": {
        "name": {
          "type": "string"
        },
        "data": {
          "type": "nested",
          "properties": {
            "title": {
              "type": "string",
              "index": "not_analyzed",
              "fields": {
                "stemmed": {
                  "type": "string",
                  "analyzer": "standard"
                }
              }
            },
            "creationDate": {
              "type": "date",
              "format": "dateOptionalTime"
            },
            "quantity": {
              "type": "integer"
            }
          }
        }
      }
    }
  }
}

Test data:

PUT /test/test/56161cb3cbdad2e3b437fdc3
{
  "name": "public",
  "data": [
    {
      "title": "sonder",
      "creationDate": "2015-08-22T03:43:28",
      "quantity": 0
    },
    {
      "title": "vule",
      "creationDate": "2014-05-17T12:35:01",
      "quantity": 0
    },
    {
      "title": "omer",
      "creationDate": "2015-01-31T04:54:19",
      "quantity": 3
    },
    {
      "title": "sonder",
      "creationDate": "2014-05-22T05:09:36",
      "quantity": 3
    }
  ]
}

PUT /test/test/56161cb3dae30517fc133cd9
{
  "name": "static",
  "data": [
    {
      "title": "vule",
      "creationDate": "2014-07-01T06:32:06",
      "quantity": 5
    },
    {
      "title": "vule",
      "creationDate": "2014-01-10T12:40:28",
      "quantity": 1
    },
    {
      "title": "vule",
      "creationDate": "2014-01-09T09:33:11",
      "quantity": 3
    }
  ]
}

PUT /test/test/56161cb32f62b522355ca3c8
{
  "name": "public",
  "data": [
    {
      "title": "vule",
      "creationDate": "2014-02-03T09:55:28",
      "quantity": 2
    },
    {
      "title": "vule",
      "creationDate": "2015-01-23T09:14:22",
      "quantity": 0
    }
  ]
}

Actual query:

POST /test/test/_search
{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "aggs": {
    "Nest": {
      "nested": {
        "path": "data"
      },
      "aggs": {
        "Grouping": {
          "terms": {
            "field": "data.title",
            "size": 0
          }
        }
      }
    }
  }
}

PS: "size": 0 means that I'm letting Elasticsearch return all possible terms instead of limiting its output, as described in the documentation:

The size parameter can be set to define how many term buckets should be returned out of the overall terms list. By default, the node coordinating the search process will request each shard to provide its own top size term buckets and once all shards respond, it will reduce the results to the final list that will then be returned to the client. This means that if the number of unique terms is greater than size, the returned list is slightly off and not accurate (it could be that the term counts are slightly off and it could even be that a term that should have been in the top size buckets was not returned). If set to 0, the size will be set to Integer.MAX_VALUE.
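
The inaccuracy described in that passage can be reproduced with a small simulation. This is a sketch in plain JavaScript under a simplifying assumption: each shard returns exactly its top `size` terms, as the quoted text says, whereas a real cluster may ask each shard for more candidates (the shard_size setting). The shard contents and the function names are hypothetical.

```javascript
// Hypothetical per-shard term counts; "c" is the most frequent term overall.
const shards = [
  { a: 10, b: 9, c: 8 },
  { d: 10, e: 9, c: 8 },
];

// Sorts by count descending, then key ascending, and keeps the first `size`.
// size = 0 keeps everything, mirroring the behaviour quoted above.
function topTerms(counts, size) {
  const sorted = Object.entries(counts).sort(
    ([k1, c1], [k2, c2]) => c2 - c1 || k1.localeCompare(k2)
  );
  return size === 0 ? sorted : sorted.slice(0, size);
}

// Each shard reports its own top terms; the coordinator sums and re-ranks.
function simulateTermsAgg(shardCounts, size) {
  const merged = {};
  for (const counts of shardCounts) {
    for (const [key, count] of topTerms(counts, size)) {
      merged[key] = (merged[key] || 0) + count;
    }
  }
  return topTerms(merged, size).map(([key, doc_count]) => ({ key, doc_count }));
}

console.log(simulateTermsAgg(shards, 2)); // a: 10, d: 10 ("c" is missing)
console.log(simulateTermsAgg(shards, 0)); // c: 16, a: 10, d: 10, b: 9, e: 9
```

With size 2, the term "c" never makes it into any shard's top list, so it is dropped even though it has the highest total. With size 0, every shard reports every term and the totals are exact.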
