How to quickly aggregate a large amount of data
I need to aggregate all the keywords in the news for a period of time, for example:
{
"news_ID":"123456",
"news_content":"Apple pencil",
"keywords": {
[
{
"word" : "Apple",
"score" : 0.0653220043
},
{
"word" : "pencil",
"score" : 0.7096893191
}
]
},
"publish_time":"2020-01-03"
}
I want to know how many times "apple" appeared between 2020-01 and 2020-02, but there are too many keywords...
Could you please advise me on how I should approach this requirement as per best practices?
Indexing a sample doc:
PUT tester/_doc/1
{
"news_ID":"123456",
"news_content":"Apple pencil",
"keywords":[
"apple",
"pencil"
],
"publish_time":"2020-01-03"
}
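Note that this sample flattens keywords down to an array of plain strings. If you index it without an explicit mapping, Elasticsearch's default dynamic mapping creates keywords as a text field with a .keyword sub-field, which is what the aggregations below rely on. If you prefer to be explicit, you could create the index with a mapping up front (a sketch; the index name tester matches the sample):

PUT tester
{
  "mappings": {
    "properties": {
      "news_ID":      { "type": "keyword" },
      "news_content": { "type": "text" },
      "keywords":     { "type": "keyword" },
      "publish_time": { "type": "date" }
    }
  }
}

With this explicit mapping the terms aggregations would target keywords directly; the queries below assume the default dynamic mapping and therefore use keywords.keyword.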
Using a terms aggregation with a range filter at the top level:
GET tester/_search
{
"size": 0,
"query": {
"range": {
"publish_time": {
"gte": "2020-01-01",
"lt": "2020-02-01"
}
}
},
"aggs": {
"by_keywords": {
"terms": {
"field": "keywords.keyword"
}
}
}
}
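If you only care about a handful of specific words such as "apple", the terms aggregation also accepts an include parameter to restrict which buckets are returned, so you don't have to scan past all the other keywords:

GET tester/_search
{
  "size": 0,
  "query": {
    "range": {
      "publish_time": {
        "gte": "2020-01-01",
        "lt": "2020-02-01"
      }
    }
  },
  "aggs": {
    "by_keywords": {
      "terms": {
        "field": "keywords.keyword",
        "include": ["apple"]
      }
    }
  }
}

include takes either an array of exact values, as here, or a regular expression string.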
You can also use filter aggregations to aggregate over multiple monthly buckets:
GET tester/_search
{
"size": 0,
"aggs": {
"2020-01_2020-02": {
"filter": {
"range": {
"publish_time": {
"gte": "2020-01-01",
"lt": "2020-02-01"
}
}
},
"aggs": {
"by_keywords": {
"terms": {
"field": "keywords.keyword"
}
}
}
},
"2020-02_2020-03": {
"filter": {
"range": {
"publish_time": {
"gte": "2020-02-01",
"lt": "2020-03-01"
}
}
},
"aggs": {
"by_keywords": {
"terms": {
"field": "keywords.keyword"
}
}
}
}
}
}
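Instead of spelling out one filter aggregation per month, a date_histogram with a terms sub-aggregation produces the same monthly breakdown with less repetition (a sketch; calendar_interval is the Elasticsearch 7.x parameter name, older versions use interval):

GET tester/_search
{
  "size": 0,
  "aggs": {
    "per_month": {
      "date_histogram": {
        "field": "publish_time",
        "calendar_interval": "month"
      },
      "aggs": {
        "by_keywords": {
          "terms": {
            "field": "keywords.keyword"
          }
        }
      }
    }
  }
}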