Optimizing an ES query with too many terms elements

Question

We are processing a dataset of billions of records. Currently all of the data is stored in Elasticsearch, and all queries and aggregations are performed with Elasticsearch.

The simplified query body is shown below. We put the device ids in terms clauses and concatenate the clauses with should to avoid the limit of 1024 values per terms clause. The total number of terms elements is up to 100,000, and the query has become very slow.

{
    "_source": {
        "excludes": [
            "raw_msg"
        ]
    },
    "query": {
        "bool": {
            "filter": {
                "bool": {
                    "must": [
                        {
                            "range": {
                                "create_ms": {
                                    "gte": 1664985600000,
                                    "lte": 1665071999999
                                }
                            }
                        }
                    ],
                    "should": [
                        {
                            "terms": {
                                "device_id": [
                                    "1328871",
                                    "1328899",
                                    "1328898",
                                    "1328934",
                                    "1328919",
                                    "1328976",
                                    "1328977",
                                    "1328879",
                                    "1328910",
                                    "1328902",
                                    ...       # more values; since terms does not support more than 1024 values, we concatenate all of them with should
                                ]
                            }
                        },
                        {
                            "terms": {
                                "device_id": [
                                    "1428871",
                                    "1428899",
                                    "1428898",
                                    "1428934",
                                    "1428919",
                                    "1428976",
                                    "1428977",
                                    "1428879",
                                    "1428910",
                                    "1428902",
                                    ...
                                ]
                            }
                        },
                        ...  # concatenate more terms clauses until all of the 100,000 values are included
                    ],
                    "minimum_should_match": 1
                }
            }
        }
    },
    "aggs": {
        "create_ms": {
            "date_histogram": {
                "field": "create_ms",
                "interval": "hour"
            }
        }
    },
    "size": 0
}
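
For readers who want to reproduce the setup, the chunking described in the question can be sketched in Python roughly as follows. This only builds the request body as a plain dict and does not talk to a cluster. The helper names `chunked_terms_clauses` and `build_search_body` are made up for illustration, the chunk size of 1024 is taken from the question, and placing the conditions under a `bool` / `filter` wrapper is an assumption about how the full (unsimplified) query is shaped.

```python
from typing import Any, Dict, List


def chunked_terms_clauses(
    field: str, values: List[str], chunk_size: int = 1024
) -> List[Dict[str, Any]]:
    """Split `values` into terms clauses holding at most `chunk_size` values each."""
    return [
        {"terms": {field: values[i:i + chunk_size]}}
        for i in range(0, len(values), chunk_size)
    ]


def build_search_body(device_ids: List[str], gte_ms: int, lte_ms: int) -> Dict[str, Any]:
    """Build a search body shaped like the one in the question (hypothetical helper)."""
    return {
        "_source": {"excludes": ["raw_msg"]},
        "query": {
            "bool": {
                # Filter context: no scoring is needed for a pure aggregation.
                "filter": [
                    {"range": {"create_ms": {"gte": gte_ms, "lte": lte_ms}}},
                    {
                        "bool": {
                            "should": chunked_terms_clauses("device_id", device_ids),
                            "minimum_should_match": 1,
                        }
                    },
                ]
            }
        },
        "aggs": {
            "create_ms": {
                "date_histogram": {"field": "create_ms", "interval": "hour"}
            }
        },
        "size": 0,
    }
```

With 100,000 ids this produces 98 terms clauses, which gives a sense of how large the request body becomes before Elasticsearch even starts executing it.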

My question: is there a way to optimize this case? Or is there a better choice for this kind of search?

Real-time or near-real-time results are a must; another engine is acceptable.

Simplified schema of the data:

    "id" : {
        "type" : "long"
    },
    "content" : {
        "type" : "text"
    },
    "device_id" : {
        "type" : "keyword"
    },
    "create_ms" : {
        "type" : "date"
    },
    ... # more fields

Answer 1

You can use the terms query with a terms lookup to specify a larger list of values.

Store your ids in a dedicated document with an id such as 'device_ids', then reference that document from the query:

"should": [
  {
    "terms": {
      "device_id": {
        "index": "your-index-name",
        "id": "device_ids",
        "path": "field-name"
      }
    }
  }
]
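
To make the two steps concrete, here is a minimal Python sketch that builds both the lookup document and a search body that references it. It only constructs dicts and sends nothing. The index name `device-lists` and the field name `ids` are placeholders chosen for illustration; only the document id `device_ids` comes from the answer. The lookup document would typically be indexed with something like `PUT /device-lists/_doc/device_ids`, and terms lookup needs `_source` enabled on that index. Also note that, per the Elasticsearch documentation, a terms query is limited to 65,536 terms by default, including terms fetched through a lookup, so a list of 100,000 ids would probably require raising the `index.max_terms_count` setting.

```python
from typing import Any, Dict, List

LOOKUP_INDEX = "device-lists"  # hypothetical index that holds the id list
LOOKUP_DOC_ID = "device_ids"   # document id suggested in the answer
LOOKUP_PATH = "ids"            # hypothetical field that stores the ids


def lookup_document(device_ids: List[str]) -> Dict[str, Any]:
    """Body of the document that stores all device ids in one field."""
    return {LOOKUP_PATH: device_ids}


def build_lookup_search_body(gte_ms: int, lte_ms: int) -> Dict[str, Any]:
    """Search body that fetches the ids via terms lookup instead of inlining them."""
    return {
        "_source": {"excludes": ["raw_msg"]},
        "query": {
            "bool": {
                "filter": [
                    {"range": {"create_ms": {"gte": gte_ms, "lte": lte_ms}}},
                    {
                        "terms": {
                            "device_id": {
                                "index": LOOKUP_INDEX,
                                "id": LOOKUP_DOC_ID,
                                "path": LOOKUP_PATH,
                            }
                        }
                    },
                ]
            }
        },
        "aggs": {
            "create_ms": {
                "date_histogram": {"field": "create_ms", "interval": "hour"}
            }
        },
        "size": 0,
    }
```

The request body stays small no matter how many ids are stored, since the list lives in the lookup document rather than in the query. Whether this actually speeds up execution for a list this large is something to measure on your own data.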

Solution 1
0 2022-12-10 17:43:05