
Optimize ES query with too many terms elements

We are processing a dataset of billions of records. Currently all of the data is stored in Elasticsearch, and all queries and aggregations are performed with Elasticsearch.

The simplified query body is shown below. We put the device ids into terms clauses and then combine them with should to work around the 1024-value limit on terms; the total number of values across all terms clauses is up to 100,000, and the query has become very slow.

{
"_source": {
    "excludes": [
        "raw_msg"
    ]
},
"query": {
        "filter": {
            "bool": {
                "must": [
                    {
                        "range": {
                            "create_ms": {
                                "gte": 1664985600000,
                                "lte": 1665071999999
                            }
                        }
                    }
                ],
                "should": [
                    {
                        "terms": {
                            "device_id": [
                                "1328871",
                                "1328899",
                                "1328898",
                                "1328934",
                                "1328919",
                                "1328976",
                                "1328977",
                                "1328879",
                                "1328910",
                                "1328902",
                                ...       # more values; since terms does not support more than 1024 values, we concatenate the terms clauses with should
                            ]
                        }
                    },
                    {
                        "terms": {
                            "device_id": [
                                "1428871",
                                "1428899",
                                "1428898",
                                "1428934",
                                "1428919",
                                "1428976",
                                "1428977",
                                "1428879",
                                "1428910",
                                "1428902",
                                ...
                            ]
                        }
                    },
                    ...  # concatenate more terms clauses until all of the 100,000 values are included
                ],
                "minimum_should_match": 1
        }
},
"aggs": {
    "create_ms": {
        "date_histogram": {
            "field": "create_ms",
            "interval": "hour",
        }
    }
},
"size": 0}

My question is: is there a way to optimize this case? Or is there a better tool for this kind of search?

Real-time or near real-time is a must; another engine would be acceptable.

Simplified schema of the data:

    "id" : {
        "type" : "long"
    },
    "content" : {
        "type" : "text"
    },
    "device_id" : {
        "type" : "keyword"
    },
    "create_ms" : {
        "type" : "date"
    },
    ... # more fields

You can use the terms query with a terms lookup to specify a larger list of values (see the Elasticsearch documentation on terms lookup).

Store your ids in a dedicated document with an id such as 'device_ids', then reference that document from the query.
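For example, a minimal sketch of indexing the lookup document (this reuses the your-index-name index and the field-name field from the snippet below; adjust those names to your setup, and the id values are just the ones from the question):

PUT your-index-name/_doc/device_ids
{
  "field-name": [
    "1328871",
    "1328899",
    "1328898"
  ]
}

The should clause can then point at that document: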

"should": [
  {
    "terms": {
      "device_id": {
        "index": "your-index-name",
        "id": "device_ids",
        "path": "field-name"
      }
    }
  }
]
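With this approach the 100,000 ids live in a single lookup document instead of the query body, so the request stays small and updating the list only means re-indexing that one document. Note that the number of terms pulled in by the lookup is still subject to the index.max_terms_count setting (65,536 by default), so you may need to raise it to cover 100,000 ids.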
