How to limit elasticsearch to a list of documents each identified by a unique keyword

Question

I have an Elasticsearch document repository with ~15M documents.

Each document has a unique 11-character string field (it comes from a MongoDB database). This field is indexed as keyword.

I'm using C#.

When I run a search, I want to be able to limit the search to a set of documents that I specify (via some list of the unique field ids).

My query text uses bool with must to supply a filter for the unique identifiers and additional clauses to actually search the documents. See example below.

To search a large number of documents, I generate multiple query strings and run them concurrently. Each query handles up to 64K unique ids (determined by the limit on terms).

In this case, I have 262,144 documents to search (the list comes, at run time, from a separate MongoDB query). So my code generates 4 query strings (see example below).

I run them concurrently.
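
The batching described above can be sketched roughly as follows. This is a minimal illustration in Python rather than the actual C# code; the helper names (chunk, build_query, build_queries) are hypothetical, and the query body is a flattened form of the example query further down that should match the same documents. The 65,536 batch size assumes the default index.max_terms_count setting.

```python
import json

# Assumption: the default index.max_terms_count (65,536 terms per terms query).
MAX_TERMS = 65536


def chunk(ids, size=MAX_TERMS):
    """Yield consecutive slices of ids, each at most `size` long."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def build_query(phrases, talent_ids):
    """Build one search body: phrase clauses plus a terms filter on talentId."""
    return {
        "_source": "talentId",
        "from": 0,
        "size": 10000,
        "query": {
            "bool": {
                # Every phrase must match the freeText field.
                "must": [{"match_phrase": {"freeText": p}} for p in phrases],
                # Restrict the search to this batch of document ids.
                "filter": [{"terms": {"talentId": talent_ids}}],
            }
        },
    }


def build_queries(phrases, all_ids):
    """Return one JSON query string per batch of ids."""
    return [json.dumps(build_query(phrases, batch)) for batch in chunk(all_ids)]
```

With 262,144 ids this produces 4 query strings, each carrying 65,536 ids, which are then sent to the cluster concurrently.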

Unfortunately, this search takes over 22 seconds to complete.

When I run the same search but drop the terms node (so it searches all the documents), a single such query completes the search in 1.8 seconds.

An incredible difference.

So my question: Is there an efficient way to specify which documents are to be searched (when each document has a unique self-identifying keyword field)?

I want to be able to specify up to a few 100K of such unique ids.

Here's an example of my search specifying unique document identifiers:

{
    "_source" : "talentId",
    "from" : 0,
    "size" : 10000,
    "query" : {
        "bool" : {
            "must" : [
                {
                    "bool" : {
                        "must" : [  {  "match_phrase" : { "freeText" : "java" } },
                                    {  "match_phrase" : { "freeText" : "unix" } },
                                    {  "match_phrase" : { "freeText" : "c#" } },
                                    {  "match_phrase" : { "freeText" : "cnn" } } ]
                    }
                },
                {
                    "bool" : {
                        "filter" : {
                            "bool" : {
                                "should" : [
                                    {
                                        "terms" : {
                                            "talentId" : [ "goGSXMWE1Qg",  "GvTDYS6F1Qg",
                                                           "-qa_N-aC1Qg", "iu299LCC1Qg",
                                                           "0p7SpteI1Qg",  ... 4,995 more ...  ]
                                        }
                                    }
                                ]
                            }
                        }
                    }
                }
            ]
        }
    }
}

Answer 1

@jarmod is right.

But if you don't want to completely redo your architecture, is there some other single talent-related shared field you could query instead of thousands of talentIds? It could be one more simple match_phrase query.
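
As a rough sketch of that idea (in Python, only to show the shape of the request body): the field name talentGroup and the value "batch-42" are purely hypothetical placeholders for whatever shared field your documents might carry. Nothing in the question says such a field exists.

```python
import json


def build_shared_field_query(phrases, shared_field, shared_value):
    """Build a search body that narrows by one shared field instead of a long id list."""
    must = [{"match_phrase": {"freeText": p}} for p in phrases]
    # One extra match_phrase clause replaces the large terms filter.
    must.append({"match_phrase": {shared_field: shared_value}})
    return {
        "_source": "talentId",
        "from": 0,
        "size": 10000,
        "query": {"bool": {"must": must}},
    }


# Hypothetical field name and value, for illustration only.
body = build_shared_field_query(["java", "unix"], "talentGroup", "batch-42")
print(json.dumps(body, indent=2))
```

The request then stays small regardless of how many documents belong to the group, since the set is selected by one indexed value rather than enumerated id by id.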
