How to limit an Elasticsearch search to a list of documents, each identified by a unique keyword

I have an Elasticsearch document repository with ~15M documents.

Each document has a unique 11-character string field (it comes from a MongoDB database). This field is indexed as keyword.

I'm using C#.

When I run a search, I want to be able to limit the search to a set of documents that I specify (via some list of the unique field ids).

My query uses a bool query whose must clause combines a filter on the unique identifiers with the clauses that actually search the documents. See the example below.

To search a large number of documents, I generate multiple query strings and run them concurrently. Each query handles up to 64K unique ids (determined by the limit on the number of values in a terms query).

In this case, I have 262,144 documents to search (the list comes, at run time, from a separate MongoDB query). So my code generates 4 query strings (see the example below).

I run them concurrently.
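The batching described above can be sketched roughly as follows. This is a minimal illustration in Python rather than the C# used in the question; `chunk_ids`, `terms_clauses`, and the 65,536 constant are assumptions chosen to match the "64K" figure, not the actual code behind the question.

```python
from typing import Dict, List

# Assumed per-query limit, matching the "64K" figure in the question.
MAX_TERMS_PER_QUERY = 65_536


def chunk_ids(ids: List[str], size: int = MAX_TERMS_PER_QUERY) -> List[List[str]]:
    """Split the id list into consecutive batches of at most `size` ids."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [ids[i:i + size] for i in range(0, len(ids), size)]


def terms_clauses(ids: List[str], field: str = "talentId",
                  size: int = MAX_TERMS_PER_QUERY) -> List[Dict]:
    """Build one `terms` clause per batch; each clause goes into its own query."""
    return [{"terms": {field: batch}} for batch in chunk_ids(ids, size)]


if __name__ == "__main__":
    # Placeholder 11-character ids standing in for the real talentId values.
    ids = [f"id{i:09d}" for i in range(262_144)]
    print(len(terms_clauses(ids)))  # 4
```

With 262,144 ids and a batch size of 65,536, this yields the 4 queries mentioned above.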

Unfortunately, this search takes over 22 seconds to complete.

When I run the same search but drop the terms node (so it searches all the documents), a single such query completes the search in 1.8 seconds.

An incredible difference.

So my question: Is there an efficient way to specify which documents are to be searched (when each document has a unique self-identifying keyword field)?

I want to be able to specify up to a few hundred thousand such unique ids.

Here's an example of my search specifying unique document identifiers:

{
    "_source" : "talentId",
    "from" : 0,
    "size" : 10000,
    "query" : {
        "bool" : {
            "must" : [
                {
                    "bool" : {
                        "must" : [  {  "match_phrase" : { "freeText" : "java" } },
                                    {  "match_phrase" : { "freeText" : "unix" } },
                                    {  "match_phrase" : { "freeText" : "c#" } },
                                    {  "match_phrase" : { "freeText" : "cnn" } } ]
                    }
                },
                {
                    "bool" : {
                        "filter" : {
                            "bool" : {
                                "should" : [
                                    {
                                        "terms" : {
                                            "talentId" : [ "goGSXMWE1Qg",  "GvTDYS6F1Qg",
                                                           "-qa_N-aC1Qg", "iu299LCC1Qg",
                                                           "0p7SpteI1Qg",  ... 4,995 more ...  ]
                                        }
                                    }
                                ]
                            }
                        }
                    }
                }
            ]
        }
    }
} 
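The query above wraps the terms clause in several bool layers (must, then filter, then should). As a hedged sketch, the same request body can be built with the terms clause placed directly in the outer bool's filter. The Python helper `build_query` below is hypothetical and only illustrates the flatter structure; it matches the same documents, but it is not claimed to remove the slowdown described in the question.

```python
import json
from typing import Dict, List


def build_query(phrases: List[str], talent_ids: List[str]) -> Dict:
    """Build a search body: phrases are scored in `must`, ids restrict in `filter`."""
    return {
        "_source": "talentId",
        "from": 0,
        "size": 10000,
        "query": {
            "bool": {
                # Each phrase must match the freeText field.
                "must": [{"match_phrase": {"freeText": p}} for p in phrases],
                # Filter context: restricts the result set without scoring.
                "filter": [{"terms": {"talentId": talent_ids}}],
            }
        },
    }


if __name__ == "__main__":
    body = build_query(["java", "unix", "c#", "cnn"],
                       ["goGSXMWE1Qg", "GvTDYS6F1Qg"])
    print(json.dumps(body, indent=4))
```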

@jarmod is right.

But if you don't want to completely redo your architecture, is there some other single talent-related shared field you could query instead of thousands of talentId values? It could be one more simple match_phrase query.
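As a rough sketch of that suggestion, assuming such a shared field exists: the helper below adds one match_phrase clause to an existing bool query instead of a long terms list. The function name `with_shared_field` and the field name `talentPool` and its value are hypothetical; nothing in the question confirms that a suitable field is available.

```python
import copy
from typing import Any, Dict


def with_shared_field(query: Dict[str, Any], field: str, value: str) -> Dict[str, Any]:
    """Return a copy of `query` with one extra match_phrase clause in bool.must."""
    new_query = copy.deepcopy(query)  # leave the caller's query untouched
    bool_node = new_query["query"]["bool"]
    must = bool_node.get("must", [])
    if not isinstance(must, list):
        # A single clause may be given as an object; normalise it to a list.
        must = [must]
    must.append({"match_phrase": {field: value}})
    bool_node["must"] = must
    return new_query


if __name__ == "__main__":
    base = {"query": {"bool": {"must": [{"match_phrase": {"freeText": "java"}}]}}}
    # "talentPool" and "pool-42" are placeholder names for illustration only.
    print(with_shared_field(base, "talentPool", "pool-42"))
```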
