
ElasticSearch query optimization - Java API

I am new to ES and am searching a record set of 100k documents. Here are the settings and mapping JSON with which I indexed my data:

settings.json

{
    "index": {
        "analysis": {
            "tokenizer": {
                "ngram_tokenizer": {
                    "type": "ngram",
                    "min_gram": 3,
                    "max_gram": 10
                }
            },
            "analyzer": {
                "ngram_tokenizer_analyzer": {
                    "type": "custom",
                    "tokenizer": "ngram_tokenizer"
                }
            }
        }
    }
}

mappings.json

{
    "product": {
        "properties": {
            "name": {
                "type": "string",
                "analyzer": "ngram_tokenizer_analyzer",
                "store": true
            },
            "description": {
                "type": "string",
                "analyzer": "ngram_tokenizer_analyzer",
                "store": true
            },
            "vendorModelNumber": {
                "type": "string",
                "analyzer": "ngram_tokenizer_analyzer",
                "store": true
            },
            "brand": {
                "type": "string",
                "analyzer": "ngram_tokenizer_analyzer",
                "store": true
            },
            "specifications": {
                "type": "string",
                "analyzer": "ngram_tokenizer_analyzer",
                "store": true
            },
            "upc": {
                "type": "string",
                "analyzer": "ngram_tokenizer_analyzer",
                "store": true
            },
            "storeSkuId": {
                "type": "string",
                "analyzer": "ngram_tokenizer_analyzer",
                "store": true
            },
            "modelNumber": {
                "type": "string",
                "analyzer": "ngram_tokenizer_analyzer",
                "store": true
            }
        }
    }
}

I need to query documents based on all the fields mentioned according to some priority. Here is my query to search for all the records.

BoolQueryBuilder query = QueryBuilders.boolQuery();
int boost = 7;

for (String str : dataSplit) {
    query.should(QueryBuilders.wildcardQuery("name", "*" + str.toLowerCase() + "*").boost(boost));
}
boost--;
for (String str : dataSplit) {
    query.should(QueryBuilders.wildcardQuery("description", "*" + str.toLowerCase() + "*").boost(boost));
}
boost--;
for (String str : dataSplit) {
    query.should(QueryBuilders.wildcardQuery("modelNumber", "*" + str.toLowerCase() + "*").boost(boost));
}
boost--;
for (String str : dataSplit) {
    query.should(QueryBuilders.wildcardQuery("vendorModelNumber", "*" + str.toLowerCase() + "*").boost(boost));
}
boost--;
for (String str : dataSplit) {
    query.should(QueryBuilders.wildcardQuery("storeSkuId", "*" + str.toLowerCase() + "*").boost(boost));
}
boost--;
for (String str : dataSplit) {
    query.should(QueryBuilders.wildcardQuery("upc", "*" + str.toLowerCase() + "*").boost(boost));
}
boost--;
for (String str : dataSplit) {
    query.should(QueryBuilders.wildcardQuery("brand", "*" + str.toLowerCase() + "*").boost(boost));
}
client.prepareSearch(index).setQuery(query).setSize(200).setExplain(true).execute().actionGet();

The query works and returns the data I expect, but it takes a lot of time because I am using wildcard queries. Can someone please help me optimise this query, or point me to a query type better suited to my search? TIA.

Let me answer the simple part first: handling case sensitivity. If you define a custom analyzer, you can add token filters, which are applied to each token after the input has been processed by the tokenizer.

{
    "index": {
        "analysis": {
            "tokenizer": {
                "ngram_tokenizer": {
                    "type": "ngram",
                    "min_gram": 3,
                    "max_gram": 10
                }
            },
            "analyzer": {
                "ngram_tokenizer_analyzer": {
                    "type": "custom",
                    "tokenizer": "ngram_tokenizer",
                    "filter": [
                        "lowercase",
                        ...
                    ]
                }
            }
        }
    }
}

As you can see, there is a built-in lowercase filter, which simply converts all tokens to lower case. I strongly recommend reading the documentation; there are many more of these token filters.


Now the more complicated part: NGram tokenizers. Again, for a deeper understanding, you may want to read the docs. For your problem, the key point is that your tokenizer creates terms of length 3 to 10, which means the text

I am an example TEXT.

will produce a lot of tokens. To show just a few:

  • Size 3: "I a", " am", "am ", ..., "TEX", "EXT"
  • Size 4: "I am", " am ", "am a", ..., " TEX", "TEXT".
  • Size 10: "I am an ex", ...

You get the idea. (The lowercase token filter then lowercases these tokens.)
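To make the token explosion concrete, here is a minimal plain-Java sketch that imitates what the configured tokenizer plus the lowercase filter would emit. It is not Elasticsearch code: the class name NGramSketch and the method ngrams are made up for illustration, and it assumes the tokenizer keeps every character (spaces and punctuation included), as in the example tokens above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of the Elasticsearch API.
class NGramSketch {

    // Emits every substring of length minGram..maxGram, lowercased,
    // mimicking an n-gram tokenizer followed by a lowercase filter.
    static List<String> ngrams(String text, int minGram, int maxGram) {
        List<String> tokens = new ArrayList<>();
        for (int start = 0; start < text.length(); start++) {
            for (int len = minGram; len <= maxGram && start + len <= text.length(); len++) {
                tokens.add(text.substring(start, start + len).toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        List<String> tokens = ngrams("I am an example TEXT.", 3, 10);
        // Print the token count and the first few tokens.
        System.out.println(tokens.size() + " tokens, starting with " + tokens.subList(0, 5));
    }
}
```

Even this 21-character sentence yields well over a hundred terms, which is why n-gram indices grow quickly with min_gram 3 and max_gram 10.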

Difference between match and term queries: match queries are analyzed, while term queries are not. This means a single match query can match multiple terms. Example: you match "exam".

This would in fact match 3 terms: exa, xam and exam.

This influences the score of the matches: the more terms match, the higher the score. In some cases that is desired, in others it is not.

A term query is not analyzed, which means exam would match, but only one term (exam, of course). However, since the input is not analyzed, it is also not lowercased, so you have to do that yourself in code. Exam would never match, because if you use the lowercase token filter there is no term with capital letters in your index.

I am not sure about your use case, but I have a feeling that you could (or even should) use the term query. Be aware, though, that there are no terms longer than 10 characters in your index, because that is the max_gram of your ngram tokenizer.
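The difference can be illustrated with a toy in-memory model. This is only a sketch under assumptions: TermVsMatchSketch, analyze, termQuery and matchQuery are hypothetical names that imitate the behaviour described above with a plain set of strings; they are not the Elasticsearch Java API, and no scoring is modelled.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy model of an n-gram index; all names here are hypothetical.
class TermVsMatchSketch {
    static final int MIN_GRAM = 3;
    static final int MAX_GRAM = 10;

    // Analysis step: n-grams of length 3..10, then lowercase.
    static Set<String> analyze(String text) {
        Set<String> terms = new LinkedHashSet<>();
        for (int start = 0; start < text.length(); start++) {
            for (int len = MIN_GRAM; len <= MAX_GRAM && start + len <= text.length(); len++) {
                terms.add(text.substring(start, start + len).toLowerCase(Locale.ROOT));
            }
        }
        return terms;
    }

    // Term query: the input is NOT analyzed, so it is compared verbatim.
    static boolean termQuery(Set<String> indexedTerms, String input) {
        return indexedTerms.contains(input);
    }

    // Match query: the input IS analyzed; each resulting term found in the index is a hit.
    static List<String> matchQuery(Set<String> indexedTerms, String input) {
        List<String> hits = new ArrayList<>();
        for (String term : analyze(input)) {
            if (indexedTerms.contains(term)) {
                hits.add(term);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Set<String> indexed = analyze("I am an example TEXT.");
        System.out.println(matchQuery(indexed, "exam"));           // three matching terms
        System.out.println(termQuery(indexed, "exam"));            // exact term exists
        System.out.println(termQuery(indexed, "Exam"));            // not lowercased, no such term
        System.out.println(termQuery(indexed, "an example text")); // 15 characters, longer than max_gram
    }
}
```

In this model a term lookup needs an input that is already lowercased and at most 10 characters long, which mirrors the two caveats above.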

/ EDIT:

Something worth pointing out about match queries, and a reason why you might prefer term queries: a match query for Simple will also match the term mple from example.
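To see where such partial hits come from, the following plain-Java sketch lists the n-grams two words have in common. SharedGramsSketch and its methods are hypothetical helpers written for this illustration, assuming the same 3 to 10 gram range and lowercase filter as in the settings above.

```java
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of the Elasticsearch API.
class SharedGramsSketch {

    // All lowercased n-grams of the given length range, sorted alphabetically.
    static Set<String> ngrams(String text, int minGram, int maxGram) {
        Set<String> grams = new TreeSet<>();
        for (int start = 0; start < text.length(); start++) {
            for (int len = minGram; len <= maxGram && start + len <= text.length(); len++) {
                grams.add(text.substring(start, start + len).toLowerCase(Locale.ROOT));
            }
        }
        return grams;
    }

    // N-grams produced by both inputs; these are the terms a match query could hit.
    static Set<String> sharedGrams(String query, String indexedText, int minGram, int maxGram) {
        Set<String> shared = ngrams(query, minGram, maxGram);
        shared.retainAll(ngrams(indexedText, minGram, maxGram));
        return shared;
    }

    public static void main(String[] args) {
        // Terms that a match query for "Simple" shares with a document containing "example".
        System.out.println(sharedGrams("Simple", "example", 3, 10));
    }
}
```

Any non-empty result means the analyzed query and the document overlap on at least one term, so a match query would return the document even though the words are unrelated.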
