
ElasticSearch query optimization - Java API

I am a newbie to ES and am searching over a record set of 100k documents. Here are the settings and mapping JSON with which I have indexed my data:

settings.json

{
    "index": {
        "analysis": {
            "tokenizer": {
                "ngram_tokenizer": {
                    "type": "ngram",
                    "min_gram": 3,
                    "max_gram": 10
                }
            },
            "analyzer": {
                "ngram_tokenizer_analyzer": {
                    "type": "custom",
                    "tokenizer": "ngram_tokenizer"
                }
            }
        }
    }
}

mappings.json

{
    "product": {
        "properties": {
            "name": {
                "type": "string",
                "analyzer": "ngram_tokenizer_analyzer",
                "store": true
            },
            "description": {
                "type": "string",
                "analyzer": "ngram_tokenizer_analyzer",
                "store": true
            },
            "vendorModelNumber": {
                "type": "string",
                "analyzer": "ngram_tokenizer_analyzer",
                "store": true
            },
            "brand": {
                "type": "string",
                "analyzer": "ngram_tokenizer_analyzer",
                "store": true
            },
            "specifications": {
                "type": "string",
                "analyzer": "ngram_tokenizer_analyzer",
                "store": true
            },
            "upc": {
                "type": "string",
                "analyzer": "ngram_tokenizer_analyzer",
                "store": true
            },
            "storeSkuId": {
                "type": "string",
                "analyzer": "ngram_tokenizer_analyzer",
                "store": true
            },
            "modelNumber": {
                "type": "string",
                "analyzer": "ngram_tokenizer_analyzer",
                "store": true
            }
        }
    }
}
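For reference, the two files above would be combined into a single index-creation call, roughly like this (the index name "products" is an assumption; with "string" field types this matches pre-5.x Elasticsearch):

```json
PUT /products
{
    "settings": {
        "index": { ... }
    },
    "mappings": {
        "product": { ... }
    }
}
```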

I need to query documents against all of the fields above, weighted by priority. Here is the query I use to search the records:

BoolQueryBuilder query = QueryBuilders.boolQuery();
int boost = 7;

for (String str : dataSplit) {
    query.should(QueryBuilders.wildcardQuery("name", "*" + str.toLowerCase() + "*").boost(boost));
}
boost--;
for (String str : dataSplit) {
    query.should(QueryBuilders.wildcardQuery("description", "*" + str.toLowerCase() + "*").boost(boost));
}
boost--;
for (String str : dataSplit) {
    query.should(QueryBuilders.wildcardQuery("modelNumber", "*" + str.toLowerCase() + "*").boost(boost));
}
boost--;
for (String str : dataSplit) {
    query.should(QueryBuilders.wildcardQuery("vendorModelNumber", "*" + str.toLowerCase() + "*").boost(boost));
}
boost--;
for (String str : dataSplit) {
    query.should(QueryBuilders.wildcardQuery("storeSkuId", "*" + str.toLowerCase() + "*").boost(boost));
}
boost--;
for (String str : dataSplit) {
    query.should(QueryBuilders.wildcardQuery("upc", "*" + str.toLowerCase() + "*").boost(boost));
}
boost--;
for (String str : dataSplit) {
    query.should(QueryBuilders.wildcardQuery("brand", "*" + str.toLowerCase() + "*").boost(boost));
}
client.prepareSearch(index).setQuery(query).setSize(200).setExplain(true).execute().actionGet();

The query does find the right data and works fine, but my issue is that it takes a lot of time, since I am using wildcard queries. Can someone please help me optimise this query, or guide me to a better-suited query for my search? TIA.

First off, let me answer the simple question: handling case sensitivity. If you define a custom analyzer, you can add different token filters, which are applied to each token after the input has been processed by the tokenizer.

{
    "index": {
        "analysis": {
            "tokenizer": {
                "ngram_tokenizer": {
                    "type": "ngram",
                    "min_gram": 3,
                    "max_gram": 10
                }
            },
            "analyzer": {
                "ngram_tokenizer_analyzer": {
                    "type": "custom",
                    "tokenizer": "ngram_tokenizer",
                    "filter": [
                        "lowercase",
                        ...
                    ]
                }
            }
        }
    }
}

As you can see, there is a built-in lowercase filter, which will simply transform all tokens to lower case. I strongly recommend referring to the documentation; there are a lot of these token filters.
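You can verify the effect by running the analyzer against sample text via the _analyze API (the index name "products" is just a placeholder here):

```json
GET /products/_analyze
{
  "analyzer": "ngram_tokenizer_analyzer",
  "text": "Exam"
}
```

With the lowercase filter in place, the returned tokens should all be lower-cased, e.g. "exa", "xam", "exam".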


Now the more complicated part: NGram tokenizers. Again, for a deeper understanding, you might want to read the docs. But coming back to your problem: your tokenizer will essentially create terms of length 3 to 10. Which means the text

I am an example TEXT.

will basically create a lot of tokens. Just to show a few:

  • Size 3: "I a", " am", "am ", ..., "TEX", "EXT"
  • Size 4: "I am", " am ", "am a", ..., " TEX", "TEXT"
  • Size 10: "I am an ex", ...

You get the idea. (The lowercase token filter would then lowercase these tokens.)
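To make that concrete, here is a minimal, self-contained sketch of what an ngram tokenizer does with those settings (illustrative only; Elasticsearch's actual tokenizer additionally lets you configure character classes and token boundaries):

```java
import java.util.ArrayList;
import java.util.List;

public class NgramDemo {

    // Emit every substring of length minGram..maxGram, like the ngram tokenizer.
    static List<String> ngrams(String text, int minGram, int maxGram) {
        List<String> out = new ArrayList<>();
        for (int len = minGram; len <= maxGram; len++) {
            for (int i = 0; i + len <= text.length(); i++) {
                out.add(text.substring(i, i + len));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = ngrams("I am an example TEXT.", 3, 10);
        System.out.println(tokens.size());           // total number of tokens produced
        System.out.println(tokens.contains("TEX"));  // true
        System.out.println(tokens.contains("exam")); // true
    }
}
```

Even for this short sentence the tokenizer emits well over a hundred terms, which is why ngram indexes grow quickly.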

Difference between match and term queries: match queries are analyzed, while term queries are not. In fact, that means your match queries can match multiple terms. Example: you match "exam".

This would in fact match 3 terms: "exa", "xam" and "exam".

This has influence on the score of the matches: the more matches, the higher the score. In some cases that's desired, in other cases not.

A term query is not analyzed, which means "exam" would match, but only one term ("exam", of course). However, since it's not analyzed, it's also not lowercased, meaning you have to do that in code yourself. "Exam" would never match, because if you use the lowercase token filter there are no terms with capital letters in your index.

I'm not sure about your use-case, but I have a feeling that you could (or even want to) use term queries here. But be aware: there are no terms in your index with a size bigger than 10, because that's what your ngram tokenizer produces.
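Concretely, the wildcard queries from the question could be replaced by analyzed match queries against the same ngram-indexed fields. A hedged sketch in query DSL, showing only three of the fields (field names and boost priorities are taken from the question; "exam" stands in for the user's search term):

```json
{
  "query": {
    "bool": {
      "should": [
        { "match": { "name":        { "query": "exam", "boost": 7 } } },
        { "match": { "description": { "query": "exam", "boost": 6 } } },
        { "match": { "modelNumber": { "query": "exam", "boost": 5 } } }
      ]
    }
  }
}
```

Because the fields are indexed with the ngram analyzer, a match query already finds substring occurrences of at least 3 characters, without the cost of a leading-wildcard scan over every term.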

/EDIT:

Something worth pointing out regarding match queries, and the reason why you might want to use term queries: a match query for "Simple" will also match "mple" from "example".
