Searching for a substring in Elasticsearch (Java)

Question

I am working with Elasticsearch and am trying to find a substring inside a field. For example, searching for the string tac in stack overflow. I am using MultiMatchQuery for this, but it does not work. Here is a snippet of my code (first_name is the field name).

searchString = "*" + searchString.toLowerCase() + "*";
MultiMatchQueryBuilder mqb = new MultiMatchQueryBuilder("irs", first_name);
mqb.type(MultiMatchQueryBuilder.Type.PHRASE);
BoolQueryBuilder searchQuery = boolQuery();
searchQuery.should(mqb);
NativeSearchQueryBuilder queryBuilder = new NativeSearchQueryBuilder();
queryBuilder.withQuery(searchQuery);
NativeSearchQuery query = queryBuilder.build();

When I search for tac, it does not return any results. When I search for stack or overflow, it does return stack overflow.

So it seems to look for the exact string. I tried using MultiMatchQueryBuilder.Type.PHRASE_PREFIX, but that only finds phrases starting with the substring. It works with strings like stac or overf but not tac or tack.

Any suggestions on how to fix it?

Answer 1

A match query is analyzed: by default, the same analyzer that was applied at index time is also applied to the query text. I believe you are using the standard analyzer, which generates the tokens below.

POST http://localhost:9200/_analyze

{
    "text": "stack overflow",
    "analyzer" : "standard"
}

{
    "tokens": [
        {
            "token": "stack",
            "start_offset": 0,
            "end_offset": 5,
            "type": "<ALPHANUM>",
            "position": 0
        },
        {
            "token": "overflow",
            "start_offset": 6,
            "end_offset": 14,
            "type": "<ALPHANUM>",
            "position": 1
        }
    ]
}
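To make the consequence of those two tokens concrete, here is a minimal plain-Java sketch of whole-token matching. It is not Elasticsearch code: `analyze` is a hypothetical, simplified stand-in for the standard analyzer (the real one uses Unicode text segmentation), and `matches` only mimics the idea that a query term has to equal an indexed token.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustration only: a simplified stand-in for the standard analyzer.
final class WholeTokenMatchSketch {

    // Lowercase the text and split it on anything that is not a letter or digit.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.toLowerCase(Locale.ROOT).split("[^\\p{Alnum}]+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    // The document matches when at least one query token equals an indexed token.
    static boolean matches(String indexedText, String queryText) {
        List<String> indexed = analyze(indexedText);
        for (String queryToken : analyze(queryText)) {
            if (indexed.contains(queryToken)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(analyze("stack overflow"));          // [stack, overflow]
        System.out.println(matches("stack overflow", "stack")); // true
        System.out.println(matches("stack overflow", "tac"));   // false
    }
}
```

The token tac is compared against stack and overflow as whole strings, so it never matches, even though it is a substring of stack.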

Hence a search for tac doesn't match any token in the index. You need to change the analyzer so that the tokens produced at query time match tokens produced at index time.

N-gram analysis (an n-gram tokenizer or token filter) can solve the issue, because it indexes the substrings of each word as separate tokens.

Example

Index mapping
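As a rough illustration of what n-gram analysis produces, the plain-Java sketch below enumerates the n-grams of a single token. The class and method names are hypothetical, and the gram lengths 1 to 10 are only illustrative values; Elasticsearch does this internally in its analysis chain.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Illustration only: enumerate the character n-grams of one token.
final class NGramSketch {

    // Every substring of token whose length is between minGram and maxGram.
    static Set<String> ngrams(String token, int minGram, int maxGram) {
        Set<String> grams = new LinkedHashSet<>();
        for (int start = 0; start < token.length(); start++) {
            for (int len = minGram; len <= maxGram && start + len <= token.length(); len++) {
                grams.add(token.substring(start, start + len));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        Set<String> grams = ngrams("stack", 1, 10);
        System.out.println(grams);                 // s, st, sta, stac, stack, t, ta, tac, ...
        System.out.println(grams.contains("tac")); // true
    }
}
```

Once tac is stored as a token of its own, a plain term comparison against the query text tac succeeds.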

{
  "settings": {
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type": "ngram",
          "min_gram": 1,
          "max_gram": 10
        }
      },
      "analyzer": {
        "autocomplete": { 
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "autocomplete_filter"
          ]
        }
      }
    },
    "index.max_ngram_diff" : 10
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "autocomplete", 
        "search_analyzer": "standard" 
      }
    }
  }
}
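The mapping indexes with the n-gram analyzer but searches with the standard analyzer. A likely reason, sketched below in plain Java under simplified assumptions (the `ngrams` and `overlaps` helpers and the sample word kitten are hypothetical, not part of Elasticsearch): if the query text were also split into n-grams, single-letter grams such as t would match almost every document.

```java
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

// Illustration only: why the query side is not n-grammed.
final class SearchAnalyzerSketch {

    // Every substring of token whose length is between minGram and maxGram.
    static Set<String> ngrams(String token, int minGram, int maxGram) {
        Set<String> grams = new LinkedHashSet<>();
        for (int start = 0; start < token.length(); start++) {
            for (int len = minGram; len <= maxGram && start + len <= token.length(); len++) {
                grams.add(token.substring(start, start + len));
            }
        }
        return grams;
    }

    // True when at least one query token is also an indexed token.
    static boolean overlaps(Set<String> indexed, Set<String> query) {
        for (String q : query) {
            if (indexed.contains(q)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Index-time tokens of an unrelated document whose title is "kitten".
        Set<String> indexed = ngrams("kitten", 1, 10);

        // Query kept whole, as a standard search analyzer would do for one word.
        Set<String> wholeQuery = Collections.singleton("tac");
        // Query split into n-grams, as if the n-gram analyzer were used at search time too.
        Set<String> grammedQuery = ngrams("tac", 1, 10);

        System.out.println(overlaps(indexed, wholeQuery));   // false: "tac" is not a substring of "kitten"
        System.out.println(overlaps(indexed, grammedQuery)); // true: the gram "t" matches
    }
}
```

Keeping the query as the single token tac means only documents that actually contain that substring are returned.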

Index a sample document

{
   "title" :  "stack overflow"
}

Search query

{
    "query": {
        "match": {
            "title": "tac"
        }
    }
}

Search result

"hits": [
    {
        "_index": "65241835",
        "_type": "_doc",
        "_id": "1",
        "_score": 0.4739784,
        "_source": {
            "title": "stack overflow"
        }
    }
]

Answer 1
1 accepted 2020-12-11 02:10:08
