
Search for a substring in Elasticsearch (Java)

I am working with Elasticsearch and am trying to find a substring inside a field, for example matching the string tac in stack overflow. I am using a MultiMatchQuery for this, but it does not work. Here is a snippet of my code (first_name is the field name).

searchString = "*" + searchString.toLowerCase() + "*";
MultiMatchQueryBuilder mqb = new MultiMatchQueryBuilder("irs", first_name);
mqb.type(MultiMatchQueryBuilder.Type.PHRASE);
BoolQueryBuilder searchQuery = boolQuery();
searchQuery.should(mqb);
NativeSearchQueryBuilder queryBuilder = new NativeSearchQueryBuilder();
queryBuilder.withQuery(searchQuery);
NativeSearchQuery query = queryBuilder.build();

When I search for tac, it does not return any results. When I search for stack or overflow, it does return stack overflow.

So it looks for the exact string. I tried using MultiMatchQueryBuilder.Type.PHRASE_PREFIX, but that only matches phrases starting with the substring: it works with strings like stac or overf, but not with tac or tack.

Any suggestions on how to fix it?

A match query is analyzed: by default, the query text goes through the same analyzer that was applied to the field at index time. I believe you are using the standard analyzer, which generates the tokens below:

POST http://localhost:9200/_analyze

{
    "text": "stack overflow",
    "analyzer" : "standard"
}

{
    "tokens": [
        {
            "token": "stack",
            "start_offset": 0,
            "end_offset": 5,
            "type": "<ALPHANUM>",
            "position": 0
        },
        {
            "token": "overflow",
            "start_offset": 6,
            "end_offset": 14,
            "type": "<ALPHANUM>",
            "position": 1
        }
    ]
}

Hence, searching for tac doesn't match any token in the index. You need to change the analyzer so that the tokens produced at query time match the tokens produced at index time.
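To make the mismatch concrete, here is a minimal plain-Java sketch of the token comparison. It only approximates the standard analyzer (split on anything that is not a letter or digit, then lowercase); the real tokenizer follows Unicode segmentation rules, and the class and method names are hypothetical, not part of any Elasticsearch API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class StandardTokenSketch {
    // Rough stand-in for the standard analyzer (an assumption, not Lucene's code):
    // split on runs of non-letter/non-digit characters and lowercase each piece.
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        for (String part : text.split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) {
                out.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> indexed = tokens("stack overflow");
        // Whole-token lookups succeed, substring lookups do not.
        System.out.println(indexed.contains("stack"));
        System.out.println(indexed.contains("tac"));
        // The asterisks added around the search string are stripped as punctuation,
        // so "*tac*" is analyzed to the plain token "tac".
        System.out.println(tokens("*tac*"));
    }
}
```

In this sketch the query token tac is compared against whole indexed tokens only, which is why wrapping the search string in asterisks has no effect on a match or multi_match query.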

N-grams can solve the issue. The example below uses an n-gram token filter at index time.

Example

Index mapping

{
  "settings": {
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type": "ngram",
          "min_gram": 1,
          "max_gram": 10
        }
      },
      "analyzer": {
        "autocomplete": { 
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "autocomplete_filter"
          ]
        }
      }
    },
    "index.max_ngram_diff" : 10
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "autocomplete", 
        "search_analyzer": "standard" 
      }
    }
  }
}
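As a rough illustration of what this mapping stores at index time, the following plain-Java sketch enumerates the n-grams of a single lowercased token using the same min_gram and max_gram values. It is an approximation for intuition only; the class and method names are hypothetical, and the real filter also records positions and offsets.

```java
import java.util.LinkedHashSet;
import java.util.Set;

class NGramSketch {
    // Collect every substring of the token whose length lies in [minGram, maxGram].
    static Set<String> ngrams(String token, int minGram, int maxGram) {
        Set<String> grams = new LinkedHashSet<>();
        for (int start = 0; start < token.length(); start++) {
            for (int len = minGram; len <= maxGram && start + len <= token.length(); len++) {
                grams.add(token.substring(start, start + len));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        // Values taken from the mapping above: min_gram = 1, max_gram = 10.
        Set<String> grams = ngrams("stack", 1, 10);
        System.out.println(grams);
        System.out.println(grams.contains("tac"));
    }
}
```

Because every substring of each token becomes its own term, the index grows noticeably with min_gram set to 1; raising min_gram is a common trade-off when single-character matches are not needed.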

Index sample doc

{
   "title" :  "stack overflow"
}

And search query

{
    "query": {
        "match": {
            "title": "tac"
        }
    }
}

And search result

"hits": [
    {
        "_index": "65241835",
        "_type": "_doc",
        "_id": "1",
        "_score": 0.4739784,
        "_source": {
            "title": "stack overflow"
        }
    }
]
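The reason this hit comes back can be sketched end to end in plain Java: the index side stores n-grams of each token, the search side keeps whole lowercased tokens, and a match needs at least one search token to equal a stored term. This is a simplified model under those assumptions, not the Elasticsearch implementation; all names are hypothetical and scoring is ignored.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class SubstringMatchSketch {
    // Approximation of the standard analyzer: split on non-alphanumerics, lowercase.
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        for (String part : text.split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) {
                out.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return out;
    }

    // Index-time side: n-grams of every token, as the "autocomplete" analyzer would store.
    static Set<String> indexTerms(String text, int minGram, int maxGram) {
        Set<String> terms = new HashSet<>();
        for (String token : tokens(text)) {
            for (int start = 0; start < token.length(); start++) {
                for (int len = minGram; len <= maxGram && start + len <= token.length(); len++) {
                    terms.add(token.substring(start, start + len));
                }
            }
        }
        return terms;
    }

    // Search-time side: whole tokens only; any single token hit counts (OR semantics).
    static boolean matches(String query, Set<String> indexedTerms) {
        for (String term : tokens(query)) {
            if (indexedTerms.contains(term)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> terms = indexTerms("stack overflow", 1, 10);
        System.out.println(matches("tac", terms));  // substring inside one token
        System.out.println(matches("kov", terms));  // spans the space between tokens
    }
}
```

Two limits follow from this model: a substring that spans two words (such as kov) is never stored as a term, and a search token longer than max_gram (10 here) cannot equal any stored gram, so very long search words would need a larger max_gram.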
