
How to match terms with spaces in elasticsearch?

I have a content field (string) indexed in elasticsearch. The analyzer is the default one - the standard analyzer.

When I use a match query to search:

{"query":{"match":{"content":{"query":"micro soft","operator":"and"}}}}

The result shows it can't match "microsoft".

So how can I use the input keyword "micro soft" to match documents whose content contains "microsoft"?

Another solution to this is to use the nGram token filter, which would allow you to have a more "fuzzy" match.

Using your example of "microsoft" and "micro soft", here is an example of how an ngram token filter would break down the tokens:

POST /test
{
  "settings": {
    "analysis": {
      "filter": {
        "my_ngrams": {
          "type": "ngram",
          "min_gram": "3",
          "max_gram": "5"
        }
      },
      "analyzer" : {
        "my_analyzer" : {
          "type" : "custom",
          "tokenizer" : "standard",
          "filter": ["my_ngrams"]
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "body": {
          "type": "string",
          "analyzer": "my_analyzer"
        }
      }
    }
  }
}

And analyzing the two inputs:

curl '0:9200/test/_analyze?field=body&pretty' -d'microsoft'
{
  "tokens" : [ {
    "token" : "mic"
  }, {
    "token" : "micr"
  }, {
    "token" : "micro"
  }, {
    "token" : "icr"
  }, {
    "token" : "icro"
  }, {
    "token" : "icros"
  }, {
    "token" : "cro"
  }, {
    "token" : "cros"
  }, {
    "token" : "croso"
  }, {
    "token" : "ros"
  }, {
    "token" : "roso"
  }, {
    "token" : "rosof"
  }, {
    "token" : "oso"
  }, {
    "token" : "osof"
  }, {
    "token" : "osoft"
  }, {
    "token" : "sof"
  }, {
    "token" : "soft"
  }, {
    "token" : "oft"
  } ]
}

curl '0:9200/test/_analyze?field=body&pretty' -d'micro soft'
{
  "tokens" : [ {
    "token" : "mic"
  }, {
    "token" : "micr"
  }, {
    "token" : "micro"
  }, {
    "token" : "icr"
  }, {
    "token" : "icro"
  }, {
    "token" : "cro"
  }, {
    "token" : "sof"
  }, {
    "token" : "soft"
  }, {
    "token" : "oft"
  } ]
}

(I cut out some of the output; the full output is here: https://gist.github.com/dakrone/10abb4a0cfe8ce8636ad )

As you can see, since the ngram terms for "microsoft" and "micro soft" overlap, you will be able to find matches for searches like this.
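Putting the pieces together, a plain match query against the ngram-analyzed field should then connect the two spellings, because the same analyzer also runs on the query text at search time. A minimal sketch, reusing the index and field names from the mapping above:

```json
POST /test/_search
{
  "query": {
    "match": {
      "body": "micro soft"
    }
  }
}
```

Note that ngram matching tends to match broadly (any document sharing a 3-gram will score), so you will usually want to rely on relevance ranking, or tighten the query with a minimum_should_match setting.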

Another approach to this problem is word decomposition. You can either use a dictionary-based approach, the Compound Word Token Filter, or use a plugin which decomposes words algorithmically: the Decompound plugin.

The word microsoft would, for example, be split into the following tokens:

{
   "tokens": [
      {
         "token": "microsoft"
      },
      {
         "token": "micro"
      },
      {
         "token": "soft"
      }
   ]
}

These tokens will allow you to search for partial words as you asked.

Compared to the ngrams approach mentioned in the other answer, this approach will result in higher precision with only slightly lower recall.
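For the dictionary-based variant, the setup looks much like the ngram example above, but with a dictionary_decompounder token filter instead. A minimal sketch, assuming a hand-maintained word list (the index name, analyzer name, and word list here are illustrative, not from the original answer):

```json
POST /decompound_test
{
  "settings": {
    "analysis": {
      "filter": {
        "my_decompounder": {
          "type": "dictionary_decompounder",
          "word_list": ["micro", "soft"]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "my_decompounder"]
        }
      }
    }
  }
}
```

The filter keeps the original token and additionally emits every dictionary word it finds inside it, which is how "microsoft" yields the micro and soft subtokens shown above. The trade-off is that the word list has to be maintained; the algorithmic Decompound plugin avoids that at the cost of occasional wrong splits.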

Try an ES wildcard query, as below:

{
  "query": {
    "bool": {
      "must": {
        "wildcard": { "content": "micro*soft" }
      }
    }
  }
}
