Elasticsearch, how to concatenate words then ngram it?

I'd like to concatenate words and then ngram them. What's the correct setting for Elasticsearch?

In English,

from: stack overflow

==> stackoverflow : concatenate first,

==> sta / tac / ack / cko / kov / ... etc. (min_gram: 3, max_gram: 10)

To do the concatenation I'm assuming that you just want to remove all spaces from your input data. To do this, you need to implement a pattern_replace char filter that replaces each space with nothing: with the pattern "\u0020" (a single space) and an empty replacement, the char filter turns "stack overflow" into "stackoverflow" before the tokenizer ever sees it.

Setting up the ngram tokenizer should be easy: just specify your minimum and maximum token lengths.

It's worth adding a lowercase token filter too, to make searching case-insensitive.

curl -XPOST localhost:9200/my_index -d '{
  "index": {
    "analysis": {
      "analyzer": {
        "my_new_analyzer": {
          "type": "custom",
          "char_filter": ["my_pattern"],
          "tokenizer": "my_ngram_tokenizer",
          "filter": ["lowercase"]
        }
      },
      "char_filter": {
        "my_pattern": {
          "type": "pattern_replace",
          "pattern": "\u0020",
          "replacement": ""
        }
      },
      "tokenizer": {
        "my_ngram_tokenizer": {
          "type": "nGram",
          "min_gram": "3",
          "max_gram": "10",
          "token_chars": ["letter", "digit", "punctuation", "symbol"]
        }
      }
    }
  }
}'

Testing this:

curl -XGET 'localhost:9200/my_index/_analyze?analyzer=my_new_analyzer&pretty' -d 'stack overflow'

gives the following (only a small part is shown below):

{
"tokens" : [ {
  "token" : "sta",
  "start_offset" : 0,
  "end_offset" : 3,
  "type" : "word",
  "position" : 1
}, {
  "token" : "stac",
  "start_offset" : 0,
  "end_offset" : 4,
  "type" : "word",
  "position" : 2
}, {
  "token" : "stack",
  "start_offset" : 0,
  "end_offset" : 6,
  "type" : "word",
  "position" : 3
}, {
  "token" : "stacko",
  "start_offset" : 0,
  "end_offset" : 7,
  "type" : "word",
  "position" : 4
}, {
  "token" : "stackov",
  "start_offset" : 0,
  "end_offset" : 8,
  "type" : "word",
  "position" : 5
}, {
  ...
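
To actually use this analyzer you still need to attach it to a field in your mapping. Here is a minimal sketch, assuming a type named my_type and a field named title (both names are placeholders, not part of the original answer):

# apply the custom analyzer to a (hypothetical) title field
curl -XPUT 'localhost:9200/my_index/_mapping/my_type' -d '{
  "my_type": {
    "properties": {
      "title": {
        "type": "string",
        "analyzer": "my_new_analyzer"
      }
    }
  }
}'

Once documents are indexed with this mapping, an ordinary match query should find substrings of three characters or more, for example:

# "ckov" is an inner fragment of "stackoverflow"
curl -XGET 'localhost:9200/my_index/_search?pretty' -d '{
  "query": {
    "match": {
      "title": "ckov"
    }
  }
}'

Note that the query string is run through the same ngram analyzer at search time, so short queries can match more documents than you expect; if that matters, configure a separate search_analyzer without ngrams.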
