Elasticsearch, how to concatenate words then ngram it?
I'd like to concatenate words and then ngram them. What's the correct setting for elasticsearch?

In English, from:

stack overflow ==> stackoverflow : concatenate first,
==> sta / tac / ack / cko / kov / ... etc. (min_gram: 3, max_gram: 10)
To do the concatenation I'm assuming that you just want to remove all spaces from your input data. To do this, you need a pattern_replace char filter that replaces each space with nothing.

Setting up the ngram tokenizer should be easy - just specify your token min/max lengths. It's worth adding a lowercase token filter too, to make searching case-insensitive.
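To see what that pipeline produces without a running cluster, here is a minimal local sketch (plain Python, not Elasticsearch itself) of the three stages just described: strip spaces (the pattern_replace char filter), lowercase, then emit every substring of length 3 to 10 (the ngram tokenizer):

```python
import re

def analyze(text, min_gram=3, max_gram=10):
    """Rough local simulation of the custom analyzer described above."""
    # char filter: replace every space (\u0020) with nothing, then lowercase
    s = re.sub(r" ", "", text).lower()
    # ngram tokenizer: every substring with length in [min_gram, max_gram]
    tokens = []
    for start in range(len(s)):
        for size in range(min_gram, max_gram + 1):
            if start + size <= len(s):
                tokens.append(s[start:start + size])
    return tokens

print(analyze("stack overflow")[:5])
# → ['sta', 'stac', 'stack', 'stacko', 'stackov']
```

This mirrors the ordering Elasticsearch reports below: for each start offset, grams are emitted from shortest to longest.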
curl -XPOST localhost:9200/my_index -d '{
  "index": {
    "analysis": {
      "analyzer": {
        "my_new_analyzer": {
          "filter": ["lowercase"],
          "tokenizer": "my_ngram_tokenizer",
          "char_filter": ["my_pattern"],
          "type": "custom"
        }
      },
      "char_filter": {
        "my_pattern": {
          "type": "pattern_replace",
          "pattern": "\u0020",
          "replacement": ""
        }
      },
      "tokenizer": {
        "my_ngram_tokenizer": {
          "type": "nGram",
          "min_gram": "3",
          "max_gram": "10",
          "token_chars": ["letter", "digit", "punctuation", "symbol"]
        }
      }
    }
  }
}'
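For the analyzer to actually affect indexing, it also has to be attached to a field in the mapping. A hedged sketch in the same pre-2.x Elasticsearch style as the answer (the type and field names `my_type` / `title` are placeholders, not from the original):

```shell
curl -XPUT 'localhost:9200/my_index/_mapping/my_type' -d '{
  "my_type": {
    "properties": {
      "title": {
        "type": "string",
        "index_analyzer": "my_new_analyzer",
        "search_analyzer": "standard"
      }
    }
  }
}'
```

Using a different search_analyzer here is a common pattern, so that queries are not themselves split into ngrams.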
Testing this:

curl -XGET 'localhost:9200/my_index/_analyze?analyzer=my_new_analyzer&pretty' -d 'stack overflow'

gives the following (just a small part shown below):
{
  "tokens" : [ {
    "token" : "sta",
    "start_offset" : 0,
    "end_offset" : 3,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "stac",
    "start_offset" : 0,
    "end_offset" : 4,
    "type" : "word",
    "position" : 2
  }, {
    "token" : "stack",
    "start_offset" : 0,
    "end_offset" : 6,
    "type" : "word",
    "position" : 3
  }, {
    "token" : "stacko",
    "start_offset" : 0,
    "end_offset" : 7,
    "type" : "word",
    "position" : 4
  }, {
    "token" : "stackov",
    "start_offset" : 0,
    "end_offset" : 8,
    "type" : "word",
    "position" : 5
  }, ...