How to allow searching with special characters in Elasticsearch using the Attachment plugin?
I'm working on a Spring Boot project based on JHipster.
I'm using Elasticsearch 6.8.6 with the Attachment plugin. In it, the content field holds the data of my document.
Now, when I search for '192.168.31.167' it returns the expected result. But when I search for '192.168.31.167:9200' it returns an empty result.
In short, the search does not work with special characters. Can someone guide me on how to deal with this?
Mapping:
{
  "document" : {
    "mappings" : {
      "doc" : {
        "properties" : {
          "attachment" : {
            "properties" : {
              "content" : {
                "type" : "text",
                "fields" : {
                  "keyword" : {
                    "type" : "keyword",
                    "ignore_above" : 256
                  }
                }
              },
              "content_length" : {
                "type" : "long"
              },
              "content_type" : {
                "type" : "text",
                "fields" : {
                  "keyword" : {
                    "type" : "keyword",
                    "ignore_above" : 256
                  }
                }
              }
            }
          },
          "content" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "type" : "keyword",
                "ignore_above" : 256
              }
            }
          },
          "createdDate" : {
            "type" : "date"
          },
          "holder" : {
            "type" : "long"
          },
          "id" : {
            "type" : "long"
          },
          "name" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "type" : "keyword",
                "ignore_above" : 256
              }
            }
          },
          "tag" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "type" : "keyword",
                "ignore_above" : 256
              }
            }
          }
        }
      }
    }
  }
}
Dummy data:
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 3,
    "successful" : 3,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "document",
        "_type" : "doc",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "createdDate" : "2020-05-19T03:56:36+0000",
          "attachment" : {
            "content_type" : "text/plain; charset=ISO-8859-1",
            "content" : "version: '2'\nservices:\n docy-kibana:\n image: docker.elastic.co/kibana/kibana:6.8.6\n ports:\n - 5601:5601\n\n environment:\n SERVER_NAME: kibana.example.org\n ELASTICSEARCH_HOSTS: http://192.168.31.167:9200/\n XPACK_MONITORING_ENABLED: ${true}\n# XPACK_ENCRYPTEDSAVEDOBJECTS.ENCRYPTIONKEY: test\n XPACK_MONITORING_UI_CONTAINER_ELASTICSEARCH_ENABLED: ${true}",
            "content_length" : 390
          },
          "name" : "kibana_3_202005190926.yml",
          "holder" : 3,
          "id" : 1,
          "tag" : "configuration",
          "content" : "dmVyc2lvbjogJzInCnNlcnZpY2VzOgogIGRvY3kta2liYW5hOgogICAgaW1hZ2U6IGRvY2tlci5lbGFzdGljLmNvL2tpYmFuYS9raWJhbmE6Ni44LjYKICAgIHBvcnRzOgogICAgICAtIDU2MDE6NTYwMQoKICAgIGVudmlyb25tZW50OgogICAgICBTRVJWRVJfTkFNRToga2liYW5hLmV4YW1wbGUub3JnCiAgICAgIEVMQVNUSUNTRUFSQ0hfSE9TVFM6IGh0dHA6Ly8xOTIuMTY4LjMxLjE2Nzo5MjAwLwogICAgICBYUEFDS19NT05JVE9SSU5HX0VOQUJMRUQ6ICR7dHJ1ZX0KIyAgICAgIFhQQUNLX0VOQ1JZUFRFRFNBVkVET0JKRUNUUy5FTkNSWVBUSU9OS0VZOiB0ZXN0CiAgICAgIFhQQUNLX01PTklUT1JJTkdfVUlfQ09OVEFJTkVSX0VMQVNUSUNTRUFSQ0hfRU5BQkxFRDogJHt0cnVlfQo="
        }
      }
    ]
  }
}
Elasticsearch request generated by the code:
{
  "bool" : {
    "must" : [
      {
        "bool" : {
          "should" : [
            {
              "query_string" : {
                "query" : "*192.168.31.167:9200*",
                "fields" : [
                  "content^1.0",
                  "name^2.0",
                  "tag^3.0"
                ],
                "type" : "best_fields",
                "default_operator" : "or",
                "max_determinized_states" : 10000,
                "enable_position_increments" : true,
                "fuzziness" : "AUTO",
                "fuzzy_prefix_length" : 0,
                "fuzzy_max_expansions" : 50,
                "phrase_slop" : 0,
                "analyze_wildcard" : true,
                "escape" : false,
                "auto_generate_synonyms_phrase_query" : true,
                "fuzzy_transpositions" : true,
                "boost" : 1.0
              }
            },
            {
              "wildcard" : {
                "attachment.content" : {
                  "wildcard" : "*192.168.31.167:9200*",
                  "boost" : 1.0
                }
              }
            }
          ],
          "adjust_pure_negative" : true,
          "boost" : 1.0
        }
      },
      {
        "bool" : {
          "should" : [
            {
              "wildcard" : {
                "tag.keyword" : {
                  "wildcard" : "*information*",
                  "boost" : 1.0
                }
              }
            },
            {
              "wildcard" : {
                "tag.keyword" : {
                  "wildcard" : "*user*",
                  "boost" : 1.0
                }
              }
            }
          ],
          "adjust_pure_negative" : true,
          "boost" : 1.0
        }
      }
    ],
    "adjust_pure_negative" : true,
    "boost" : 1.0
  }
}
Problem:
You are querying the data with the text field, which uses the standard analyzer and therefore splits the text on `:`, as shown in the analyze API call below:
POST /_analyze
{
  "text" : "127.0.0.1:9200",
  "analyzer" : "standard"
}
Generated tokens:
{
  "tokens": [
    {
      "token": "127.0.0.1",
      "start_offset": 0,
      "end_offset": 9,
      "type": "<NUM>",
      "position": 0
    },
    {
      "token": "9200",
      "start_offset": 10,
      "end_offset": 14,
      "type": "<NUM>",
      "position": 1
    }
  ]
}
Solution 1
Not optimized (wildcard queries on a big index can cause severe performance issues), but since you are already using wildcards it works without changing the analyzer and reindexing all the data (less overhead):
Use the .keyword sub-field that is available on these text fields; it does not split the text into two tokens, as shown below:
{
  "tokens": [
    {
      "token": "127.0.0.1:9200",
      "start_offset": 0,
      "end_offset": 14,
      "type": "word",
      "position": 0
    }
  ]
}
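For reference, that single unsplit token can be reproduced with the built-in keyword analyzer, which emits the whole input as one token:

```json
POST /_analyze
{
  "text" : "127.0.0.1:9200",
  "analyzer" : "keyword"
}
```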
You can add .keyword to the field names as shown below:
"content.keyword^1.0",
"name.keyword^2.0",
"tag.keyword^3.0"
Solution 2
Refer to the solution mentioned in the comment by @val, which involves creating a custom analyzer and reindexing all the data. That creates the expected tokens in the index, so you can then search them without the expensive regex. This performs significantly better on large datasets, at the one-time cost of reindexing everything with the new analyzer and adapting the queries.
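One possible shape of such a custom analyzer (a sketch only; the index name `document_v2` and analyzer name `host_port` are illustrative, and @val's exact suggestion may differ) is a whitespace tokenizer, which keeps "192.168.31.167:9200" as a single token:

```json
PUT /document_v2
{
  "settings": {
    "analysis": {
      "analyzer": {
        "host_port": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [ "lowercase" ]
        }
      }
    }
  }
}
```

After applying the analyzer to the relevant fields in the new index's mapping, the existing data can be copied over with the reindex API:

```json
POST /_reindex
{
  "source": { "index": "document" },
  "dest":   { "index": "document_v2" }
}
```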
Please choose whichever approach better suits your business requirements.