How to allow searching with special characters in Elasticsearch using the Attachment plugin?

I'm working on a Spring Boot / JHipster based project.

I'm using Elasticsearch 6.8.6 with the Attachment plugin. In it, the content field holds my document's data.

Now, when I search for '192.168.31.167', it returns the appropriate result. But when I search for "192.168.31.167:9200", it returns an empty result.

In short, it's not working with special characters. Can someone guide me on how to deal with this?

Mapping:

{
  "document" : {
    "mappings" : {
      "doc" : {
        "properties" : {
          "attachment" : {
            "properties" : {
              "content" : {
                "type" : "text",
                "fields" : {
                  "keyword" : {
                    "type" : "keyword",
                    "ignore_above" : 256
                  }
                }
              },
              "content_length" : {
                "type" : "long"
              },
              "content_type" : {
                "type" : "text",
                "fields" : {
                  "keyword" : {
                    "type" : "keyword",
                    "ignore_above" : 256
                  }
                }
              }
            }
          },
          "content" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "type" : "keyword",
                "ignore_above" : 256
              }
            }
          },
          "createdDate" : {
            "type" : "date"
          },
          "holder" : {
            "type" : "long"
          },
          "id" : {
            "type" : "long"
          },
          "name" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "type" : "keyword",
                "ignore_above" : 256
              }
            }
          },
          "tag" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "type" : "keyword",
                "ignore_above" : 256
              }
            }
          }
        }
      }
    }
  }
}

Dummy Data:

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 3,
    "successful" : 3,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "document",
        "_type" : "doc",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "createdDate" : "2020-05-19T03:56:36+0000",
          "attachment" : {
            "content_type" : "text/plain; charset=ISO-8859-1",
            "content" : "version: '2'\nservices:\n  docy-kibana:\n    image: docker.elastic.co/kibana/kibana:6.8.6\n    ports:\n      - 5601:5601\n\n    environment:\n      SERVER_NAME: kibana.example.org\n      ELASTICSEARCH_HOSTS: http://192.168.31.167:9200/\n      XPACK_MONITORING_ENABLED: ${true}\n#      XPACK_ENCRYPTEDSAVEDOBJECTS.ENCRYPTIONKEY: test\n      XPACK_MONITORING_UI_CONTAINER_ELASTICSEARCH_ENABLED: ${true}",
            "content_length" : 390
          },
          "name" : "kibana_3_202005190926.yml",
          "holder" : 3,
          "id" : 1,
          "tag" : "configuration",
          "content" : "dmVyc2lvbjogJzInCnNlcnZpY2VzOgogIGRvY3kta2liYW5hOgogICAgaW1hZ2U6IGRvY2tlci5lbGFzdGljLmNvL2tpYmFuYS9raWJhbmE6Ni44LjYKICAgIHBvcnRzOgogICAgICAtIDU2MDE6NTYwMQoKICAgIGVudmlyb25tZW50OgogICAgICBTRVJWRVJfTkFNRToga2liYW5hLmV4YW1wbGUub3JnCiAgICAgIEVMQVNUSUNTRUFSQ0hfSE9TVFM6IGh0dHA6Ly8xOTIuMTY4LjMxLjE2Nzo5MjAwLwogICAgICBYUEFDS19NT05JVE9SSU5HX0VOQUJMRUQ6ICR7dHJ1ZX0KIyAgICAgIFhQQUNLX0VOQ1JZUFRFRFNBVkVET0JKRUNUUy5FTkNSWVBUSU9OS0VZOiB0ZXN0CiAgICAgIFhQQUNLX01PTklUT1JJTkdfVUlfQ09OVEFJTkVSX0VMQVNUSUNTRUFSQ0hfRU5BQkxFRDogJHt0cnVlfQo="
        }
      }
    ]
  }
}

Elasticsearch Request generated by code:

{
  "bool" : {
    "must" : [
      {
        "bool" : {
          "should" : [
            {
              "query_string" : {
                "query" : "*192.168.31.167:9200*",
                "fields" : [
                  "content^1.0",
                  "name^2.0",
                  "tag^3.0"
                ],
                "type" : "best_fields",
                "default_operator" : "or",
                "max_determinized_states" : 10000,
                "enable_position_increments" : true,
                "fuzziness" : "AUTO",
                "fuzzy_prefix_length" : 0,
                "fuzzy_max_expansions" : 50,
                "phrase_slop" : 0,
                "analyze_wildcard" : true,
                "escape" : false,
                "auto_generate_synonyms_phrase_query" : true,
                "fuzzy_transpositions" : true,
                "boost" : 1.0
              }
            },
            {
              "wildcard" : {
                "attachment.content" : {
                  "wildcard" : "*192.168.31.167:9200*",
                  "boost" : 1.0
                }
              }
            }
          ],
          "adjust_pure_negative" : true,
          "boost" : 1.0
        }
      },
      {
        "bool" : {
          "should" : [
            {
              "wildcard" : {
                "tag.keyword" : {
                  "wildcard" : "*information*",
                  "boost" : 1.0
                }
              }
            },
            {
              "wildcard" : {
                "tag.keyword" : {
                  "wildcard" : "*user*",
                  "boost" : 1.0
                }
              }
            }
          ],
          "adjust_pure_negative" : true,
          "boost" : 1.0
        }
      }
    ],
    "adjust_pure_negative" : true,
    "boost" : 1.0
  }
}

Problem:

You are using text fields to query the data. These use the standard analyzer, which splits the text on :, as shown in the following _analyze API call:

POST /_analyze
{
    "text" : "127.0.0.1:9200",
    "analyzer" : "standard"
}

Generated tokens:

{
    "tokens": [
        {
            "token": "127.0.0.1",
            "start_offset": 0,
            "end_offset": 9,
            "type": "<NUM>",
            "position": 0
        },
        {
            "token": "9200",
            "start_offset": 10,
            "end_offset": 14,
            "type": "<NUM>",
            "position": 1
        }
    ]
}

Solution 1

Not optimized (wildcard queries on a large index can cause severe performance issues), but since you are already using wildcards, it works without changing the analyzer and reindexing all the data (less overhead):

Use the .keyword sub-field that is available on these text fields; it does not split the text into two tokens, as shown below.
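For reference, the single-token output below can be reproduced with the same _analyze API using the keyword analyzer, which, like a keyword field, indexes the entire value as one token:

POST /_analyze
{
    "text" : "127.0.0.1:9200",
    "analyzer" : "keyword"
}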

{
    "tokens": [
        {
            "token": "127.0.0.1:9200",
            "start_offset": 0,
            "end_offset": 14,
            "type": "word",
            "position": 0
        }
    ]
}

You can add .keyword as shown below:

             "content.keyword^1.0",
              "name.keyword^2.0",
              "tag.keyword^3.0"

Solution 2

Refer to the solution mentioned in the comment by @val, which involves creating a custom analyzer and reindexing all the data. That creates the expected tokens in the index, so you can then search on them without using the expensive regex; a sketch follows below. This performs significantly better on large datasets, at the one-time cost of reindexing everything with the new analyzer and adjusting the queries.
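The referenced comment is not reproduced here, so the following is only a minimal sketch of that idea, assuming an ngram-based index-time analyzer (a common way to support substring search without wildcards). The index name document_v2, the analyzer names, and the gram sizes are all illustrative, not part of the original answer:

PUT /document_v2
{
  "settings" : {
    "index.max_ngram_diff" : 17,
    "analysis" : {
      "tokenizer" : {
        "substring_tokenizer" : {
          "type" : "ngram",
          "min_gram" : 3,
          "max_gram" : 20,
          "token_chars" : []
        }
      },
      "analyzer" : {
        "substring_analyzer" : {
          "type" : "custom",
          "tokenizer" : "substring_tokenizer",
          "filter" : [ "lowercase" ]
        },
        "lowercase_keyword" : {
          "type" : "custom",
          "tokenizer" : "keyword",
          "filter" : [ "lowercase" ]
        }
      }
    }
  },
  "mappings" : {
    "doc" : {
      "properties" : {
        "attachment" : {
          "properties" : {
            "content" : {
              "type" : "text",
              "analyzer" : "substring_analyzer",
              "search_analyzer" : "lowercase_keyword"
            }
          }
        }
      }
    }
  }
}

POST /_reindex
{
  "source" : { "index" : "document" },
  "dest" : { "index" : "document_v2" }
}

After reindexing, a plain match query can find the term without wildcards, because every lowercased substring of up to 20 characters is indexed as a token (the 19-character search term fits):

GET /document_v2/_search
{
  "query" : {
    "match" : {
      "attachment.content" : "192.168.31.167:9200"
    }
  }
}

Ngrams of this size do inflate the index considerably, which adds to the one-time overhead mentioned above.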

Please choose whichever approach suits your business requirements better.
