
How to allow searching with special characters in Elasticsearch using the Attachment plugin?

I'm working on a Spring Boot / JHipster based project.

I'm using Elasticsearch 6.8.6 with the Attachment plugin. The content field holds my document's data.

Now, when I search for '192.168.31.167' it returns the appropriate result. But when I search for "192.168.31.167:9200" it returns an empty result.

In short, it's not working with special characters. Can someone guide me on how to deal with this?

Mapping:

{
  "document" : {
    "mappings" : {
      "doc" : {
        "properties" : {
          "attachment" : {
            "properties" : {
              "content" : {
                "type" : "text",
                "fields" : {
                  "keyword" : {
                    "type" : "keyword",
                    "ignore_above" : 256
                  }
                }
              },
              "content_length" : {
                "type" : "long"
              },
              "content_type" : {
                "type" : "text",
                "fields" : {
                  "keyword" : {
                    "type" : "keyword",
                    "ignore_above" : 256
                  }
                }
              }
            }
          },
          "content" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "type" : "keyword",
                "ignore_above" : 256
              }
            }
          },
          "createdDate" : {
            "type" : "date"
          },
          "holder" : {
            "type" : "long"
          },
          "id" : {
            "type" : "long"
          },
          "name" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "type" : "keyword",
                "ignore_above" : 256
              }
            }
          },
          "tag" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "type" : "keyword",
                "ignore_above" : 256
              }
            }
          }
        }
      }
    }
  }
}

Dummy Data:

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 3,
    "successful" : 3,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "document",
        "_type" : "doc",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "createdDate" : "2020-05-19T03:56:36+0000",
          "attachment" : {
            "content_type" : "text/plain; charset=ISO-8859-1",
            "content" : "version: '2'\nservices:\n  docy-kibana:\n    image: docker.elastic.co/kibana/kibana:6.8.6\n    ports:\n      - 5601:5601\n\n    environment:\n      SERVER_NAME: kibana.example.org\n      ELASTICSEARCH_HOSTS: http://192.168.31.167:9200/\n      XPACK_MONITORING_ENABLED: ${true}\n#      XPACK_ENCRYPTEDSAVEDOBJECTS.ENCRYPTIONKEY: test\n      XPACK_MONITORING_UI_CONTAINER_ELASTICSEARCH_ENABLED: ${true}",
            "content_length" : 390
          },
          "name" : "kibana_3_202005190926.yml",
          "holder" : 3,
          "id" : 1,
          "tag" : "configuration",
          "content" : "dmVyc2lvbjogJzInCnNlcnZpY2VzOgogIGRvY3kta2liYW5hOgogICAgaW1hZ2U6IGRvY2tlci5lbGFzdGljLmNvL2tpYmFuYS9raWJhbmE6Ni44LjYKICAgIHBvcnRzOgogICAgICAtIDU2MDE6NTYwMQoKICAgIGVudmlyb25tZW50OgogICAgICBTRVJWRVJfTkFNRToga2liYW5hLmV4YW1wbGUub3JnCiAgICAgIEVMQVNUSUNTRUFSQ0hfSE9TVFM6IGh0dHA6Ly8xOTIuMTY4LjMxLjE2Nzo5MjAwLwogICAgICBYUEFDS19NT05JVE9SSU5HX0VOQUJMRUQ6ICR7dHJ1ZX0KIyAgICAgIFhQQUNLX0VOQ1JZUFRFRFNBVkVET0JKRUNUUy5FTkNSWVBUSU9OS0VZOiB0ZXN0CiAgICAgIFhQQUNLX01PTklUT1JJTkdfVUlfQ09OVEFJTkVSX0VMQVNUSUNTRUFSQ0hfRU5BQkxFRDogJHt0cnVlfQo="
        }
      }
    ]
  }
}

Elasticsearch Request generated by code:

{
  "bool" : {
    "must" : [
      {
        "bool" : {
          "should" : [
            {
              "query_string" : {
                "query" : "*192.168.31.167:9200*",
                "fields" : [
                  "content^1.0",
                  "name^2.0",
                  "tag^3.0"
                ],
                "type" : "best_fields",
                "default_operator" : "or",
                "max_determinized_states" : 10000,
                "enable_position_increments" : true,
                "fuzziness" : "AUTO",
                "fuzzy_prefix_length" : 0,
                "fuzzy_max_expansions" : 50,
                "phrase_slop" : 0,
                "analyze_wildcard" : true,
                "escape" : false,
                "auto_generate_synonyms_phrase_query" : true,
                "fuzzy_transpositions" : true,
                "boost" : 1.0
              }
            },
            {
              "wildcard" : {
                "attachment.content" : {
                  "wildcard" : "*192.168.31.167:9200*",
                  "boost" : 1.0
                }
              }
            }
          ],
          "adjust_pure_negative" : true,
          "boost" : 1.0
        }
      },
      {
        "bool" : {
          "should" : [
            {
              "wildcard" : {
                "tag.keyword" : {
                  "wildcard" : "*information*",
                  "boost" : 1.0
                }
              }
            },
            {
              "wildcard" : {
                "tag.keyword" : {
                  "wildcard" : "*user*",
                  "boost" : 1.0
                }
              }
            }
          ],
          "adjust_pure_negative" : true,
          "boost" : 1.0
        }
      }
    ],
    "adjust_pure_negative" : true,
    "boost" : 1.0
  }
}

Problem:

You are querying a text field, which uses the standard analyzer and splits the text on :, as shown in the analyze API call below:

POST /_analyze
{
    "text" : "127.0.0.1:9200",
    "analyzer" : "standard"
}

Generated tokens:

{
    "tokens": [
        {
            "token": "127.0.0.1",
            "start_offset": 0,
            "end_offset": 9,
            "type": "<NUM>",
            "position": 0
        },
        {
            "token": "9200",
            "start_offset": 10,
            "end_offset": 14,
            "type": "<NUM>",
            "position": 1
        }
    ]
}
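
Because the wildcard query is a term-level query, the pattern has to match a single token in the inverted index, so a pattern that spans both tokens above can never match. For example (a sketch against the document index from your data), the query below finds nothing, whereas a pattern like *9200* would match:

GET /document/_search
{
    "query": {
        "wildcard": {
            "attachment.content": "*31.167:9200*"
        }
    }
}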

Solution - 1

Not optimized (a wildcard query on a bigger index can cause severe performance issues), but since you are already using wildcards it will work without changing the analyzer and reindexing the whole data (less overhead):

Use the .keyword sub-field, which is available on these text fields; it does not split the text into two tokens, as shown below:
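
For comparison, the keyword analyzer (which is effectively how a keyword field stores its value) keeps the whole string as a single token; the analyze call below reproduces the output that follows:

POST /_analyze
{
    "text" : "127.0.0.1:9200",
    "analyzer" : "keyword"
}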

{
    "tokens": [
        {
            "token": "127.0.0.1:9200",
            "start_offset": 0,
            "end_offset": 14,
            "type": "word",
            "position": 0
        }
    ]
}

You can add .keyword as shown below:

             "content.keyword^1.0",
              "name.keyword^2.0",
              "tag.keyword^3.0"

Solution - 2

Refer to the solution mentioned in the comment by @val, which involves creating a custom analyzer and reindexing the whole data. This creates the expected tokens in the index, so you can then search on them without using an expensive regex. It performs significantly better on large datasets, but comes with the one-time overhead of reindexing all the data with the new analyzer and adjusting the queries.
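
For illustration, a minimal sketch of such an index (the index name document_v2, the analyzer and tokenizer names, and the choice of a char_group tokenizer that splits only on whitespace and slashes are assumptions for this example; the actual analyzer suggested in the comment may differ):

PUT /document_v2
{
    "settings" : {
        "analysis" : {
            "tokenizer" : {
                "host_port_tokenizer" : {
                    "type" : "char_group",
                    "tokenize_on_chars" : [ "whitespace", "/" ]
                }
            },
            "analyzer" : {
                "host_port_analyzer" : {
                    "type" : "custom",
                    "tokenizer" : "host_port_tokenizer",
                    "filter" : [ "lowercase" ]
                }
            }
        }
    },
    "mappings" : {
        "doc" : {
            "properties" : {
                "attachment" : {
                    "properties" : {
                        "content" : {
                            "type" : "text",
                            "analyzer" : "host_port_analyzer"
                        }
                    }
                }
            }
        }
    }
}

With such an analyzer, a value like http://192.168.31.167:9200/ keeps 192.168.31.167:9200 as a single token, so a plain match query finds it without any wildcard. Existing documents can then be copied into the new index with the _reindex API.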

Please choose whichever approach suits your business requirements better.
