
How to allow searching with special characters in Elasticsearch using the Attachment plugin?

I'm working on a Spring Boot / JHipster based project.

I'm using Elasticsearch 6.8.6 with the Attachment plugin. The content field holds my document's data.

Now, when I search for '192.168.31.167' it returns the appropriate result. But when I search for "192.168.31.167:9200" it returns an empty result.

In short, it's not working with special characters. Can someone guide me on how to deal with this?

Mapping:

{
  "document" : {
    "mappings" : {
      "doc" : {
        "properties" : {
          "attachment" : {
            "properties" : {
              "content" : {
                "type" : "text",
                "fields" : {
                  "keyword" : {
                    "type" : "keyword",
                    "ignore_above" : 256
                  }
                }
              },
              "content_length" : {
                "type" : "long"
              },
              "content_type" : {
                "type" : "text",
                "fields" : {
                  "keyword" : {
                    "type" : "keyword",
                    "ignore_above" : 256
                  }
                }
              }
            }
          },
          "content" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "type" : "keyword",
                "ignore_above" : 256
              }
            }
          },
          "createdDate" : {
            "type" : "date"
          },
          "holder" : {
            "type" : "long"
          },
          "id" : {
            "type" : "long"
          },
          "name" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "type" : "keyword",
                "ignore_above" : 256
              }
            }
          },
          "tag" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "type" : "keyword",
                "ignore_above" : 256
              }
            }
          }
        }
      }
    }
  }
}

Dummy Data:

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 3,
    "successful" : 3,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "document",
        "_type" : "doc",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "createdDate" : "2020-05-19T03:56:36+0000",
          "attachment" : {
            "content_type" : "text/plain; charset=ISO-8859-1",
            "content" : "version: '2'\nservices:\n  docy-kibana:\n    image: docker.elastic.co/kibana/kibana:6.8.6\n    ports:\n      - 5601:5601\n\n    environment:\n      SERVER_NAME: kibana.example.org\n      ELASTICSEARCH_HOSTS: http://192.168.31.167:9200/\n      XPACK_MONITORING_ENABLED: ${true}\n#      XPACK_ENCRYPTEDSAVEDOBJECTS.ENCRYPTIONKEY: test\n      XPACK_MONITORING_UI_CONTAINER_ELASTICSEARCH_ENABLED: ${true}",
            "content_length" : 390
          },
          "name" : "kibana_3_202005190926.yml",
          "holder" : 3,
          "id" : 1,
          "tag" : "configuration",
          "content" : "dmVyc2lvbjogJzInCnNlcnZpY2VzOgogIGRvY3kta2liYW5hOgogICAgaW1hZ2U6IGRvY2tlci5lbGFzdGljLmNvL2tpYmFuYS9raWJhbmE6Ni44LjYKICAgIHBvcnRzOgogICAgICAtIDU2MDE6NTYwMQoKICAgIGVudmlyb25tZW50OgogICAgICBTRVJWRVJfTkFNRToga2liYW5hLmV4YW1wbGUub3JnCiAgICAgIEVMQVNUSUNTRUFSQ0hfSE9TVFM6IGh0dHA6Ly8xOTIuMTY4LjMxLjE2Nzo5MjAwLwogICAgICBYUEFDS19NT05JVE9SSU5HX0VOQUJMRUQ6ICR7dHJ1ZX0KIyAgICAgIFhQQUNLX0VOQ1JZUFRFRFNBVkVET0JKRUNUUy5FTkNSWVBUSU9OS0VZOiB0ZXN0CiAgICAgIFhQQUNLX01PTklUT1JJTkdfVUlfQ09OVEFJTkVSX0VMQVNUSUNTRUFSQ0hfRU5BQkxFRDogJHt0cnVlfQo="
        }
      }
    ]
  }
}

Elasticsearch Request generated by code:

{
  "bool" : {
    "must" : [
      {
        "bool" : {
          "should" : [
            {
              "query_string" : {
                "query" : "*192.168.31.167:9200*",
                "fields" : [
                  "content^1.0",
                  "name^2.0",
                  "tag^3.0"
                ],
                "type" : "best_fields",
                "default_operator" : "or",
                "max_determinized_states" : 10000,
                "enable_position_increments" : true,
                "fuzziness" : "AUTO",
                "fuzzy_prefix_length" : 0,
                "fuzzy_max_expansions" : 50,
                "phrase_slop" : 0,
                "analyze_wildcard" : true,
                "escape" : false,
                "auto_generate_synonyms_phrase_query" : true,
                "fuzzy_transpositions" : true,
                "boost" : 1.0
              }
            },
            {
              "wildcard" : {
                "attachment.content" : {
                  "wildcard" : "*192.168.31.167:9200*",
                  "boost" : 1.0
                }
              }
            }
          ],
          "adjust_pure_negative" : true,
          "boost" : 1.0
        }
      },
      {
        "bool" : {
          "should" : [
            {
              "wildcard" : {
                "tag.keyword" : {
                  "wildcard" : "*information*",
                  "boost" : 1.0
                }
              }
            },
            {
              "wildcard" : {
                "tag.keyword" : {
                  "wildcard" : "*user*",
                  "boost" : 1.0
                }
              }
            }
          ],
          "adjust_pure_negative" : true,
          "boost" : 1.0
        }
      }
    ],
    "adjust_pure_negative" : true,
    "boost" : 1.0
  }
}

Problem:

You are querying a text field, which uses the standard analyzer and splits the text on :, as shown in the analyze API call below:

POST /_analyze
{
    "text" : "127.0.0.1:9200",
    "analyzer" : "standard"
}

Generated tokens:

{
    "tokens": [
        {
            "token": "127.0.0.1",
            "start_offset": 0,
            "end_offset": 9,
            "type": "<NUM>",
            "position": 0
        },
        {
            "token": "9200",
            "start_offset": 10,
            "end_offset": 14,
            "type": "<NUM>",
            "position": 1
        }
    ]
}
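
Because the wildcard query is a term-level query, the pattern has to match a single token in the inverted index, so a pattern that spans both tokens above can never match. For example (a sketch against the document index from your data), the query below finds nothing, whereas a pattern like *9200* would match:

GET /document/_search
{
    "query": {
        "wildcard": {
            "attachment.content": "*31.167:9200*"
        }
    }
}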

Solution - 1

Not optimized (a wildcard query on a bigger index can cause severe performance issues), but since you are already using wildcards it will work without changing the analyzer and reindexing the whole data (less overhead):

Use the .keyword sub-field, which is available on these text fields; it does not split the text into two tokens, as shown below:
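
For comparison, the keyword analyzer (which is effectively how a keyword field stores its value) keeps the whole string as a single token; the analyze call below reproduces the output that follows:

POST /_analyze
{
    "text" : "127.0.0.1:9200",
    "analyzer" : "keyword"
}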

{
    "tokens": [
        {
            "token": "127.0.0.1:9200",
            "start_offset": 0,
            "end_offset": 14,
            "type": "word",
            "position": 0
        }
    ]
}

You can add .keyword as shown below:

             "content.keyword^1.0",
              "name.keyword^2.0",
              "tag.keyword^3.0"

Solution - 2

Refer to the solution mentioned in the comment by @val, which involves creating a custom analyzer and reindexing the whole data. This creates the expected tokens in the index, so you can then search on them without using an expensive regex. It performs significantly better on large datasets, but comes with the one-time overhead of reindexing all the data with the new analyzer and adjusting the queries.
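
For illustration, a minimal sketch of such an index (the index name document_v2, the analyzer and tokenizer names, and the choice of a char_group tokenizer that splits only on whitespace and slashes are assumptions for this example; the actual analyzer suggested in the comment may differ):

PUT /document_v2
{
    "settings" : {
        "analysis" : {
            "tokenizer" : {
                "host_port_tokenizer" : {
                    "type" : "char_group",
                    "tokenize_on_chars" : [ "whitespace", "/" ]
                }
            },
            "analyzer" : {
                "host_port_analyzer" : {
                    "type" : "custom",
                    "tokenizer" : "host_port_tokenizer",
                    "filter" : [ "lowercase" ]
                }
            }
        }
    },
    "mappings" : {
        "doc" : {
            "properties" : {
                "attachment" : {
                    "properties" : {
                        "content" : {
                            "type" : "text",
                            "analyzer" : "host_port_analyzer"
                        }
                    }
                }
            }
        }
    }
}

With such an analyzer, a value like http://192.168.31.167:9200/ keeps 192.168.31.167:9200 as a single token, so a plain match query finds it without any wildcard. Existing documents can then be copied into the new index with the _reindex API.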

Please choose whichever approach suits your business requirements better.
