
Elasticsearch query_string: handling special characters

My database is synced with Elasticsearch to improve our search results and make requests faster.

I have an issue querying users: I want to look them up with a single query term, which can be part of a name, phone number, IP address, ...

My current query is:

query_string: { fields: ['id', 'email', 'firstName', 'lastName', 'phone', 'ip'], query: `*${escapeElastic(req.query.search.toString().toLowerCase())}*`}

Here req.query.search is my search term, and escapeElastic comes from the Node module elasticsearch-sanitize, which I added because I had issues with some symbols.

For example, if I query for an IPv6 address, I end up with query: '*2001\\:0db8*', which finds nothing in the database even though it should.

Another issue: if someone has the firstName john-doe, my query becomes query: '*john\\-doe*' and returns no results.

It seems that escaping prevents query errors but creates other issues in my case.

I do not know whether query_string is the best way to do this, and I am open to suggestions for optimizing this query.

Thanks

I suspect the analyzer on your fields is standard or similar. This means chars like : and - were stripped at index time, so the stored tokens no longer contain them:

GET _analyze
{
  "text": "John-Doe",
  "analyzer": "standard"
}

showing

{
  "tokens" : [
    {
      "token" : "john",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "doe",
      "start_offset" : 5,
      "end_offset" : 8,
      "type" : "<ALPHANUM>",
      "position" : 1
    }
  ]
}

Let's create our own analyzer that keeps the special chars but lowercases all other chars at the same time:

PUT multisearch
{
  "settings": {
    "analysis": {
      "analyzer": {
        "with_special_chars": {
          "tokenizer": "whitespace",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "firstName": {
        "type": "text",
        "fields": {
          "with_special_chars": {
            "type": "text",
            "analyzer": "with_special_chars"
          }
        }
      },
      "ip": {
        "type": "ip",
        "fields": {
          "with_special_chars": {
            "type": "text",
            "analyzer": "with_special_chars"
          }
        }
      }
    }
  }
}
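
To illustrate the difference between the two analyzers, here is a rough JavaScript sketch that mimics both tokenization strategies. It is only an approximation: the function names are hypothetical, and the real standard tokenizer follows Unicode text segmentation rules rather than this simple regular expression.

```javascript
// Rough stand-in for a "standard"-like analyzer: split on anything that is
// not a letter or digit, then lowercase. (Assumption: the real tokenizer is
// more nuanced; this only illustrates what happens to "John-Doe".)
function standardLikeTokens(text) {
  return text
    .split(/[^\p{L}\p{N}]+/u)
    .filter((token) => token.length > 0)
    .map((token) => token.toLowerCase());
}

// Stand-in for the custom analyzer: whitespace tokenizer + lowercase filter.
function whitespaceLowercaseTokens(text) {
  return text
    .split(/\s+/)
    .filter((token) => token.length > 0)
    .map((token) => token.toLowerCase());
}

console.log(standardLikeTokens("John-Doe"));        // [ 'john', 'doe' ]
console.log(whitespaceLowercaseTokens("John-Doe")); // [ 'john-doe' ]
```

Because the second variant keeps john-doe as a single token, a prefix or wildcard search containing the hyphen has something to match against.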

Ingesting 2 sample docs:

POST multisearch/_doc
{
  "ip": "2001:0db8:85a3:0000:0000:8a2e:0370:7334"
}

POST multisearch/_doc
{
   "firstName": "John-Doe"
}

and applying your query from above:

GET multisearch/_search
{
  "query": {
    "query_string": {
      "fields": [
        "id",
        "email",
        "firstName.with_special_chars",
        "lastName",
        "phone",
        "ip.with_special_chars"
      ],
      "query": "2001\\:0db8* OR john-*"
    }
  }
}

both hits are returned.


Two remarks: 1) note that we were searching .with_special_chars instead of the main fields and 2) I've removed the leading wildcard from the ip -- those are highly inefficient.
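
On the Node side, the request body could be built roughly like this. It is a sketch under a few assumptions: escapeQueryString and buildQueryString are hypothetical helpers (not the elasticsearch-sanitize implementation), the search term is a single token without spaces, and only a trailing wildcard is added.

```javascript
// Reserved query_string characters that can be backslash-escaped.
// "<" and ">" cannot be escaped at all, so this sketch removes them.
const RESERVED = /(\+|-|=|&&|\|\||!|\(|\)|\{|\}|\[|\]|\^|"|~|\*|\?|:|\\|\/)/g;

// Hypothetical helper: escape user input for use inside query_string.
function escapeQueryString(input) {
  return input.replace(/[<>]/g, "").replace(RESERVED, "\\$1");
}

// Hypothetical helper: build the query against the .with_special_chars
// sub-fields, with a trailing wildcard only (no leading "*").
function buildQueryString(search) {
  const term = escapeQueryString(search.trim().toLowerCase());
  return {
    query_string: {
      fields: [
        "id",
        "email",
        "firstName.with_special_chars",
        "lastName",
        "phone",
        "ip.with_special_chars",
      ],
      query: `${term}*`,
    },
  };
}

const body = buildQueryString("2001:0db8");
console.log(body.query_string.query);                 // 2001\:0db8*
console.log(JSON.stringify(body.query_string.query)); // "2001\\:0db8*"
```

The second log line shows why the JSON above contains a double backslash: it is just the JSON encoding of a single backslash in the actual query text.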


Final tips since you asked for optimization suggestions: the query could be rewritten as

GET multisearch/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "term": {
            "id": "tegO63EBG_KW3EFnvQF8"
          }
        },
        {
          "match": {
            "email": "john@doe.com"
          }
        },
        {
          "match_phrase_prefix": {
            "firstName.with_special_chars": "john-d"
          }
        },
        {
          "match_phrase_prefix": {
            "lastName.with_special_chars": "john-d"
          }
        },
        {
          "match": {
            "phone.with_special_chars": "+151351"
          }
        },
        {
          "wildcard": {
            "ip.with_special_chars": {
              "value": "2001\\:0db8*"
            }
          }
        }
      ]
    }
  }
}

  1. Partial id matching is probably overkill -- either the term query catches it or it doesn't.
  2. email can simply be matched.
  3. firstName & lastName: I suspect match_phrase_prefix is more performant than wildcard or regexp, so I'd go with that (as long as you don't need the leading *).
  4. phone can be matched too, but make sure special chars are searchable as well (if you use the international format).
  5. Use wildcard for the ip -- same syntax as in the query string.
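
Since your search comes in as a single string, the bool query could be assembled in Node along these lines. This is a minimal sketch, not a definitive implementation: buildUserSearch and escapeWildcard are hypothetical names, and it assumes lastName and phone also get a with_special_chars sub-field (the mapping above only defines one for firstName and ip).

```javascript
// Hypothetical helper: escape the characters that are special inside a
// wildcard query value (*, ? and the backslash itself).
function escapeWildcard(input) {
  return input.replace(/[*?\\]/g, "\\$&");
}

// Hypothetical helper: one should-clause per field, all fed from the same term.
function buildUserSearch(search) {
  const raw = search.trim();
  const term = raw.toLowerCase();
  return {
    bool: {
      should: [
        // Exact id lookup: the raw input is used because ids may be case-sensitive.
        { term: { id: raw } },
        { match: { email: term } },
        { match_phrase_prefix: { "firstName.with_special_chars": term } },
        // Assumption: this sub-field exists in your mapping.
        { match_phrase_prefix: { "lastName.with_special_chars": term } },
        // Assumption: this sub-field exists in your mapping.
        { match: { "phone.with_special_chars": term } },
        // Trailing wildcard only; leading wildcards are expensive.
        { wildcard: { "ip.with_special_chars": { value: `${escapeWildcard(term)}*` } } },
      ],
    },
  };
}

const query = buildUserSearch("John-D");
console.log(JSON.stringify(query, null, 2));
```

The resulting object would go under the "query" key of the search request body.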

Try the above and see if you notice any speed improvements!

The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint them, please cite the site URL or the original address. For any questions, please contact: yoyou2525@163.com.

 